**Speaker Recognition: Advancements and Challenges**

Homayoon Beigi

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/52023

## 1. Introduction

Speaker Recognition is a multi-disciplinary branch of biometrics that may be used for *identification*, *verification*, and *classification* of individual speakers, with the capability of *tracking*, *detection*, and *segmentation* by extension. Recently, a comprehensive book on all aspects of speaker recognition was published [1]. Therefore, here we are not concerned with the details of the standard modeling which has been used for the recognition task. Instead, we present a review of the most recent literature and briefly visit the latest techniques being deployed in the various branches of this technology.

Most of the works reviewed here have been published in the last two years. Some of the topics, such as alternative features and modeling techniques, are general and apply to all branches of speaker recognition. Others, such as the treatment of whispered speech, concern special forms of audio which have not received ample attention in the past. Finally, we will follow with a look at advancements which apply to specific branches of speaker recognition [1], such as verification, identification, classification, and diarization.

This chapter is meant to complement the summary of speaker recognition presented in [2], which provided an overview of the subject. It is also intended as an update on the methods described in [1]. In the next section, for the sake of completeness, a brief history of speaker recognition is presented, followed by sections on specific progress as stated above: first globally applicable treatments and methods, and then techniques related to specific branches of speaker recognition.

## 2. A brief history

The topic of speaker recognition [1] has been under development since the mid-twentieth century. The earliest known papers on the subject, published in the 1950s [3, 4], sought personal traits of the speakers by analyzing their speech, with some statistical underpinning. With the advent of early communication networks, *Pollack, et al.* [3] noted the need for speaker identification. However, they employed human listeners to do the identification of individuals, and studied the importance of the duration of speech and other facets that help in the recognition of a speaker. In most of the early activities, a text-dependent analysis was made, in order to simplify the task of identification. In 1959, not long after Pollack's analysis, *Shearme, et al.* [4] started comparing the formants of speech, in order to facilitate the identification process. However, a human expert would still do the analysis. This first incarnation of speaker recognition, namely using human expertise, has been used to date, in order to handle forensic speaker identification [5, 6]. This class of approaches has been improved and used in a variety of criminal and forensic analyses by legal experts [7, 8].

Although it is always important to have a human expert available for important cases, such as those in forensic applications, the need for an automatic approach to speaker recognition was soon established. *Pruzansky, et al.* [9, 10] started by looking at an automatic statistical comparison of speakers using a text-dependent approach. This was done by analyzing a population of 10 speakers uttering several unique words. However, it is well understood that, at least for speaker identification, a text-dependent analysis is not practical in the least [1]. Nevertheless, there are cases where there is some merit to a text-dependent analysis for the speaker verification problem, usually when computational resources are limited and/or obtaining speech samples longer than a couple of seconds is not feasible.

To date, the most prevalent modeling techniques are still the Gaussian mixture model (GMM) and support vector machine (SVM) approaches. Neural networks and other types of classifiers have also been used, although not in significant numbers. In the next two sections, we will briefly recap the GMM and SVM approaches. See *Beigi* [1] for a detailed treatment of these and other classifiers.

#### 2.1. Gaussian Mixture Model (GMM) recognizers

In a GMM recognition engine, the models are the parameters for collections of multi-variate normal density functions which describe the distribution of the features [1] for speakers' enrollment data. The best results have been shown, on many occasions and by many research projects, to come from the use of Mel-Frequency Cepstral Coefficient (MFCC) features [1]. However, later we will review other features which may perform better in certain special cases.

The *Gaussian mixture model* (*GMM*) is a model that expresses the probability density function of a random variable in terms of a weighted sum of its components, each of which is described by a *Gaussian* (*normal*) density function. In other words,

$$p(\mathbf{x}|\boldsymbol{\phi}) = \sum\_{\gamma=1}^{\Gamma} p(\mathbf{x}|\boldsymbol{\theta}\_{\gamma}) P(\boldsymbol{\theta}\_{\gamma}) \tag{1}$$

where the supervector of parameters, ϕ, is defined as an augmented set of Γ vectors constituting the free parameters associated with the Γ mixture components, θγ, γ ∈ {1, 2, ··· , Γ}, and the Γ − 1 mixture weights, *P*(θ = θγ), γ ∈ {1, 2, ··· , Γ − 1}, which are the prior probabilities of each of these mixture components, known as the *mixing distribution* [11].

The parameter vectors associated with each mixture component, in the case of the Gaussian mixture model, are the parameters of the normal density function,

$$\boldsymbol{\theta}\_{\gamma} = \begin{bmatrix} \boldsymbol{\mu}\_{\gamma}^T & \mathbf{u}^T(\boldsymbol{\Sigma}\_{\gamma}) \end{bmatrix}^T \tag{2}$$

where the *unique parameters* vector, u(Σγ), is obtained by an invertible transformation that stacks all the free parameters of a matrix into vector form. For example, if Σγ is a full covariance matrix, then u(Σγ) is the vector of the elements in the upper triangle of Σγ, including the diagonal elements. On the other hand, if Σγ is a diagonal matrix, then,

$$\left(\mathbf{u}(\boldsymbol{\Sigma}\_{\gamma})\right)\_d \stackrel{\Delta}{=} \left(\boldsymbol{\Sigma}\_{\gamma}\right)\_{dd} \quad \forall\ d \in \{1, 2, \cdots, D\} \tag{3}$$

Therefore, we may always reconstruct Σγ from uγ using the inverse transformation,

$$\boldsymbol{\Sigma}\_{\gamma} = \mathbf{u}^{-1}(\mathbf{u}\_{\gamma}) \tag{4}$$

The parameter vector for the mixture model may be constructed as follows,


$$\boldsymbol{\phi} \stackrel{\Delta}{=} \begin{bmatrix} \boldsymbol{\mu}\_1^T & \cdots & \boldsymbol{\mu}\_\Gamma^T & \mathbf{u}\_1^T & \cdots & \mathbf{u}\_\Gamma^T & p(\boldsymbol{\theta}\_1) & \cdots & p(\boldsymbol{\theta}\_{\Gamma-1}) \end{bmatrix}^T \tag{5}$$

where only (Γ − 1) mixture coefficients (prior probabilities), *p*(θγ), are included in ϕ, due to the constraint that

$$\sum\_{\gamma=1}^{\Gamma} p(\boldsymbol{\theta}\_{\gamma}) = 1 \tag{6}$$

Thus the number of free parameters in the prior probabilities is only Γ−1.
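
To make the bookkeeping concrete, the following is a minimal sketch (in Python with NumPy, assuming diagonal covariance matrices) of the supervector construction of Equation 5 and the free-parameter count implied by Equation 6; the array shapes and function name are illustrative only.

```python
import numpy as np

def pack_supervector(means, diag_covs, weights):
    """Stack GMM parameters into the supervector of Equation 5.

    means:     (Gamma, D) mixture means, mu_gamma
    diag_covs: (Gamma, D) diagonal covariances, u(Sigma_gamma), Equation 3
    weights:   (Gamma,)  priors p(theta_gamma), summing to 1 (Equation 6)
    """
    weights = np.asarray(weights)
    assert np.isclose(weights.sum(), 1.0)   # constraint of Equation 6
    # Only Gamma - 1 weights are free parameters; the last one is implied.
    return np.concatenate([means.ravel(), diag_covs.ravel(), weights[:-1]])

# Gamma = 8 mixtures of dimension D = 13 with diagonal covariances give
# 8*13 + 8*13 + 7 = 215 free parameters.
phi = pack_supervector(np.zeros((8, 13)), np.ones((8, 13)), np.full(8, 1 / 8))
print(phi.shape)   # (215,)
```

For full covariance matrices, u(Σγ) would instead stack the D(D + 1)/2 upper-triangle elements of each Σγ, as described after Equation 2.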

For a sequence of *independent and identically distributed* (*i.i.d.*) observations, {x}_1^N, the log-likelihood of the sequence may be written as follows,

$$\ell(\boldsymbol{\phi}|\{\mathbf{x}\}\_{1}^{N}) = \ln \left( \prod\_{n=1}^{N} p(\mathbf{x}\_{n}|\boldsymbol{\phi}) \right) = \sum\_{n=1}^{N} \ln p(\mathbf{x}\_{n}|\boldsymbol{\phi}) \tag{7}$$

Assuming the mixture model defined by Equation 1, the likelihood of the sequence, {x}_1^N, may be written in terms of the mixture components,

$$\ell(\boldsymbol{\phi}|\{\mathbf{x}\}\_{1}^{N}) = \sum\_{n=1}^{N} \ln \left( \sum\_{\gamma=1}^{\Gamma} p(\mathbf{x}\_{n}|\boldsymbol{\theta}\_{\gamma}) P(\boldsymbol{\theta}\_{\gamma}) \right) \tag{8}$$

Since maximizing Equation 8 requires the maximization of the logarithm of a sum, we can utilize the incomplete data approach that is used in the development of the *EM algorithm* to simplify the solution. *Beigi* [1] shows the derivation of the incomplete data equivalent of the maximization of Equation 8 using the *EM algorithm*.
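
For illustration, a single EM iteration for a diagonal-covariance GMM might look as follows. This is a bare-bones Python/NumPy/SciPy sketch, not the full derivation in [1]; it omits initialization, convergence testing, and numerical safeguards such as variance flooring.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, weights):
    """One EM iteration for a diagonal-covariance GMM.

    X: (N, D) observations; means, covs: (G, D); weights: (G,).
    A schematic sketch of the incomplete-data maximization of Equation 8.
    """
    N, G = X.shape[0], len(weights)
    resp = np.empty((N, G))
    # E-step: responsibilities p(theta_gamma | x_n, phi)
    for g in range(G):
        resp[:, g] = weights[g] * multivariate_normal.pdf(
            X, mean=means[g], cov=np.diag(covs[g]))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the weights, means, and diagonal covariances
    Ng = resp.sum(axis=0)                  # soft counts per mixture component
    new_weights = Ng / N
    new_means = (resp.T @ X) / Ng[:, None]
    new_covs = np.stack([
        (resp[:, g, None] * (X - new_means[g]) ** 2).sum(axis=0) / Ng[g]
        for g in range(G)])
    return new_means, new_covs, new_weights
```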

Each multivariate distribution is represented by Equation 9.

$$p(\mathbf{x}|\boldsymbol{\theta}\_{\gamma}) = \frac{1}{(2\pi)^{\frac{D}{2}} \left| \boldsymbol{\Sigma}\_{\gamma} \right|^{\frac{1}{2}}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}\_{\gamma})^T \boldsymbol{\Sigma}\_{\gamma}^{-1} (\mathbf{x} - \boldsymbol{\mu}\_{\gamma}) \right\} \tag{9}$$

where x, µγ ∈ R^D and Σγ : R^D ↦ R^D.

In Equation 9, µγ is the mean vector for cluster γ computed from the vectors in that cluster, where,

$$\boldsymbol{\mu}\_{\gamma} \stackrel{\Delta}{=} \mathcal{E}\left\{ \mathbf{x} \right\} \stackrel{\Delta}{=} \int\_{-\infty}^{\infty} \mathbf{x}\, p(\mathbf{x}) d\mathbf{x} \tag{10}$$

The *sample mean* approximation for Equation 10 is,

$$\boldsymbol{\mu}\_{\gamma} \approx \frac{1}{N} \sum\_{i=1}^{N} \mathbf{x}\_{i} \tag{11}$$


where *N* is the number of samples and x*i* are the MFCC [1].

The *Covariance* matrix is defined as,

$$\boldsymbol{\Sigma}\_{\gamma} \stackrel{\Delta}{=} \mathcal{E}\left\{ \left( \mathbf{x} - \mathcal{E}\left\{ \mathbf{x} \right\} \right) \left( \mathbf{x} - \mathcal{E}\left\{ \mathbf{x} \right\} \right)^{T} \right\} = \mathcal{E}\left\{ \mathbf{x}\mathbf{x}^{T} \right\} - \boldsymbol{\mu}\_{\gamma} \boldsymbol{\mu}\_{\gamma}^{T} \tag{12}$$

The diagonal elements of Σγ are the variances of the individual dimensions of x. The off-diagonal elements are the covariances across the different dimensions.

The *unbiased estimate* of Σγ, denoted by Σ̃γ, is given by the following,

$$\tilde{\boldsymbol{\Sigma}}\_{\gamma} = \frac{1}{N-1} \left[ \mathbf{S}\_{\gamma}|\_{N} - N(\boldsymbol{\mu}\_{\gamma} \boldsymbol{\mu}\_{\gamma}^{T}) \right] \tag{13}$$

where the *sample mean*, µγ, is given by Equation 11 and the *second order sum matrix* (*scatter matrix*), Sγ|N, is given by,

$$\mathbf{S}\_{\gamma}|\_N \stackrel{\Delta}{=} \sum\_{i=1}^N \mathbf{x}\_i \mathbf{x}\_i^T \tag{14}$$

Therefore, in a general GMM model, the above statistical parameters are computed and stored for the set of Gaussians along with the corresponding mixture coefficients, to represent each speaker. The features used by the recognizer are *Mel-Frequency Cepstral Coefficients* (*MFCC*). *Beigi* [1] describes details of such a GMM-based recognizer.
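
As a small illustration of Equations 11 through 14, the per-cluster statistics can be accumulated as follows (a Python/NumPy sketch; the function name and the assumption that cluster membership is already known are for illustration only).

```python
import numpy as np

def cluster_statistics(X):
    """Statistics stored per Gaussian (Equations 11 through 14).

    X: (N, D) array of MFCC vectors assigned to one cluster, gamma.
    """
    N = X.shape[0]
    mu = X.mean(axis=0)                            # sample mean, Equation 11
    S = X.T @ X                                    # scatter matrix, Equation 14
    sigma = (S - N * np.outer(mu, mu)) / (N - 1)   # unbiased cov., Equation 13
    return mu, S, sigma
```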

#### 2.2. Support Vector Machine (SVM) recognizers

In general, *SVM*s are formulated as *two-class* classifiers. Γ-class classification problems are usually reduced to Γ two-class problems [12], where the γ*th* two-class problem compares the γ*th* class with the rest of the classes combined. There are also other generalizations of the SVM formulation which are geared toward handling Γ-class problems directly. *Vapnik* has proposed such formulations in Section 10.10 of his book [12]. He also credits *M. Jaakkola* and *C. Watkins, et al.* for having proposed similar generalizations independently. For such generalizations, the constrained optimization problem becomes much more complex. For this reason, the approximation using a set of Γ two-class problems has been preferred in the literature. It has the characteristic that if a data point is accepted by the decision functions of more than one class, it is deemed *not classified*. Furthermore, it is also not classified if no decision function claims that data point to be in its class. This characteristic has both positive and negative connotations: it allows for better rejection of outliers, but it may also be viewed as giving up on handling outliers.

In application to speaker recognition, experimental results have shown that *SVM* implementations of speaker recognition may perform similarly or sometimes even be slightly inferior to the less complex and less resource intensive *GMM* approaches. However, it has also been noted that systems which combine *GMM* and *SVM* approaches often enjoy a higher accuracy, suggesting that part of the information revealed by the two approaches may be complementary [13].

The problem of *overtraining* (*overfitting*) plagues many learning techniques, and it has been one of the driving factors for the development of support vector machines [1]. In the process of developing the concept of *capacity* and eventually SVM, *Vapnik* considered the generalization capacity of learning machines, especially *neural networks*. The main goal of support vector machines is to maximize the generalization capability of the learning algorithm, while keeping good performance on the training patterns. This is the basis for the *Vapnik-Chervonenkis theory* (*VC theory*) [12], which computes bounds on the risk, *R*(α), according to the definition of the *VC dimension* and the *empirical risk* – see *Beigi* [1].

The multiclass classification problem is also quite important, since it is the basis for the speaker identification problem. In Section 10.10 of his book, *Vapnik* [12] proposed a simple approach in which one class is compared to all other classes, and this is done for each class. This approach converts a Γ-class problem to Γ two-class problems. It is the most popular approach for handling multi-class SVM and has been dubbed the *one-against-all* (also known as *one-against-rest*) *approach* [1]. There is also the *one-against-one approach*, which transforms the problem into Γ(Γ − 1)/2 two-class SVM problems. In Section 6.2.1 we will see more recent techniques for handling multi-class SVM.
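
As an illustration of the one-against-all decomposition, here is a sketch using scikit-learn, assuming one fixed-length feature vector per utterance; the shapes and data are placeholders.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data: one fixed-length vector per utterance (e.g., averaged
# MFCCs or a GMM supervector) and integer speaker labels.
X_train = np.random.randn(200, 39)
y_train = np.random.randint(0, 10, 200)

# Gamma-class speaker ID decomposed into Gamma two-class problems.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
predicted_speaker = clf.predict(np.random.randn(1, 39))
```

Note that scikit-learn's `SVC` natively uses the one-against-one decomposition internally for multi-class data, so the explicit `OneVsRestClassifier` wrapper is what realizes the Γ two-class formulation described above.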

## 3. Challenging audio


One of the most important challenges in speaker recognition stems from inconsistencies in the different types of audio and their quality. One such problem, which has been the focus of most research and publications in the field, is the problem of channel mismatch, in which the enrollment audio has been gathered using one apparatus and the test audio has been produced by a different channel. It is important to note that the sources of mismatch vary and are generally quite complicated. They could be any combination of, and are usually not limited to, mismatches in the handset or recording apparatus, network capacity and quality, noise conditions, illness-related conditions, stress-related conditions, transitions between different media, etc. Some approaches involve normalization of some kind, either to transform the data (raw or in the feature space) or to transform the model parameters. Chapter 18 of *Beigi* [1] discusses many different channel compensation techniques for resolving this issue. *Vogt, et al.* [14] provide a good coverage of methods for handling modeling mismatch.

One such problem is to obtain ample coverage for the different types of phonation in the training and enrollment phases, in order to have better performance in situations where different phonation types are uttered. An example is the handling of whispered phonation, which is, in general, very hard to collect and is not available under natural speech scenarios. Whisper is normally used by individuals who desire more privacy. This may happen under normal circumstances, when the user is on a telephone and does not want others to hear his/her conversation, or does not wish to bother others in the vicinity, while interacting with the speaker recognition system. In Section 3.1, we will briefly review the different styles of phonation. Section 3.2 will then cover some work which has been done, in order to be able to handle whispered speech.


Another challenging issue with audio is handling multiple speakers with possibly overlapping speech. The most difficult scenario would be the presence of multiple speakers on a single microphone, say a telephone handset, where each speaker is producing a similar level of audio at the same time. This type of cross-talk is very hard to handle, and indeed it is very difficult to identify the different speakers while they speak simultaneously. A somewhat simpler scenario is the one which generally happens in a conference setting, in a room, in which case a far-field microphone (or microphone array) is capturing the audio. When multiple speakers speak in such a setting, there are some solutions which have worked out well in reducing the interference of other speakers, when focusing on the speech of a certain individual. In Section 3.4, we will review some work that has been done in this field.

#### 3.1. Different styles of phonation

*Phonation* deals with the acoustic energy generated by the vocal folds at the larynx. The different kinds of phonation are *unvoiced*, *voiced*, and *whisper*.

*Unvoiced phonation* may be either in the form of *nil phonation*, which corresponds to zero energy, or *breath phonation*, which is based on relaxed vocal folds passing a turbulent air stream.

The majority of voiced sounds are generated through *normal voiced* phonation, which happens when the vocal folds vibrate at a periodic rate and generate a certain resonance in the upper chamber of the vocal tract. Another category of voiced phonation is called *laryngealization* (*creaky voice*). It occurs when the arytenoid cartilages fix the posterior portion of the vocal folds, only allowing the anterior part of the vocal folds to vibrate. Yet another type of voiced phonation is *falsetto*, which is basically the unnatural creation of a high-pitched voice by tightening the basic shape of the vocal folds to achieve a false high pitch.

In another view, the emotional condition of the speaker may affect his/her phonation. For example, speech under stress may manifest different phonetic qualities than that of, so-called, *neutral speech* [15]. Whispered speech also changes the general condition of phonation. It is thought that this does not affect unvoiced consonants as much. In Sections 3.2 and 3.3 we will briefly look at whispered speech and speech under stressful conditions.

#### 3.2. Treatment of whispered speech

Whispered phonation happens when the speaker acts as if generating a voiced phonation, with the exception that the vocal folds are made more relaxed, so that a greater flow of air can pass through them, generating more of a turbulent airstream compared to a voiced resonance. However, the vocal folds are not relaxed enough to generate an unvoiced phonation.

As early as the first known paper on speaker identification [3], the challenges of whispered speech were apparent. The general text-independent analysis of speaker characteristics relies mainly on *normal voiced phonation* as the primary source of speaker-dependent information [1]. This is due to the high-energy periodic signal which is generated, with rich resonance information. Normally, very little natural whisper data is available for training. However, in some languages, such as *Amerindian languages* (languages spoken by native inhabitants of the Americas, e.g., *Comanche* [16] and *Tlingit*, spoken in Alaska) and some old languages, voiceless vocoids exist and carry independent meaning from their voiced counterparts [1].


An example of a whispered phone in English is the *egressive pulmonic whisper* [1] which is the sound that an [h] makes in the word, "home." However, any utterance may be produced by relaxing the vocal folds and generating a whispered version of the utterance. This partial relaxation of the vocal folds can significantly change the vocal characteristics of the speaker. Without ample data in whisper mode, it would be hard to identify the speaker.

*Pollack, et al.* [3] state that about three times as much speech is needed for whispered speech in order to obtain an accuracy equivalent to that of normal speech. This assessment was made according to a comparison, done using human listeners and identical speech content, as well as an attempted equivalence in the recording volume levels.

*Jin, et al.* [17] deal with the insufficient amount of whisper data by creating two GMM models for each individual, assuming that ample data is available for the normal-speech mode for any target speaker. Then, in the test phase, they use the *frame-based score competition* (*FSC*) method, comparing each frame of audio to the two models for every speaker (normal and whispered) and only using the result for that frame, from the model which produces the higher score. Otherwise, they continue with the standard process of recognition.
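
A minimal sketch of the FSC scoring rule might look as follows (Python; `score_samples` follows the style of scikit-learn's `GaussianMixture` API, which is an assumption of this sketch, not the authors' implementation).

```python
import numpy as np

def fsc_utterance_score(frames, normal_gmm, whisper_gmm):
    """Frame-based score competition (FSC), after Jin, et al. [17].

    For every frame, score both of a speaker's models (normal and
    whispered) and keep only the higher frame log-likelihood.
    """
    ll_normal = normal_gmm.score_samples(frames)    # (N,) per-frame log-likelihoods
    ll_whisper = whisper_gmm.score_samples(frames)
    return np.maximum(ll_normal, ll_whisper).sum()  # utterance-level score
```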

*Jin, et al.* [17] conducted experiments on whispered speech when almost no whisper data was available for the enrollment phase. The experiments showed that noise greatly impacts recognition with whispered speech. They also concentrate on using a throat microphone, which happens to be more robust to noise, but which also picks up more resonance for whispered speech. In general, using the two-model approach with FSC, they show a significant reduction in the error rate [17].

*Fan, et al.* [18] have looked into the differences between whisper and neutral speech. By neutral speech, they mean normal speech which is recorded in a modal (voiced) speech setting in a quiet recording studio. They use the fact that the unvoiced consonants are quite similar in the two types of speech and that most of the differences stem from the remaining phones. Using this, they separate whispered speech into two parts. The first part includes all the unvoiced consonants, and the second part includes the rest of the phones. Furthermore, they show better performance for unvoiced consonants in the whispered speech, when using *linear frequency cepstral coefficients* (*LFCC*) and *exponential frequency cepstral coefficients* (*EFCC*) – see Section 4.3. In contrast, the rest of the phones show better performance with MFCC features. Therefore, they detect *unvoiced consonants* and treat them using LFCC/EFCC features. They send the rest of the phones (e.g., voiced consonants, vowels, diphthongs, triphthongs, glides, liquids) through an MFCC-based system. Then they combine the scores from the two segments to make a speaker recognition decision.

The unvoiced consonant detection proposed by [18] uses two measures for determining the frames stemming from unvoiced consonants. For each frame, *l*, the energy of the frame in the lower part of the spectrum, E_l^(l), and that of the higher part of the band, E_l^(h) (for f ≤ 4000 Hz and 4000 Hz < f ≤ 8000 Hz, respectively), are computed, along with the total energy of the frame, E_l, to be used for normalization. The relative energy of the lower frequency band is then computed for each frame by Equation 15.

$$R\_l = \frac{E\_l^{(l)}}{E\_l} \tag{15}$$


It is assumed that most of the spectral energy of unvoiced consonants is concentrated in the higher half of the frequency spectrum, compared to the rest of the phones. In addition, the Jeffreys' divergence [1] of the higher portion of the spectrum, relative to the previous frame, is computed using Equation 16.

$$\mathcal{D}\_J(l \leftrightarrow l-1) = -P\_{l-1}^{(h)} \log\_2(P\_l^{(h)}) - P\_l^{(h)} \log\_2(P\_{l-1}^{(h)}) \tag{16}$$

where

$$P\_l^{(h)} \stackrel{\Delta}{=} \frac{E\_l^{(h)}}{E\_l} \tag{17}$$


Two separate thresholds may be set for *Rl* and D*J*(*l* ↔ *l* − 1), in order to detect unvoiced consonants from the rest of the phones.
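
A rough sketch of such a detector follows (Python/NumPy; the 16 kHz sampling rate, the threshold values, and the direction of each comparison are illustrative assumptions, since [18] tunes these on data).

```python
import numpy as np

def unvoiced_consonant_flags(frames, fs=16000, r_thresh=0.5, dj_thresh=0.1):
    """Flag frames likely to stem from unvoiced consonants (Eqs. 15-17).

    frames: (L, M) array of windowed time-domain frames sampled at fs.
    """
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    E_low = spec[:, freqs <= 4000].sum(axis=1)                      # E_l^(l)
    E_high = spec[:, (freqs > 4000) & (freqs <= 8000)].sum(axis=1)  # E_l^(h)
    E_tot = E_low + E_high                                          # E_l
    R = E_low / E_tot                                               # Equation 15
    P = np.clip(E_high / E_tot, 1e-12, 1.0)                         # Equation 17
    # Jeffreys' divergence between consecutive frames (Equation 16)
    DJ = np.zeros(len(P))
    DJ[1:] = -P[:-1] * np.log2(P[1:]) - P[1:] * np.log2(P[:-1])
    # Unvoiced consonants concentrate energy in the upper band (low R_l),
    # and a jump in D_J marks the transition into such a frame.
    return (R < r_thresh) & (DJ > dj_thresh)
```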

#### 3.3. Speech under stress

As noted earlier, phonation undergoes certain changes when the speaker is under stressful conditions. *Bou-Ghazale, et al.* [15] have shown that this may affect the significance of certain frequency bands, making MFCC features miss certain nuances in the speech of the individual under stress. They propose a new frequency scale, which they call the *exponential-logarithmic* (*expo-log*) scale. In Section 4.3 we will describe this scale in more detail, since it is also used by *Fan, et al.* [18] to handle unvoiced consonants. On another note, although research has generally shown that cepstral coefficients derived from the FFT are more robust for the handling of neutral speech [19], *Bou-Ghazale, et al.* [15] suggest that for speech recorded under stressful conditions, cepstral coefficients derived from the linear predictive model [1] perform better.

#### 3.4. Multiple sources of speech and far-field audio capture

This problem has been addressed in the presence of microphone arrays, to handle cases when sources are semi-stationary in a room, say in a conference environment. The main goal would amount to extracting the source(s) of interest from a set of many sources of audio and to reduce the interference from other sources in the process [20]. For instance, *Kumatani, et al.* [21] address the problem using the so-called beamforming technique [20, 22] for two speakers speaking simultaneously in a room. They construct a generalized sidelobe canceler (GSC) for each source and adjust the active weight vectors of the two GSCs to extract two speech signals with *minimum mutual information* [1] between the two. Of course, this makes a few essential assumptions which may not be true in most situations. The first assumption is that the number of speakers is known. The second assumption is that they are semi-stationary and sitting at different angles from the microphone array. *Kumatani, et al.* [21] show performance results on the far-field PASCAL speech separation challenge, by performing speech recognition trials.
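
The GSC with a minimum-mutual-information criterion is beyond a short example, but its simplest relative, the delay-and-sum beamformer, already shows the underlying idea of steering an array toward one speaker (a Python/NumPy sketch; integer-sample alignment and known steering delays are simplifying assumptions).

```python
import numpy as np

def delay_and_sum(mics, delays, fs):
    """Delay-and-sum beamformer, a simple relative of the GSC used in [21]:
    align each microphone on the steering delays toward the desired
    speaker, then average, attenuating sources from other directions.

    mics:   (M, N) array, one row per microphone
    delays: (M,) steering delays in seconds toward the desired source
    """
    M, _ = mics.shape
    out = np.zeros(mics.shape[1])
    for m in range(M):
        shift = int(round(delays[m] * fs))
        out += np.roll(mics[m], -shift)   # crude integer-sample alignment
    return out / M
```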

One important part of the above task is to localize the speakers. *Takashima, et al.* [23] use an HMM-based approach to separate the acoustic transfer functions so that they can separate the sources using a single microphone. This is done by using an HMM of the speech of each speaker to estimate the acoustic transfer function from each position in the room. They have experimented with up to 9 different source positions and have shown that their localization accuracy decreases with an increasing number of positions.

#### 3.5. Channel mismatch


Many publications deal with the problem of channel mismatch, since it is the most important challenge in speaker recognition. Early approaches to the treatment of this problem concentrated on normalization of the features or the score. *Vogt, et al.* [14] present a good coverage of different normalization techniques. *Barras, et al.* [24] compare cepstral mean subtraction (CMS) and variance normalization, Feature Warping, T-Norm, Z-Norm and the cohort methods. Later approaches started by using techniques from factor analysis or discriminant analysis to transform features such that they convey the most information about speaker differences and least about channel differences. Most GMM techniques use some variation of *joint factor analysis* (*JFA*) [25]. An offshoot of JFA is the i-vector technique which does away with the channel part of the model and falls back toward a PCA approach [26]. See Section 5.1 for more on the i-vector approach.
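
Of the normalization methods above, cepstral mean subtraction is the simplest to state; with per-utterance variance normalization it reads as follows (a Python/NumPy sketch of the generic technique, not any particular paper's variant).

```python
import numpy as np

def cms_with_variance_norm(features):
    """Cepstral mean subtraction (CMS) plus variance normalization.

    features: (N, D) matrix of MFCC vectors from one utterance/channel.
    Subtracting the per-utterance cepstral mean cancels a stationary
    convolutive channel; dividing by the standard deviation adds the
    variance normalization compared by Barras, et al. [24].
    """
    mu = features.mean(axis=0)
    sd = features.std(axis=0) + 1e-10   # guard against zero variance
    return (features - mu) / sd
```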

SVM systems use techniques such as *nuisance attribute projection* (*NAP*) [27]. NAP [13] modifies the original *kernel*, used for a *support vector machine* (*SVM*) formulation, to one with the ability of telling specific channel information apart. The premise behind this approach is that, by doing so in both the training and recognition stages, the system will not have the ability to distinguish channel-specific information. This channel-specific information is what is dubbed nuisance by *Solomonoff, et al.* [13]. *NAP* is a projection technique which assumes that most of the information related to the channel is stored in specific low-dimensional subspaces of the higher dimensional space to which the original features are mapped. Furthermore, these regions are assumed to be somewhat distinct from the regions which carry speaker information. This is quite similar to the idea of *joint factor analysis*. *Seo, et al.* [28] use the statistics of the eigenvalues of background speakers to come up with discriminative weights for each background speaker and to decide on the between-class scatter matrix and the within-class scatter matrix.
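
In sketch form, NAP amounts to removing a low-rank nuisance subspace by an orthogonal projection. The construction of the nuisance basis below (an SVD of channel-dominated directions) is a schematic assumption, since implementations differ in how that subspace is estimated.

```python
import numpy as np

def nap_projection(U, k):
    """Nuisance attribute projection (NAP) in sketch form [13, 27].

    U: (D, M) matrix whose columns are directions dominated by channel
    (nuisance) variability, e.g., within-speaker, across-channel
    differences of SVM supervectors.
    """
    V, _, _ = np.linalg.svd(U, full_matrices=False)
    V = V[:, :k]                          # rank-k orthonormal nuisance basis
    return np.eye(U.shape[0]) - V @ V.T   # P = I - V V^T

# Usage: apply the projection to supervectors before SVM training/scoring,
# e.g., x_compensated = nap_projection(U, k=40) @ x_supervector
```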

*Shanmugapriya, et al.* [29] propose a *fuzzy wavelet network* (*FWN*) which is a neural network with a wavelet activation function (known as a *Wavenet*). A fuzzy neural network is used in this case, with the wavelet activation function. Unfortunately, [29] only provides results for the TIMIT database [1] which is a database acquired under a clean and controlled environment and is not very challenging.

*Villalba, et al.* [30] attempt to detect two types of low-tech spoofing attempts. The first one is the use of a far-field microphone to record the victim's speech and then to play it back into a telephone handset. The second type is the concatenation of segments of short recordings to build the input required for a text-dependent speaker verification system. The former is handled by using an SVM classifier for spoof and non-spoof segments trained based on some training data. The latter is detected by comparing the pitch and MFCC feature contours of the enrollment and test segments using dynamic time warping (DTW).
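
A bare-bones DTW distance, of the kind that could compare enrollment and test pitch contours as in the second detector of [30], may be sketched as follows (Python/NumPy; the quadratic-time implementation and Euclidean local cost are illustrative choices). An abnormally low distance between enrollment and test contours would suggest a replayed or spliced recording.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(n*m) dynamic time warping distance between two contours
    (e.g., frame-wise pitch values or MFCC vectors)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(a[i - 1] - b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```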

## 4. Alternative features

As seen in the past, most classic features used in speech and speaker recognition are based on LPC, LPCC, or MFCC. In Section 6.3 we will see that *Dhanalakshmi, et al.* [19] tried these three classic features and showed that MFCC outperforms the other two. Also, *Beigi* [1] discusses many other features such as those generated by *wavelet filterbanks*, *instantaneous frequencies*, *EMD*, etc. In this section, we will discuss several new features, some of which are variations of cepstral coefficients with a different frequency scaling, such as *CFCC*, *LFCC*, *EFCC*, and *GFCC*. In Section 6.2 we will also see the *RMFCC*, which was used to handle speaker identification for gaming applications. Other features are also discussed which are more fundamentally different, such as those based on *missing feature theory* (*MFT*) and *local binary features*.


#### 4.1. Multitaper MFCC features

Standard MFCC features are usually computed using a periodogram estimate of the spectrum, with a window function such as the Hamming window [1]. MFCC features computed by this method exhibit a large variance. To reduce the variance, multitaper spectrum estimation techniques [31] have been used; they show lower bias and variance for the multitaper estimate of the spectrum. Although bias terms are generally small with the windowed periodogram estimate, the reduction in variance from multitaper estimation appears significant.

A multitaper estimate of a spectrum is made by taking the mean of periodogram estimates computed with a set of orthogonal windows (known as tapers). The multitaper approach has been around since the early 1980s. Examples of such taper estimates are *Thomson* [32], *Tukey's split cosine taper* [33], *sinusoidal tapers* [34], and *peak matched estimates* [35]. However, their use in computing MFCC features seems to be new. In Section 5.1, we will see that they have recently been used in conjunction with the i-vector formulation and have shown promising results.
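As an illustration, a multitaper spectrum estimate can be sketched in a few lines using SciPy's DPSS (Slepian) tapers; the taper count and time-bandwidth product below are illustrative defaults, not the settings of [31].

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=6, half_bandwidth=3.0):
    """Average the periodograms obtained with orthogonal DPSS (Slepian)
    tapers; each taper gives an independent-looking spectrum estimate,
    so their mean has a much lower variance than a single periodogram."""
    tapers = dpss(len(frame), NW=half_bandwidth, Kmax=n_tapers)   # (K, N)
    return (np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2).mean(axis=0)
```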

#### 4.2. Cochlear Filter Cepstral Coefficients (CFCC)

*Li, et al.* [36] present results for speaker identification using *cochlear filter cepstral coefficients* (*CFCC*) based on an auditory transform [37] while trying to emulate natural cochlear signal processing. They maintain that the CFCC features outperform MFCC, PLP, and RASTA-PLP features [1] under conditions with very low signal to noise ratios. Figure 1 shows the block diagram of the CFCC feature extraction proposed by *Li, et al.* [36]. The *auditory transform* is a *wavelet transform* which was proposed by *Li, et al.* [37]. It may be implemented in the form of a filter bank, as it is usually done for the extraction of MFCC features [1]. Equations 18 and 19 show a generic wavelet transform associated with one such filter.

**Figure 1.** Block Diagram of Cochlear Filter Cepstral Coefficient (CFCC) Feature Extraction – proposed by *Li, et al.* [36]

$$T(a,b) = \int\_{-\infty}^{\infty} h(t)\Psi\_{(a,b)}(t)dt\tag{18}$$

where

$$
\Psi\_{(a,b)}(t) = \frac{1}{\sqrt{|a|}} \Psi\left(\frac{t-b}{a}\right) \tag{19}
$$

The *wavelet basis functions* [1], {ψ(*a*,*<sup>b</sup>*)(*t*)}, are defined by *Li, et al.* [37], based on the *mother wavelet*, ψ(*t*) (Equation 20), which mimics the cochlear impulse response function.

$$\Psi(t) \stackrel{\Delta}{=} t^{\alpha} \exp\left[-2\pi h\_L \beta t\right] \cos\left[2\pi h\_L t + \theta\right] \tag{20}$$

Each wavelet basis function,according to the scaling and translation parameters *a* > 0 and *b* > 0 is, therefore, given by Equation 21.


$$\Psi\_{(a,b)}(t) = \frac{1}{\sqrt{|a|}} \left(\frac{t-b}{a}\right)^{\alpha} \exp\left[-2\pi h\_L \beta\left(\frac{t-b}{a}\right)\right] \cos\left[2\pi h\_L \left(\frac{t-b}{a}\right) + \theta\right] u\left(\frac{t-b}{a}\right) \tag{21}$$

In Equation 21, α and β are strictly positive parameters which define the shape and the bandwidth of the cochlear filter in the frequency domain. *Li, et al.* [36] determine them empirically for each filter in the filter bank. *u*(*t*) is the unit step (Heaviside) function defined by Equation 22.

$$u(t) \stackrel{\Delta}{=} \begin{cases} 1 \; \forall \; t \ge 0 \\ 0 \; \forall \; t < 0 \end{cases} \tag{22}$$
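Under the above definitions, one such basis function can be evaluated as in the sketch below; the parameter values passed in are placeholders, since [36] sets α and β empirically per filter.

```python
import numpy as np

def cochlear_basis(t, a, b, h_L, alpha, beta, theta=0.0):
    """Evaluate the cochlear filter basis of Equation 21 at times t (in
    seconds); a and b are the scaling and translation parameters, and
    alpha, beta, theta are the filter-specific shape parameters of [36]."""
    s = (t - b) / a
    sp = np.maximum(s, 0.0)        # the unit step of Equation 22 zeroes s < 0
    return (1.0 / np.sqrt(abs(a))) * sp ** alpha \
        * np.exp(-2.0 * np.pi * h_L * beta * sp) \
        * np.cos(2.0 * np.pi * h_L * sp + theta) * (s >= 0)
```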

#### 4.3. Linear and Exponential Frequency Cepstral Coefficients (LFCC and EFCC)

Some experiments have shown that using *linear frequency cepstral coefficients* (LFCC) and *exponential frequency cepstral coefficients* (EFCC) for processing unvoiced consonants may produce better results for speaker recognition. For instance, *Fan, et al.* [18] use an unvoiced-consonant detector to separate frames which contain such phones and use LFCC and EFCC features for these frames (see Section 3.2). These features are then used to train a GMM-based speaker recognition system. The remaining frames are sent to a GMM-based recognizer using MFCC features, and the two recognizers are treated as separate systems. At the recognition stage, the same segregation of frames is used and the scores of the two recognition engines are combined to reach the final decision.

The EFCC scale was proposed by *Bou-Ghazale, et al.* [15] and later used by *Fan, et al.* [18]. This mapping is given by

$$E = (10^{\frac{f}{k}} - 1)c \quad \forall \ 0 \le f \le 8000 Hz \tag{23}$$

where the two constants, *c* and *k*, are computed by solving Equations 24 and 25.

$$(10^{\frac{8000}{k}} - 1)c = 2595 \log \left( 1 + \frac{8000}{700} \right) \tag{24}$$

$$\{c, k\} = \min \left\{ \left| \left( 10^{\frac{4000}{k}} - 1 \right) - \frac{4000}{k^2} c \times 10^{\frac{4000}{k}} \ln(10) \right| \right\} \tag{25}$$

Equation 24 comes from the requirement that the exponential and Mel scale functions should be equal at the Nyquist frequency, and Equation 25 is the result of minimizing the absolute values of the partial derivatives of *E* in Equation 23 with respect to *c* and *k* for *f* = 4000*Hz* [18]. The resulting *c* and *k* which satisfy Equations 24 and 25 are computed by *Fan, et al.* [18] to be *c* = 6375 and *k* = 50000. Therefore, the exponential scale function is given by Equation 26.

$$E = 6375 \times \left(10^{\frac{f}{50000}} - 1\right) \tag{26}$$
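A quick numerical check of these constants, sketched below, confirms that the exponential scale of Equation 26 meets the Mel scale at the Nyquist frequency, as required by Equation 24.

```python
import numpy as np

c, k = 6375.0, 50000.0                                  # constants from [18]
efcc_scale = lambda f: c * (10.0 ** (f / k) - 1.0)      # Equation 26
mel_scale = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)

print(efcc_scale(8000.0))   # ~2839.7
print(mel_scale(8000.0))    # ~2840.0, matching at the Nyquist frequency
```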

*Fan, et al.* [18] show better accuracy for unvoiced consonants when EFCC is used over MFCC. However, they show even better accuracy when LFCC is used for these frames!

#### 4.4. Gammatone Frequency Cepstral Coefficients (GFCC)

*Shao, et al.* [38] use *gammatone frequency cepstral coefficients* (*GFCC*) as features, which are the products of a cochlear filter bank, based on psychophysical observations of the total auditory system. The Gammatone filter bank proposed by *Shao, et al.* [38] has 128 filters, centered from 50*Hz* to 8*kHz*, at equal partitions on the *equivalent rectangular bandwidth* (ERB) [39, 40] scale (Equation 28)<sup>3</sup>.

$$E\_c = \frac{1000}{(24.7 \times 4.37)} \ln(4.37 \times 10^{-3} f + 1) \tag{27}$$

$$= 21.4 \log\_{10}\left(4.37 \times 10^{-3} f + 1\right) \tag{28}$$


where *f* is the frequency in Hertz and *E* is the number of ERBs, in a similar fashion as Barks or Mels are defined [1]. The bandwidth, *Eb*, associated with each center frequency, *f* , is then given by Equation 29. Both *f* and *Eb* are in *Hertz (Hz)* [40].

$$E\_b = 24.7(4.37 \times 10^{-3} f + 1) \tag{29}$$

The impulse response of each filter is given by Equation 30.

$$g(f,t) \stackrel{\Delta}{=} \begin{cases} t^{(a-1)}e^{-2\pi bt}\cos(2\pi ft) & t \ge 0\\ 0 & \text{Otherwise} \end{cases} \tag{30}$$

where *t* denotes the time and *f* is the center frequency of the filter of interest. *a* is the order of the filter and is taken to be *a* = 4 [38], and *b* is the filter bandwidth.
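Putting Equations 28 through 30 together, the filter bank layout and a single impulse response can be sketched as follows; the sampling rate and response duration are illustrative choices, not taken from [38].

```python
import numpy as np

def erb_rate(f):
    """ERB number of frequency f in Hz (Equation 28)."""
    return 21.4 * np.log10(4.37e-3 * f + 1.0)

def erb_center_frequencies(f_lo=50.0, f_hi=8000.0, n=128):
    """n center frequencies equally spaced on the ERB scale, as in [38]."""
    e = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n)
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3         # invert Equation 28

def gammatone_impulse_response(fc, fs=16000, dur=0.05, a=4):
    """Equation 30 sampled at fs, with bandwidth b taken from Equation 29."""
    t = np.arange(int(dur * fs)) / fs
    b = 24.7 * (4.37e-3 * fc + 1.0)                     # Equation 29
    return t ** (a - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
```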

In addition, as is done with other features such as MFCC, LPCC, and PLP, the magnitude also needs to be warped. *Shao, et al.* [38] base their magnitude warping on the cubic-root warping (magnitude-to-loudness conversion) used in PLP [1].

The same group that published [38] followed up by using a *computational auditory scene analysis* (CASA) front-end [43] to estimate a binary spectrographic mask which determines the useful part of the signal (see Section 4.5), based on *auditory scene analysis* (ASA) [44]. They claim great improvements in noisy environments, over standard speaker recognition approaches.

<sup>3</sup> The ERB scale is similar to the Bark and Mel scales [1] and is computed by integrating an empirical differential equation proposed by *Moore and Glasberg* in *1983* [39] and then modified by them in *1990* [41]. It uses a set of rectangular filters to approximate human cochlear hearing and provides a more accurate approximation to the psychoacoustical scale (Bark scale) of *Zwicker* [42].

#### 4.5. Missing Feature Theory (MFT)

Missing feature theory (MFT) tries to deal with bandlimited speech in the presence of non-stationary background noise. Such missing-data techniques have been used in the speech community, mostly to handle applications of noisy speech recognition. *Vizinho, et al.* [45] describe such techniques as estimating the reliable regions of the spectrogram of speech and then using these reliable portions to perform speech recognition. They do this by estimating the noise spectrum and the SNR and by creating a mask that removes the noisy part of the spectrogram. In a related approach, some feature selection methods use Bayesian estimation to estimate a spectrographic mask which removes the unwanted part of the spectrogram, thereby discarding features which are attributed to the noisy part of the signal.

The goal of these techniques is to be able to handle non-stationary noise. *Seltzer, et al.* [46] propose one such Bayesian technique. This approach concentrates on extracting as much useful information from the noisy speech as it can, rather than trying to estimate the noise and subtract it from the signal, as is done by *Vizinho, et al.* [45]. However, there are many parameters which need to be optimized, making the process quite expensive and calling for a suboptimal search. *Pullella, et al.* [47] have combined the two techniques of spectrographic mask estimation and dynamic feature selection to improve the accuracy of speaker recognition under noisy conditions. *Lim, et al.* [48] propose an optimal mask estimation and feature selection algorithm.

#### 4.6. Local binary features (slice classifier)


The idea of statistical boosting is not new; it was proposed by several researchers, starting with *Schapire* [49] in 1990. The *Adaboost algorithm* was introduced by *Freund, et al.* [50] in 1996 as one specific boosting algorithm. The idea behind statistical boosting is that several weak classifiers may be combined to build a strong one.

*Rodriguez* [51] used the statistical boosting idea and several extensions of the Adaboost algorithm to introduce face detection and verification algorithms which would use features based on local differences between pixels in a 9×9 pixel grid, compared to the central pixel of the grid.

Inspired by [51], *Roy, et al.* [52] created local binary features based on the differences between bands of the *discrete Fourier transform* (*DFT*) to compare two models. One important claim for this classifier is that it is less prone to overfitting and that it performs better than conventional systems at low SNR values. The resulting features are binary because they are based on a threshold which categorizes the difference between different bands of the DFT as either 0 or 1. The classifier of [52] has a built-in discriminant nature, since it uses certain data as coming from impostors, in contrast with the data which is generated by the target speaker. The labels of impostor versus target allow for this built-in discrimination. The authors of [52] call these features *boosted binary features* (BBF). In a more recent paper [53], *Roy, et al.* refined their approach and renamed the method a *slice classifier*. They show similar results with this classifier, compared to the state of the art, but they explain that the method is less computationally intensive and is more suitable for use in mobile devices with limited resources.
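The flavor of such binary features can be sketched as below; the band count and threshold are illustrative stand-ins, not the published configuration of [52, 53].

```python
import numpy as np

def binary_band_features(frame, n_bands=16, threshold=0.0):
    """Binarize differences between neighboring DFT band log-energies:
    each feature is 1 when one band dominates its neighbor by more than
    the threshold, and 0 otherwise."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    band_energy = np.array([b.sum() for b in np.array_split(spectrum, n_bands)])
    log_e = np.log(band_energy + 1e-12)
    return (np.diff(log_e) > threshold).astype(np.uint8)
```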

## 5. Alternative speaker modeling

Classic modeling techniques for speaker recognition have used *Gaussian mixture models* (*GMM*), *support vector machines* (*SVM*), and *neural networks* [1]. In Section 6 we will see some other modeling techniques such as non-negative matrix factorization. Also, in Section 4, new modeling implementations were used in applying the new features presented in that section. Generally, most new modeling techniques use some transformation of the features in order to handle mismatch conditions, such as joint factor analysis (JFA), nuisance attribute projection (NAP), and principal component analysis (PCA) techniques such as the i-vector implementation [1]. In the next few sections, we will briefly look at some recent developments in these and other techniques.

#### 5.1. The i-vector model (total variability space)

*Dehak, et al.* [54] recombined the *channel variability space* in the JFA formulation [25] with the *speaker variability space*, since they discovered that there was considerable leakage from the speaker space into the channel space. The combined space produces a new projection (Equation 31) which resembles a PCA, rather than a factor analysis process.

$$\mathbf{y}\_n = \boldsymbol{\mu} + \mathbf{V}\boldsymbol{\theta}\_n\tag{31}$$
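For concreteness, the standard point estimate of θ from an utterance's Baum-Welch statistics can be sketched as follows (see, e.g., [26] for the exact formulation); the shapes and names here are assumptions for illustration.

```python
import numpy as np

def extract_ivector(T, sigma_inv, N, F_centered):
    """Point estimate of theta in Equation 31 from Baum-Welch statistics.
    Shapes assumed: T (C*F, R) total-variability matrix; sigma_inv (C*F,)
    inverse diagonal UBM covariances; N (C,) zero-order statistics;
    F_centered (C*F,) mean-centered first-order statistics."""
    C = N.shape[0]
    F = T.shape[0] // C
    n = np.repeat(N, F) * sigma_inv                  # per-dimension precisions
    L = np.eye(T.shape[1]) + T.T @ (n[:, None] * T)  # posterior precision of theta
    return np.linalg.solve(L, T.T @ (sigma_inv * F_centered))
```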


They called the new space the *total variability space* and, in their later works [55–57], referred to the projections of feature vectors into this space as *i-vectors*. *Speaker factor coefficients* are related to the speaker coordinates, in which each speaker is represented as a point. This space is defined by the *Eigenvoice matrix*. These speaker factor vectors are relatively short, on the order of 300 elements [58], which makes them desirable for use with *support vector machines*, as the observed vectors in the observation space (x).

Generally, in order to use an i-vector approach, several recording sessions are needed from the same speaker, to be able to compute the within-class covariance matrix for *within-class covariance normalization* (*WCCN*). Methods using *linear discriminant analysis* (*LDA*) along with WCCN [57] and, recently, *probabilistic LDA* (*PLDA*) with WCCN [59–62] have also shown promising results.
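A minimal sketch of WCCN, assuming several sessions per speaker and a full-rank within-class covariance, is given below; variable names are illustrative.

```python
import numpy as np

def wccn(ivectors, speakers):
    """Within-class covariance normalization: return the transform v -> B^T v,
    with B the Cholesky factor of the inverse within-class covariance W, so
    that inner products become v^T W^{-1} w."""
    X = np.asarray(ivectors, dtype=float)
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    labels = sorted(set(speakers))
    for spk in labels:
        Xi = X[[i for i, s in enumerate(speakers) if s == spk]]
        D = Xi - Xi.mean(axis=0)
        W += D.T @ D / len(Xi)
    W /= len(labels)
    B = np.linalg.cholesky(np.linalg.inv(W))   # W^{-1} = B B^T
    return lambda v: B.T @ v
```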

*Alam, et al.* [63] examined the use of *multitaper MFCC features* (see Section 4.1) in conjunction with the i-vector formulation. They show improved performance using multitaper MFCC features, compared to standard MFCC features which have been computed using a *Hamming window* [1].

*Glembek, et al.* [26] provide simplifications to the formulation of the *i-vectors* to reduce the memory usage and to increase the speed of computing the vectors. *Glembek, et al.* [26] also explore linear transformations using principal component analysis (PCA) and Heteroscedastic Linear Discriminant Analysis<sup>4</sup> (HLDA) [64] to achieve orthogonality of the components of the Gaussian mixture.

#### 5.2. Non-negative matrix factorization

In Section 6.3, we will see several implementations of extensions of non-negative matrix factorization [65, 66]. These techniques have been successfully applied to classification problems. More detail is given in Section 6.3.

#### 5.3. Using multiple models

In Section 3.2 we briefly covered a few model combination and selection techniques that use different specialized models to achieve better recognition rates. For example, *Fan, et al.* [18] used two different models to handle unvoiced consonants and the rest of the phones. Both models had a similar form, but they used slightly different types of features (MFCC vs. EFCC/LFCC). Similar ideas will be discussed in this section.

<sup>4</sup> Also known as Heteroscedastic Discriminant Analysis (HDA) [64]

#### *5.3.1. Frame-based score competition (FSC):*

In Section 3.2 we discussed the fact that *Jin, et al.* [17] used two separate models, one based on normal (neutral) speech and the second based on whisper data. At the recognition stage, each frame is evaluated against both models and the higher score is used [17]; hence the name *frame-based score competition* (FSC).
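A sketch of the per-frame competition follows, assuming model objects that expose a per-frame log-likelihood (as scikit-learn's GaussianMixture does via `score_samples`); this illustrates the idea of [17], not their implementation.

```python
import numpy as np

def fsc_score(frames, neutral_gmm, whisper_gmm):
    """Frame-based score competition: keep the larger of the two per-frame
    log-likelihoods and average over the utterance."""
    return np.maximum(neutral_gmm.score_samples(frames),
                      whisper_gmm.score_samples(frames)).mean()
```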

#### *5.3.2. SNR-Matched Recognition:*


After performing voice activity detection (VAD), *Bartos, et al.* [67] estimate the signal-to-noise ratio (SNR) of the part of the signal which contains speech. This value is used to load models which have been created with data recorded under similar SNR conditions. Generally, the SNR is computed in *decibels (dB)*, as given by Equations 32 and 33 – see [1] for more.

$$\text{SNR} = 10 \, \log\_{10} \left( \frac{\mathcal{P}\_s}{\mathcal{P}\_n} \right) \tag{32}$$

$$= 20 \, \log\_{10}\left(\frac{|H\_s(\omega)|}{|H\_n(\omega)|}\right) \tag{33}$$

*Bartos, et al.* [67] consider an SNR of 30dB or higher to be clean speech. An SNR of 30dB is equivalent to the signal amplitude being roughly 30 times that of the noise. When the SNR is 0dB, the signal amplitude is roughly the same as that of the noise.

Of course, to evaluate the SNR from Equation 32 or 33, we would need to know the power or amplitude of the noise as well as the true signal. Since this is not possible, estimation techniques are used to come up with an instantaneous SNR and to average that value over the whole signal. *Bartos, et al.* [67] present such an algorithm.

Once the SNR of the speech signal is computed, it is quantized into 4dB segments, and identification or verification is then done using models which have been enrolled with similar SNR values. This, according to [67], allows for a lower equal error rate in speaker verification trials. In order to generate speaker models for the different SNR levels (in 4dB steps), [67] degrades clean speech iteratively, using additive noise amplified by a constant gain associated with each 4dB level of degradation.
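The bin selection itself reduces to a simple quantization, sketched below with the 4dB step and 30dB clean-speech ceiling described above; the function name is illustrative.

```python
import math

def snr_model_bin(snr_db, step_db=4.0, clean_db=30.0):
    """Quantize an estimated SNR into 4 dB bins; recognition then uses the
    models enrolled (or degraded) at the matching bin, and anything at or
    above 30 dB is treated as clean speech."""
    return int(math.floor(min(snr_db, clean_db) / step_db) * step_db)

# e.g. snr_model_bin(17.3) -> 16: load the models built at the 16 dB level.
```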

## 6. Branch-specific progress

In this section, we will quickly review the latest developments for the main branches of speaker recognition as listed at the beginning of this chapter. Some of these have already been reviewed in the above sections. Most of the work on speaker recognition is performed on speaker verification. In the next section we will review some such systems.

#### 6.1. Verification

As we mentioned in Section 4, *Roy, et al.* [52, 53] used the so-called boosted binary features (slice classifier) for speaker verification. Also, we reviewed several developments regarding the i-vector formulation in Section 5.1. The i-vector approach has mostly been used for speaker verification, and many recent papers have dealt with aspects such as LDA, PLDA, and other discriminative aspects of the training.


*Salman, et al.* [68] use a neural network architecture with a large number of layers to perform greedy discriminative learning for the speaker verification problem. The *deep neural architecture* (*DNA*) proposed by [68] uses two identical subnets to process two MFCC feature vectors, respectively, providing discrimination between two speakers. They show promising results using this network.

*Sarkar, et al.* [69] use multiple background models associated with different *vocal tract length* (*VTL*) [1] estimates for the speakers, using MAP adaptation [1] to derive these background models from a root background model. Once the best VTL-based background model for the training or test audio is found, the transformation from that universal background model (UBM) to the root UBM is used to transform the features of the segment to those associated with the VTL of the root UBM. *Sarkar, et al.* [69] show that the results of this single-UBM system are comparable to those of a multiple background model system.

#### 6.2. Identification

In Section 5.3.2 we discussed new developments on SNR-matched recognition. The work of *Bartos, et al.* [67] was applied to improving speaker identification based on a matched SNR condition.

*Bharathi, et al.* [70] try to identify phonetic content for which specific speakers may be efficiently recognized. Using these speaker-specific phonemes, a special text is created to enhance the discrimination capability for the target speaker. The results are presented for the TIMIT database [1] which is a clean and controlled database and not very challenging. However, the idea seems to have merit.

*Cai, et al.* [71] use some of the features described in Section 4, such as MFCC and GFCC, to identify the voices of singers from a monophonic recording of songs in the presence of the sounds of several musical instruments.

*Do, et al.* [72] examine the speaker identification problem for identifying the person playing a computer game. The specific challenges are that the recording is done through a far-field microphone (see Section 3.4) and that the audio is generally short, being based on the commands used for gaming. To handle the reverberation and background noise, *Do, et al.* [72] argue for the use of the so-called *reverse Mel frequency cepstral coefficients* (*RMFCC*). They propose this set of features by reversing the triangular filters [1] used for computing the MFCC, such that the lower-frequency filters have larger bandwidths and the higher-frequency filters have smaller bandwidths; this is exactly the opposite of the filters used for MFCC. They also use LPC and *F*0 (the fundamental frequency) as additional features.

In Section 3.2 we saw the treatment of speaker identification for whispered speech in some detail. Also, *Ghiurcau, et al.* [73] study the effect of the emotional state of speakers on the results of speaker identification. The study treats happiness, anger, fear, boredom, sadness, and neutral conditions; it shows that these emotions significantly affect identification results. Therefore, [73] propose using emotion detection and having emotion-specific models. Once the emotion is identified, the proper model is used to identify the test speaker.

*Liu, et al.* [74] use the Hilbert-Huang transform to come up with new acoustic features. This uses the intrinsic mode decomposition described in detail in [1].

In the next section, we will look at the multi-class SVM which is used to perform speaker identification.

#### *6.2.1. Multi-Class SVM*


In Section 2.2 we discussed the popular one-against-all technique for handling multi-class SVM. Other, more recent techniques have been proposed in the last few years. One such technique is due to *Platt, et al.* [75], who proposed the so-called *decision directed acyclic graph* (*DDAG*), which produces a classification node for each pair of classes in a Γ-class problem. This leads to Γ(Γ−1)/2 classifiers and results in the *DAGSVM* algorithm [75].

*Wang* [76] presents a tree-based multi-class SVM which reduces the number of comparisons at recognition time to the order of log(Γ), although at the training phase the number of SVMs is similar to that of DDAG, namely Γ(Γ−1)/2. This can significantly reduce the amount of computation for speaker identification.
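The test-time evaluation of a DDAG can be sketched as follows, assuming a dictionary of trained pairwise classifiers; only Γ−1 of them are consulted per test vector.

```python
def ddag_classify(x, n_classes, pairwise):
    """DDAG evaluation: Gamma*(Gamma-1)/2 trained pairwise classifiers exist,
    but only Gamma-1 are evaluated per test vector. `pairwise[(i, j)](x)` is
    assumed to return the winning class label, i or j, with i < j."""
    candidates = list(range(n_classes))
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        winner = pairwise[(i, j)](x)
        candidates.remove(i if winner == j else j)   # drop the losing class
    return candidates[0]
```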

#### 6.3. Classification and diarization

Aside from the more prominent research on speaker verification and identification, audio source and gender classification are also quite important in most audio processing systems including speaker and speech recognition.

In many practical audio processing systems, it is important to determine the type of audio. For instance, consider a telephone-based system which includes a speech recognizer. Such recognition engines would produce spurious results if they were presented with non-speech, say music. These results may be detrimental to the operation of an automated process. This is also true for speaker identification and verification systems which expect to receive human speech; they may be confused if presented with music or other types of audio such as noise. For *text-independent speaker identification* systems, this may result in mis-identifying the audio as a viable choice in the database, with dire consequences!

Similarly, some systems are only interested in processing music. An example is a music search system which would look for a specific music or one resembling the presented segment. These systems may be confused, if presented with human speech, uttered inadvertently, while only music is expected.

As an example, an important goal for audio source classification research is to develop filters which would tag a segment of audio as speech, music, noise, or silence [77]. Sometimes, we would also look into classifying the genre of audio or video such as movie, cartoon, news, advertisement, etc. [19].

The basic problem contains two separate parts. The first part is the segmentation of the audio stream into segments of similar content. This work has been under development for the past few decades with some good results [78–80].

The second part is the classification of each segment into relevant classes such as speech, music, or the rejection of the segment as silence or noise. Furthermore, when the audio type is *human speech*, it is desirable to do a further classification to determine the gender of the individual speaker. *Gender classification* [77] is helpful in choosing appropriate models for conducting better speech recognition, more accurate speaker verification, and reducing the computation load in large-scale speaker identification. For the speaker diarization problem, the identity of the speaker also needs to be recognized.

*Dhanalakshmi, et al.* [19] report developments in classifying the genre of audio, as stemming from different video sources, containing movies, cartoons, news, etc. *Beigi* [77] uses a *text* and *language* *independent* speaker recognition engine to achieve these goals by performing audio classification. The classification problem is posed by *Beigi* [77] as an identification problem among a series of speech, music, and noise models.


#### *6.3.1. Age and Gender Classification*

Another goal for classification is to be able to classify age groups. *Bocklet, et al.* [81] categorized the age of the individuals, in relation to their voice quality, into 4 categories (classes). These classes are given by Table 1. With the natural exception of the child group (13 years or younger), each group is further split into the two male and female genders, leading to 7 total age-gender classes.


| Class | Age Range |
|--------|---------------------------------|
| Child | Age ≤ 13 years |
| Young | 14 years ≤ Age ≤ 19 years |
| Adult | 20 years ≤ Age ≤ 64 years |
| Senior | Age ≥ 65 years |

**Table 1.** Age Categories According to Vocal Similarities – From [81]


**Table 2.** Age Categories According to Vocal Similarities – From [82]

*Bahari, et al.* [82] use a slightly different definition of age groups, compared to those used by [81]. They use 3 age groups for each gender, not considering individuals who are less than 18 years old. These age categories are given in Table 2.

They use *weighted supervised non-negative matrix factorization* (*WSNMF*) to classify the age and gender of the individual. This technique combines *weighted non-negative matrix factorization* (*WNMF*) [83] and *supervised non-negative matrix factorization* (*SNMF*) [84] which are themselves extensions of *non-negative matrix factorization* (*NMF*) [65, 66]. NMF techniques have also been successfully used in other classification implementations such as that of the identification of musical instruments [85].

NMF distinguishes itself as a method which only allows additive components that are considered to be parts of the information contained in an entity. Due to their additive and positive nature, the components are each considered to be part of the information that builds up a description. In contrast, methods such as principal component analysis and vector quantization are considered to learn holistic information and hence are not considered to be parts-based [66]. According to the image recognition example presented by *Lee, et al.* [66], a PCA method such as Eigenfaces [86, 87] provides a distorted version of the whole face, whereas NMF provides localized features that are related to the parts of each face.
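The basic NMF factorization referred to above can be sketched with the classic multiplicative updates of [65, 66]; the squared-error variant below is a minimal illustration, not the WSNMF of [82].

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for V ~ W H under squared error; both factors
    stay non-negative, so each of the r columns of W acts as an additive part."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```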

Subsequent to applying WSNMF, according to the age and gender, *Bahari, et al.* [82] use a *general regression neural network* (*GRNN*) to estimate the age of the individual. They show a gender classification accuracy of about 96% and an average age classification accuracy of about 48%. Although the result depends on the data being used, an accuracy of 96% for gender classification is not necessarily a great result. It is hard to make a qualitative assessment without running the same algorithms under the same conditions and on exactly the same data, but *Beigi* [77] shows 98.1% accuracy for gender classification.
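A GRNN is, in essence, a kernel-weighted average of the training targets. The following is a minimal sketch of the standard GRNN estimator, not the specific configuration of [82]; the bandwidth and toy data are illustrative.

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=1.0):
    """Standard GRNN (Nadaraya-Watson) estimate: a Gaussian-kernel
    weighted average of the training targets, with bandwidth sigma."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)

# Toy usage: estimate age from 2-dimensional feature vectors.
X = np.random.randn(50, 2)
ages = 20 + 10 * X[:, 0] + np.random.randn(50)
print(grnn_predict(X, ages, np.array([0.5, -0.2]), sigma=0.7))
```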

*Beigi* [77] uses a text-independent speaker recognition engine to achieve these goals by performing audio classification. The classification problem is posed by *Beigi* [77] as an identification problem among a series of speech, music, and noise models.

In [77], 700 male and 700 female speakers were selected, completely at random, from over 70,000 speakers. The speakers were non-native speakers of English, at a variety of proficiency levels, speaking freely. This introduced a significantly higher number of pauses in each recording, as well as a higher than average number of humming sounds while the candidates thought about their responses. The segments were live responses of these non-native speakers to test questions in English, aimed at evaluating their linguistic proficiency.

*Dhanalakshmi, et al.* [19] also present a method based on an *auto-associative neural network* (*AANN*) for performing audio source classification. AANN is a special branch of feedforward neural networks which tries to learn the nonlinear principal components of a feature vector. The way this is accomplished is that the network consists of three layers, an input layer, an output layer of the same size, and a hidden layer with a smaller number of neurons. The input and output neurons generally have linear activation functions and the hidden (middle) layer has nonlinear functions.

In the training phase, the input and target output vectors are identical. This allows the system to learn the principal components of patterns which most likely contain built-in redundancies. Once such a network is trained, a feature vector undergoes a dimensionality reduction and is then mapped back to a space of the same dimension as the input space. If the training procedure achieves a good reduction in the output error over the training samples, and if the training samples are representative of reality and span the operating conditions of the true system, the network can learn the essential information in the input signal. Autoassociative neural networks (AANN) have also been successfully used in speaker verification [88].
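A minimal numpy sketch of the three-layer AANN topology just described follows: linear input/output activations, a smaller nonlinear hidden layer, and the input reused as the training target. The layer sizes, learning rate, and data are illustrative stand-ins, not the configurations of [19] or [88].

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal three-layer AANN: linear output, tanh bottleneck.
d, h = 12, 4                      # feature and hidden (bottleneck) sizes
W1 = rng.normal(0, 0.1, (h, d)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (d, h)); b2 = np.zeros(d)

X = rng.normal(size=(500, d))     # stand-in for feature vectors
lr = 0.01
for epoch in range(200):
    Z = np.tanh(X @ W1.T + b1)            # nonlinear hidden layer
    Y = Z @ W2.T + b2                     # linear reconstruction
    E = Y - X                             # target equals the input
    # Backpropagate the mean squared reconstruction error.
    gW2 = E.T @ Z / len(X); gb2 = E.mean(axis=0)
    dZ = (E @ W2) * (1 - Z ** 2)
    gW1 = dZ.T @ X / len(X); gb1 = dZ.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print("reconstruction MSE:",
      np.mean((X - (np.tanh(X @ W1.T + b1) @ W2.T + b2)) ** 2))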


**Table 3.** Audio Classification Categories used by [19]

*Dhanalakshmi, et al.* [19] use the audio classes represented in Table 3. They consider three different front-end processors for extracting features, used with two different modeling techniques. The features are LPC, LPCC, and MFCC features [1]. The models are Gaussian mixture models (GMM) and autoassociative neural networks (AANN) [1]. According to these experiments, *Dhanalakshmi, et al.* [19] show consistently higher classification accuracies with MFCC features over LPC and LPCC features. The comparison between AANN and GMM is somewhat inconclusive and both systems seem to produce similar results. Although the accuracy of AANN with LPC and LPCC seems to be higher than that of GMM modeling, for the case when MFCC features are used the difference seems insignificant. Especially given the fact that GMM are simpler to implement than AANN and are less prone to problems such as encountering local minima, it makes sense to conclude that the combination of MFCC and GMM still provides the best results in audio classification. A combination of GMM with MFCC features and Maximum a-Posteriori (MAP) adaptation provides very simple and considerable results for gender classification, as seen in [77].
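The GMM-MAP combination mentioned above can be illustrated with the classical relevance-factor formulation of mean-only MAP adaptation. This is a generic sketch, not the exact implementation of [77]; the relevance factor and toy data are illustrative.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM in the
    classical relevance-factor form.
    X: (T, d) features; weights: (M,); means, variances: (M, d)."""
    # Log-likelihood of each frame under each diagonal Gaussian.
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)       # (M,)
    diff = X[:, None, :] - means[None, :, :]                          # (T, M, d)
    log_p = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)      # (T, M)
    log_p += np.log(weights)
    # Posterior (responsibility) of each mixture for each frame.
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n = gamma.sum(axis=0)                                             # (M,)
    Ex = gamma.T @ X / np.maximum(n, 1e-10)[:, None]                  # (M, d)
    alpha = n / (n + r)
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * means

# Toy usage: adapt a 2-mixture background model toward 100 frames.
rng = np.random.default_rng(0)
ubm_w = np.array([0.5, 0.5])
ubm_mu = np.zeros((2, 3)); ubm_var = np.ones((2, 3))
frames = rng.normal(0.5, 1.0, (100, 3))
print(map_adapt_means(frames, ubm_w, ubm_mu, ubm_var))
```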

#### *6.3.2. Music Modeling*

*Beigi* [77] classifies musical instruments along with noise and gender of speakers. Much in the same spirit as described in Section 6.3.1, [77] has made an effort to choose a variety of different instruments or sets of instruments to be able to cover most types of music. Table 4 shows these choices. A total of 14 different music models were trained to represent all music, with an attempt to cover different types of timbre [89].


An equal amount of music was chosen by *Beigi* [77] to create a balance in the quantity of data, reducing any bias toward speech or music. The music was downsampled from its original quality to 8*kHz*, using 8-bit µ-Law amplitude encoding, in order to match the quality of speech. The 1400 segments of music were chosen at random from European style classical music, as well as jazz, Persian classical, Chinese classical, folk, and instructional performances. Most of the music samples were orchestral pieces, with some solos and duets present.
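For reference, the 8-bit µ-Law amplitude encoding mentioned here is the standard telephony companding curve; the following minimal sketch is illustrative of that encoding, not of the exact pipeline of [77].

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Standard 8-bit mu-law companding of samples in [-1, 1]:
    y = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), quantized to 8 bits."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

# Toy usage: compand one second of a 440 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
print(mu_law_encode(0.5 * np.sin(2 * np.pi * 440 * t))[:10])
```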

Although a very low quality audio, based on highly compressed telephony data (AAC compressed [1]), was used by *Beigi* [77], the system achieved a 1% error rate in discriminating between speech and music and a 1.9% error in determining the gender of individual speakers once the audio is tagged as speech.


| Category | Model | Category | Model | Category | Model |
|---|---|---|---|---|---|
| Noise | Noise | Speech | Female | Speech | Male |
| Music | Accordion | Music | Bassoon | Music | Clarinet |
| Music | Clavier | Music | Gamelon | Music | Guzheng |
| Music | Guitar | Music | Oboe | Music | Orchestra |
| Music | Piano | Music | Pipa | Music | Tar |
| Music | Throat | Music | Violin | | |

**Table 4.** Audio Models used for Classification

*Beigi* [77] has shown that MAP adaptation techniques used with GMM models and MFCC features may be used successfully for the classification of audio into speech and music and to further classify the speech by the gender of the speaker and the music by the type of instrument being played.
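As a sketch of how such a classifier can be scored, the following chooses, among a set of diagonal-covariance GMMs, the model with the highest average per-frame log-likelihood. The model names and parameters are hypothetical placeholders, not the actual models of [77].

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of features X (T, d) under a
    diagonal-covariance GMM, using the log-sum-exp trick."""
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    diff = X[:, None, :] - means[None, :, :]
    log_p = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2) + np.log(weights)
    m = log_p.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))))

def classify(X, models):
    """Return the label of the model with the highest likelihood."""
    return max(models, key=lambda name: gmm_loglik(X, *models[name]))

# Toy usage with two hypothetical class models over 3-dim features.
rng = np.random.default_rng(1)
models = {
    "speech-female": (np.array([1.0]), np.zeros((1, 3)), np.ones((1, 3))),
    "music-piano": (np.array([1.0]), np.full((1, 3), 2.0), np.ones((1, 3))),
}
print(classify(rng.normal(2.0, 1.0, (200, 3)), models))
```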

## 7. Open problems

With all the new accomplishments of the last couple of years, covered here, and many that did not make it to our list due to shortage of space, there is still a lot more work to be done. Although incremental improvements are made every day in all branches of speaker recognition, channel and audio-type mismatch still seem to be the biggest hurdles in reaching perfect results. It should be noted that perfect results are *asymptotes* and will probably never be reached. Inherently, as the size of the population in a speaker database grows, the intra-speaker variations exceed the inter-speaker variations. This is the main source of error for large-scale speaker identification, which is the holy grail of the different goals in speaker recognition. In fact, if large-scale speaker identification approaches acceptable results, most other branches of the field may be considered trivial. However, this is quite a complex problem and will definitely need a lot more time to be perfected, if it is indeed possible to do so. In the meanwhile, we still seem to be in our infancy when it comes to large-scale identification.

## Author details


Homayoon Beigi

President of Recognition Technologies, Inc. and an Adjunct Professor of Computer Science and Mechanical Engineering at Columbia University

Recognition Technologies, Inc., Yorktown Heights, New York, USA

## References


[1] Homayoon Beigi. *Fundamentals of Speaker Recognition*. Springer, New York, 2011. ISBN: 978-0-387-77591-3.

[2] Homayoon Beigi. Speaker recognition. In Jucheng Yang, editor, *Biometrics*, pages 3–28. Intech Open Access Publisher, Croatia, 2011. ISBN: 978-953-307-618-8.

[3] I. Pollack, J. M. Pickett, and W. H. Sumby. On the identification of speakers by voice. *Journal of the Acoustical Society of America*, 26(3):403–406, May 1954.

[4] J. N. Shearme and J. N. Holmes. An experiment concerning the recognition of voices. *Language and Speech*, 2(3):123–131, 1959.

[5] Francis Nolan. *The Phonetic Bases of Speaker Recognition*. Cambridge University Press, New York, 1983. ISBN: 0-521-24486-2.

[6] Harry Hollien. *The Acoustics of Crime: The New Science of Forensic Phonetics (Applied Psycholinguistics and Communication Disorder)*. Springer, Heidelberg, 1990.

[7] Harry Hollien. *Forensic Voice Identification*. Academic Press, San Diego, CA, USA, 2001.

[8] Amy Neustein and Hemant A. Patil. *Forensic Speaker Recognition – Law Enforcement and Counter-Terrorism*. Springer, Heidelberg, 2012.

[9] Sandra Pruzansky. Pattern matching procedure for automatic talker recognition. *Journal of the Acoustical Society of America*, 35(3):354–358, Mar 1963.

[10] Sandra Pruzansky, Max V. Mathews, and P. B. Britner. Talker-recognition procedure based on analysis of variance. *Journal of the Acoustical Society of America*, 35(11):1877, Apr 1963.

[11] Geoffrey J. McLachlan and David Peel. *Finite Mixture Models*. Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 2nd edition, 2000. ISBN: 0-471-00626-2.

[12] Vladimir Naumovich Vapnik. *Statistical Learning Theory*. John Wiley, New York, 1998. ISBN: 0-471-03003-1.

[13] A. Solomonoff, W. Campbell, and C. Quillen. Channel compensation for SVM speaker recognition. In *The Speaker and Language Recognition Workshop (Odyssey 2004)*, volume 1, pages 57–62, 2004.

[14] Robbie Vogt and Sridha Sridharan. Explicit modelling of session variability for speaker verification. *Computer Speech and Language*, 22(1):17–38, Jan 2008.

[15] Sahar E. Bou-Ghazale and John H. L. Hansen. A comparative study of traditional and newly proposed features for recognition of speech under stress. *IEEE Transactions on Speech and Audio Processing*, 8(4):429–442, Jul 2002.

[16] Eliott D. Canonge. Voiceless vowels in Comanche. *International Journal of American Linguistics*, 23(2):63–67, Apr 1957. Published by: The University of Chicago Press.

[17] Qin Jin, Szu-Chen Stan Jou, and T. Schultz. Whispering speaker identification. In *Multimedia and Expo, 2007 IEEE International Conference on*, pages 1027–1030, Jul 2007.

[18] Xing Fan and J. H. L. Hansen. Speaker identification within whispered speech audio streams. *Audio, Speech, and Language Processing, IEEE Transactions on*, 19(5):1408–1421, Jul 2011.

[19] P. Dhanalakshmi, S. Palanivel, and V. Ramalingam. Classification of audio signals using AANN and GMM. *Applied Soft Computing*, 11(1):716–723, 2011.

[20] Lucas C. Parra and Christopher V. Alvino. Geometric source separation: merging convolutive source separation with geometric beamforming. *IEEE Transactions on Speech and Audio Processing*, 10(6):352–362, Sep 2002.

[21] K. Kumatani, U. Mayer, T. Gehrig, E. Stoimenov, and M. Wolfel. Minimum mutual information beamforming for simultaneous active speakers. In *IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)*, pages 71–76, Dec 2007.

[22] M. Lincoln. The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments. In *IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)*, pages 357–362, Nov 2005.

[23] R. Takashima, T. Takiguchi, and Y. Ariki. HMM-based separation of acoustic transfer function for single-channel sound source localization. pages 2830–2833, Mar 2010.

[24] C. Barras and J.-L. Gauvain. Feature and score normalization for speaker verification of cellular data. In *Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on*, volume 2, pages II–49–52, Apr 2003.

[25] P. Kenny. Joint factor analysis of speaker and session variability: Theory and algorithms. Technical report, CRIM, Jan 2006.

[26] Ondrej Glembek, Lukas Burget, Pavel Matejka, Martin Karafiat, and Patrick Kenny. Simplification and optimization of i-vector extraction. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4516–4519, May 2011.

[27] W. M. Campbell, D. E. Sturim, W. Shen, D. A. Reynolds, and J. Navratil. The MIT-LL/IBM 2006 speaker recognition system: High-performance reduced-complexity recognition. In *Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on*, volume 4, pages IV–217–IV–220, Apr 2007.

[28] Hyunson Seo, Chi-Sang Jung, and Hong-Goo Kang. Robust session variability compensation for SVM speaker verification. *Audio, Speech, and Language Processing, IEEE Transactions on*, 19(6):1631–1641, Aug 2011.

[29] P. Shanmugapriya and Y. Venkataramani. Implementation of speaker verification system using fuzzy wavelet network. In *Communications and Signal Processing (ICCSP), 2011 International Conference on*, pages 460–464, Feb 2011.

[30] J. Villalba and E. Lleida. Preventing replay attacks on speaker verification systems. In *Security Technology (ICCST), 2011 IEEE International Carnahan Conference on*, pages 1–8, Oct 2011.

[31] Johan Sandberg, Maria Hansson-Sandsten, Tomi Kinnunen, Rahim Saeidi, Patrick Flandrin, and Pierre Borgnat. Multitaper estimation of frequency-warped cepstra with application to speaker verification. *IEEE Signal Processing Letters*, 17(4):343–346, Apr 2010.

[32] David J. Thomson. Spectrum estimation and harmonic analysis. *Proceedings of the IEEE*, 70(9):1055–1096, Sep 1982.

[33] Kurt S. Riedel, Alexander Sidorenko, and David J. Thomson. Spectral estimation of plasma fluctuations. I. Comparison of methods. *Physics of Plasmas*, 1(3):485–500, 1994.

[34] Kurt S. Riedel. Minimum bias multiple taper spectral estimation. *IEEE Transactions on Signal Processing*, 43(1):188–195, Jan 1995.

[35] Maria Hansson and Göran Salomonsson. A multiple window method for estimation of peaked spectra. *IEEE Transactions on Signal Processing*, 45(3):778–781, Mar 1997.

[36] Qi Li and Yan Huang. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions. *Audio, Speech, and Language Processing, IEEE Transactions on*, 19(6):1791–1801, Aug 2011.

[37] Qi Peter Li. An auditory-based transform for audio signal processing. In *IEEE Workshop on Applications of Signal Processing to Audio and Acoustics*, pages 181–184, Oct 2009.

[38] Yang Shao and DeLiang Wang. Robust speaker identification using auditory features and computational auditory scene analysis. In *Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on*, pages 1589–1592, 2008.

[39] Brian C. J. Moore and Brian R. Glasberg. Suggested formulae for calculating auditory-filter bandwidths and excitation. *Journal of the Acoustical Society of America*, 74(3):750–753, 1983.

[40] Brian C. J. Moore and Brian R. Glasberg. A revision of Zwicker's loudness model. *Acta Acustica*, 82(2):335–345, Mar/Apr 1996.

[41] Brian R. Glasberg and Brian C. J. Moore. Derivation of auditory filter shapes from notched-noise data. *Hearing Research*, 47(1–2):103–138, 1990.

[42] E. Zwicker, G. Flottorp, and Stanley Smith Stevens. Critical band width in loudness summation. *Journal of the Acoustical Society of America*, 29(5):548–557, 1957.

[43] Xiaojia Zhao, Yang Shao, and DeLiang Wang. Robust speaker identification using a CASA front-end. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 5468–5471, May 2011.

[44] Albert S. Bregman. *Auditory Scene Analysis: The Perceptual Organization of Sound*. Bradford, 1994.

[45] A. Vizinho, P. Green, M. Cooke, and L. Josifovski. Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study. In *Eurospeech 1999*, pages 2407–2410, Sep 1999.

[46] Michael L. Seltzer, Bhiksha Raj, and Richard M. Stern. A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. *Speech Communication*, 43(4):379–393, 2004.

[47] D. Pullella, M. Kuhne, and R. Togneri. Robust speaker identification using combined feature selection and missing data recognition. In *Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on*, pages 4833–4836, 2008.

[48] Shin-Cheol Lim, Sei-Jin Jang, Soek-Pil Lee, and Moo Young Kim. Hard-mask missing feature theory for robust speaker recognition. *Consumer Electronics, IEEE Transactions on*, 57(3):1245–1250, Aug 2011.

[49] R. E. Schapire. The strength of weak learnability. *Machine Learning*, 5(2):197–227, 1990.

[50] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In *Proceedings of the Thirteenth International Conference on Machine Learning (ICML)*, pages 148–156, 1996.

[51] Yann Rodriguez. *Face Detection and Verification Using Local Binary Patterns*. Ecole Polytechnique Fédérale de Lausanne, 2006. PhD Thesis.

[52] Anindya Roy, Mathew Magimai-Doss, and Sébastien Marcel. Boosted binary features for noise-robust speaker verification. volume 6, pages 4442–4445, Mar 2010.

[53] A. Roy, M. M. Doss, and S. Marcel. A fast parts-based approach to speaker verification using boosted slice classifiers. *IEEE Transactions on Information Forensics and Security*, 7(1):241–254, 2012.

[54] Najim Dehak, Réda Dehak, Patrick Kenny, Niko Brummer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In *InterSpeech*, pages 1559–1562, Sep 2009.

[55] Najim Dehak, Reda Dehak, James Glass, Douglas Reynolds, and Patrick Kenny. Cosine similarity scoring without score normalization techniques. In *The Speaker and Language Recognition Workshop (Odyssey 2010)*, pages 15–19, Jun–Jul 2010.

[56] Mohammed Senoussaoui, Patrick Kenny, Najim Dehak, and Pierre Dumouchel. An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In *The Speaker and Language Recognition Workshop (Odyssey 2010)*, pages 28–33, June 2010.

[57] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. *IEEE Transactions on Audio, Speech and Language Processing*, 19(4):788–798, May 2011.

[58] Najim Dehak, Patrick Kenny, Réda Dehak, O. Glembek, Pierre Dumouchel, L. Burget, V. Hubeika, and F. Castaldo. Support vector machines and joint factor analysis for speaker verification. In *Acoustics, Speech and Signal Processing (ICASSP), 2009 IEEE International Conference on*, pages 4237–4240, Apr 2009.

[59] M. Senoussaoui, P. Kenny, P. Dumouchel, and F. Castaldo. Well-calibrated heavy-tailed Bayesian speaker verification for microphone speech. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4824–4827, May 2011.

[60] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brümmer. Discriminatively trained probabilistic linear discriminant analysis for speaker verification. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4832–4835, May 2011.

[61] S. Cumani, N. Brummer, L. Burget, and P. Laface. Fast discriminative speaker verification in the i-vector space. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4852–4855, May 2011.

[62] P. Matejka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Cernocky. Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4828–4831, May 2011.

[63] M. J. Alam, T. Kinnunen, P. Kenny, P. Ouellet, and D. O'Shaughnessy. Multi-taper MFCC features for speaker verification using i-vectors. In *Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on*, pages 547–552, Dec 2011.

[64] Nagendra Kumar and Andreas G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. *Speech Communication*, 26(4):283–297, 1998.

[65] D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization. *Nature*, 401(6755):788–791, 1999.

[66] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. *Advances in Neural Information Processing Systems*, 13:556–562, 2001.

[67] A. L. Bartos and D. J. Nelson. Enabling improved speaker recognition by voice quality estimation. In *Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty-Fifth Asilomar Conference on*, pages 595–599, Nov 2011.

[68] A. Salman and Ke Chen. Exploring speaker-specific characteristics with deep learning. In *Neural Networks (IJCNN), The 2011 International Joint Conference on*, pages 103–110, 2011.

[69] A. K. Sarkar and S. Umesh. Use of VTL-wise models in feature-mapping framework to achieve performance of multiple-background models in speaker verification. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4552–4555, May 2011.

[70] B. Bharathi, P. Vijayalakshmi, and T. Nagarajan. Speaker identification using utterances correspond to speaker-specific-text. In *Students' Technology Symposium (TechSym), 2011 IEEE*, pages 171–174, Jan 2011.

[71] Wei Cai, Qiang Li, and Xin Guan. Automatic singer identification based on auditory features. In *Natural Computation (ICNC), 2011 Seventh International Conference on*, volume 3, pages 1624–1628, Jul 2011.

[72] Hoang Do, I. Tashev, and A. Acero. A new speaker identification algorithm for gaming scenarios. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 5436–5439, May 2011.

[73] M. V. Ghiurcau, C. Rusu, and J. Astola. A study of the effect of emotional state upon text-independent speaker identification. In *Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on*, pages 4944–4947, May 2011.

[74] Jia-Wei Liu, Jia-Ching Wang, and Chang-Hong Lin. Speaker identification using HHT spectrum features. In *Technologies and Applications of Artificial Intelligence (TAAI), 2011 International Conference on*, pages 145–148, Nov 2011.

[75] John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K. R. Müller, editors, *Advances in Neural Information Processing Systems*. MIT Press, Boston, 2000.

[76] Yuguo Wang. A tree-based multi-class SVM classifier for digital library document. In *International Conference on MultiMedia and Information Technology (MMIT)*, pages 15–18, Dec 2008.

[77] Homayoon Beigi. Audio source classification using speaker recognition techniques. World Wide Web, Feb 2011. Report No. RTI-20110201-01.

[78] Homayoon S. M. Beigi and Stephane H. Maes. Speaker, channel and environment change detection. Technical Report, 1997.

[79] Homayoon S. M. Beigi and Stephane S. Maes. Speaker, channel and environment change detection. In *Proceedings of the World Congress on Automation (WAC1998)*, May 1998.

[80] Scott Shaobing Chen and Ponani S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In *IBM Technical Report, T.J. Watson Research Center*, 1998.

[81] Tobias Bocklet, Andreas Maier, Josef G. Bauer, Felix Burkhardt, and Elmar Nöth. Age and gender recognition for telephone applications based on GMM supervectors and support vector machines. pages 1605–1608, Apr 2008.

[82] M. H. Bahari and H. Van Hamme. Speaker age estimation and gender detection based on supervised non-negative matrix factorization. In *Biometric Measurements and Systems for Security and Medical Applications (BIOMS), 2011 IEEE Workshop on*, pages 1–6, Sep 2011.

[83] N. Ho. *Nonnegative Matrix Factorization Algorithms and Applications*. Université Catholique de Louvain, 2008. PhD Thesis.

[84] H. Van-Hamme. HAC-models: A novel approach to continuous speech recognition. In *Interspeech*, pages 2554–2557, Sep 2008.

[85] Emmanouil Benetos, Margarita Kotti, and Constantine Kotropoulos. Large scale musical instrument identification. In *Proceedings of the 4th Sound and Music Computing Conference*, pages 283–286, Jul 2007.

[86] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 12(1):103–108, Jan 1990.

[87] M. Turk and A. Pentland. Eigenfaces for recognition. *Journal of Cognitive Neuroscience*, 3:71–86, 1991.

[88] S. P. Kishore and B. Yegnanarayana. Speaker verification: minimizing the channel effects using autoassociative neural network models. In *Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on*, volume 2, pages II1101–II1104, Jun 2000.

[89] Keith Dana Martin. *Sound-Source Recognition: A Theory and Computational Model*. Massachusetts Institute of Technology, Cambridge, MA, 1999. PhD Thesis.



## **Chapter 2**

## **3D and Thermo-Face Fusion**

Štěpán Mráček, Jan Váňa, Radim Dvořák, Martin Drahanský and Svetlana Yanushkevich

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/3420

## **1. Introduction**

Most biometric-based systems use a combination of various biometrics to improve reliability of decision. These systems are called multi-modal biometric systems. For example, they can include video, infrared, and audio data for identification of appearance (encompassing natural changes such as aging, and intentional ones, such as surgical changes), physiological characteristics (temperature, blood flow rate), and behavioral features (voice and gait) [1].

Biometric technologies, in a narrow sense, are tools and techniques for identification of humans, and in a wide sense, they can be used for detection of alert information, prior to, or together with, the identification. For example, biometric data such as temperature, blood pulse, pressure, and 3D topology of a face (natural or changed topology using various artificial implants, etc.) must be detected first at distance, while the captured face can be further used for identification. Detection of biometric features, which are ignored in identification, is useful in design of Physical Access Security Systems (PASS) [2][3]. In the PASS, the situational awareness data (including biometrics) is used at the first phase, and the available resources for identification of a person (including biometrics) are utilized at the second phase.

Conceptually, a new generation of the biometric-based systems shall include a set of biometric-based assistants; each of them deals with uncertainty independently, and maximizes its contribution to a joint decision. In this design concept, the biometric system possesses such properties as modularity, reconfiguration, aggregation, distribution, parallelism, and mobility. Decision-making in such a system is based on the concept of fusion. In a complex system, the fusion is performed at several levels. In particular, the face biometrics is considered to be the three-fold source of information, as shown in Figure 1.

In this chapter, we consider two types of the biometric-based assistants, or modules, within a biometric system:

**•** A thermal, or infrared range assistant,

**•** A 3D visual range assistant.

© 2012 Mráček et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


We illustrate the concept of fusion at the recognition component, which is a part of a more complex decision-making level. Both methods are described in terms of data acquisition, image processing and recognition algorithms. The general facial recognition approach, based on the algorithmic fusion of the two methods, is presented, and its performance is evaluated on both 3D and thermal face databases.

**Figure 1.** Three sources of information in facial biometrics: a 3D face model (left), a thermal image (center) and a visual model with added texture (right).

Facial biometrics, based on 3D data and infrared images, enhance the classical face recognition. Adding depth information, as well as the information about the surface temperature, may reveal additional discriminative abilities, and thus improve recognition performance. Furthermore, it is much harder to forge a 3D, or thermal, model of the face.

The following sections provide an overview of how 3D and infrared facial biometrics work, and what is needed in terms of data acquisition and algorithms. The first section deals with the 3D face recognition. Thermal face recognition is described in the second section. Next, a general method for recognition, of both 3D and thermal images, is presented. The fusion on a decision (recognition score) level is investigated. Finally, the performance of the proposed fusion approach is evaluated on several existing databases.

## **2. Three dimensional face recognition**

The three-dimensional (3D) face recognition is a natural extension of the classical two-dimensional approach. Contrary to 2D face recognition, additional possibilities for the recognition are available, due to the added dimension. Another advantage, for example, is a more robust system, in terms of pose variations. An overview of the advantages of the biometric system, based on a 3D face, is shown in Table 1.

| Advantage | Description |
|---|---|
| Pose variation robustness | Due to the 3D form of the data, the face can be easily rotated into a predefined position. |
| Lighting condition robustness | Many 3D scanners work in infra-red spectra or emit their own light, so inappropriate lighting conditions do not affect recognition performance. |
| Out-of-the-box liveness detection | It is much more difficult to spoof fake data on a 3D sensor. While in 2D face recognition, simple systems may be fooled by a photograph or video, it is much more difficult to create an authentic 3D face model. |

**Table 1.** Advantages of 3D face recognition.

The wide range of applications of 3D face recognition systems is limited by a high acquisition cost of special scanning devices. Moreover, a 3D face is targeted more at access control systems, rather than surveillance applications, due to the limited optimal distance range between the scanned subject and the sensor.

#### **2.1. Acquisition of 3D data**

Most facial 3D scanners use structured light in order to obtain the three-dimensional shape of a face. Structured light scanners project a certain light pattern onto the object's surface, which is simultaneously captured by a camera from a different angle. The exact surface is then computed from the projected light pattern distortion, caused by the surface shape. The most common structured light pattern in 3D scanning devices consists of many narrow stripes lying side by side. Other methods, either using a different pattern, or one without the structured light, can also be used [4]; however, they are not common in biometric systems.

The pattern can be projected using the visible or infra-red light spectrum. An advantage of the infrared light is its non-disturbing effect on the user's eyes. On the other hand, it is more difficult to segment the image and distinguish between the neighboring stripes properly. Therefore, many methods of acquiring the 3D surface use a visible light and color camera. The description of the method, where many color stripes are used, is given in [5]. The authors use the De Bruijn sequence there (see Figure 2), which consists of seven colors, in order to minimize the misclassification between the projected lines and the lines in the image captured by the camera.

**Figure 2.** The De Bruijn color sequence [5].

The algorithm for surface reconstruction is composed of several steps. In the first step, two images are taken. In the first one (*IM pattern*), the object is illuminated by structured light, whereas in the second one (*IM clean*) unstructured light is used. Next, the projected light is extracted from the background by subtracting the two images:

$$\text{IM}_{\text{extracted}} = \text{IM}_{\text{pattern}} - \text{IM}_{\text{clean}} \tag{1}$$
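A minimal sketch of this background-subtraction step follows, assuming 8-bit grayscale images as numpy arrays and an illustrative noise threshold; it is not the implementation of [5].

```python
import numpy as np

def extract_pattern(im_pattern, im_clean, thresh=10):
    """Isolate the projected stripes by subtracting the unstructured-light
    image from the structured-light image (equation 1), then suppress
    residual sensor noise with a small threshold (illustrative value)."""
    extracted = im_pattern.astype(np.int16) - im_clean.astype(np.int16)
    extracted[extracted < thresh] = 0
    return extracted.astype(np.uint8)

# Toy usage with synthetic 4x4 "images".
clean = np.full((4, 4), 100, dtype=np.uint8)
pattern = clean.copy()
pattern[1] = 180   # one bright projected stripe
print(extract_pattern(pattern, clean))
```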


The pattern in *IM extracted* is matched with the original pattern image. In the last step, the depth information of the points lying on the surface is calculated by the trigonometry principle. In order to calculate the exact depths properly, the precise positions of the camera and projector, including their orientation, need to be known. They can be measured, or calculated by the calibration of both devices.

An example of a 3D scanner (commercial solution) is the Minolta Vivid Laser 3D scanner. The light reflected by the object is acquired by the CCD camera. Then, the final model is calculated, using the standard triangulation method. For instance, the scanner was used to collect models from the FRGC database [20].

**Figure 3.** Examples of acquired 3D face models, using the Artec 3D scanner.

Another example is the Artec 3D scanner [6], which has a flash bulb and camera. The bulb flashes a light pattern onto an object, and the CCD camera records the created image. The distortion pattern is then transferred to the 3D image, using Artec software. The advantage of the scanner is its ability to merge several models (pattern images) belonging to the same object. When models are taken from different angles, the overall surface model is significantly accurate, and possible gaps in the surface are minimized. On the other hand, the surface of facial hair, or shiny materials, such as glasses, is hard to reconstruct because of a highly distorted light pattern (see Figure 3).

#### **2.2. 3D face preprocessing**


A key part of every biometric system is the preprocessing of input data. In the 3D face field, this task involves primarily the alignment of the face into a predefined position. In this section, several possible approaches to face alignment will be described. In order to fulfill such a task, the important landmarks are located first. Detecting the facial landmarks from three-dimensional data cannot be performed using the same algorithms as in the case of two-dimensional data. This is mainly because two-dimensional landmark detection is based on analyzing the color space of the input face picture, which is not usually present in raw three-dimensional data. However, if the texture data is available, the following landmark detection methods, based on the pure 3D model, may be skipped.

The location of the tip of the nose is a fundamental part of preprocessing in many three-dimensional facial recognition methods [7][8][9][15]. Segundo et al. [10] proposed an algorithm for nose tip localization, consisting of two stages. First, the *y*-coordinate is found, then an appropriate *x*-coordinate is assigned. To find the *y*-coordinate, two vertical *y*-projections of the face are computed – the profile and median curves. The profile curve is determined by the maximum depth value in each row, while the median curve is defined by the median depth value of every set of points with the same *y*-coordinate. A curve that represents the difference between the profile and median curves is created. The maximum of this difference curve along the *y*-axis is the *y*-coordinate of the nose. The *x*-coordinate of the nose tip is located as follows: along the horizontal line that intersects the *y*-coordinate of the nose, the density of peak points is calculated; the point with the highest peak density is the final location of the nose tip (see Figure 4).
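A minimal numpy sketch of the first stage of this search follows, assuming the face is given as a dense range image `depth` with rows indexed by *y*; the peak-density second stage is omitted for brevity, and the synthetic data is illustrative.

```python
import numpy as np

def nose_tip_y(depth):
    """First stage of the Segundo et al. [10] search: the y-coordinate where
    the per-row maximum depth (profile curve) most exceeds the per-row
    median depth (median curve). Background pixels may be NaN."""
    profile = np.nanmax(depth, axis=1)     # profile curve: max depth per row
    median = np.nanmedian(depth, axis=1)   # median curve: median depth per row
    return int(np.nanargmax(profile - median))

# Toy usage: a synthetic bump (the "nose") centered at row 30 of a 64x64 image.
yy, xx = np.mgrid[0:64, 0:64]
depth = 10.0 * np.exp(-((yy - 30) ** 2 + (xx - 32) ** 2) / 20.0)
print(nose_tip_y(depth))  # expected: 30
```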

In order to classify the points on the surface as peaks, curvature analysis is performed. The curvature at that specific point denotes how much the surface diverges from being flat. The sign of the curvature *k* indicates the direction in which the unit tangent vector rotates as a function of the parameter along the curve. If the unit tangent rotates counterclockwise, then *k* >0. Otherwise, *k* <0.

To move from a 2D curve to a 3D surface, two principal (always mutually orthogonal) curvatures *k*1 and *k*2 are calculated at each point. Using these principal curvatures, two important measures are deduced: the Gaussian curvature *K* and the mean curvature *H* [11]:

$$K = k_1 k_2 \tag{2}$$

$$H = (k_1 + k_2) / 2 \tag{3}$$
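These curvatures can also be computed directly from a range image $z(y, x)$ using the standard Monge-patch formulas with finite differences. The following generic sketch illustrates that computation; it is not the specific implementation of [10] or [11].

```python
import numpy as np

def gaussian_mean_curvature(z):
    """Gaussian (K) and mean (H) curvature of a surface given as a range
    image z(y, x), via the standard Monge-patch formulas:
    K = (zxx*zyy - zxy^2) / (1 + zx^2 + zy^2)^2
    H = ((1+zx^2)*zyy - 2*zx*zy*zxy + (1+zy^2)*zxx) / (2*(1+zx^2+zy^2)^1.5)"""
    zy, zx = np.gradient(z)
    zyy, zyx = np.gradient(zy)
    _, zxx = np.gradient(zx)
    zxy = zyx
    g = 1 + zx ** 2 + zy ** 2
    K = (zxx * zyy - zxy ** 2) / g ** 2
    H = ((1 + zx ** 2) * zyy - 2 * zx * zy * zxy + (1 + zy ** 2) * zxx) / (2 * g ** 1.5)
    return K, H

# Toy usage: a sphere cap of radius r has K = 1/r^2 (and |H| = 1/r); check
# near the apex of the patch.
yy, xx = np.mgrid[-10:11, -10:11].astype(float)
r = 100.0
z = np.sqrt(r ** 2 - xx ** 2 - yy ** 2)
K, H = gaussian_mean_curvature(z)
print(K[10, 10], 1 / r ** 2)  # roughly equal at the apex
```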


Classification of the surface points based on the signs of the Gaussian and mean curvatures is presented in Table 2.

|           | *K* < 0       | *K* = 0 | *K* > 0 |
|-----------|---------------|---------|---------|
| *H* < 0   | saddle ridge  | ridge   | peak    |
| *H* = 0   | minimal       | flat    | (none)  |
| *H* > 0   | saddle valley | valley  | pit     |

**Table 2.** Classification of points on a 3D surface, based on the signs of the Gaussian (*K*) and mean (*H*) curvatures.
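Table 2 translates directly into a lookup keyed by the two curvature signs. A minimal sketch follows, assuming an epsilon band around zero decides when a curvature counts as zero; the threshold value is an arbitrary placeholder.

```python
def hk_classify(k1, k2, eps=1e-4):
    """Label a surface point from its principal curvatures k1 and k2,
    following the sign table of Gaussian (K) and mean (H) curvature.
    `eps` is an arbitrary zero-band threshold."""
    K = k1 * k2            # Gaussian curvature, Eq. (2)
    H = (k1 + k2) / 2.0    # mean curvature, Eq. (3)
    sK = 0 if abs(K) < eps else (1 if K > 0 else -1)
    sH = 0 if abs(H) < eps else (1 if H > 0 else -1)
    table = {(-1, -1): "saddle ridge", (0, -1): "ridge",  (1, -1): "peak",
             (-1, 0): "minimal",       (0, 0): "flat",    (1, 0): "(none)",
             (-1, 1): "saddle valley", (0, 1): "valley",  (1, 1): "pit"}
    return table[(sK, sH)]   # keys are (sign of K, sign of H)
```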

### **2.3. Overview of methods**

#### *2.3.1. Adaptation of 2D face recognition methods*

The majority of widespread face recognition methods are holistic projection methods. These methods take an input image consisting of *r* rows and *c* columns and transform it into a column vector: the rows of the image are concatenated into one single column, so the pixel intensities of the input image directly represent the values of the individual components of the resulting vector.

#### **Projection methods**

A common attribute of projection methods is the creation of a data distribution model and a projection matrix that transforms an input vector *v* ∈ ℝ<sup>*r*·*c*</sup> into some lower-dimensional space. In this section, the following methods will be described:

**•** Principal component analysis (PCA)

**•** Linear discriminant analysis (LDA)

**•** Independent component analysis (ICA)


*Principal component analysis* (PCA) was first introduced by Karl Pearson and covers mathematical methods that reduce the number of dimensions of a given multi-dimensional space. The dimensionality reduction is based on the data distribution. The first principal component describes the data best in a minimum-squared-error sense; each further component describes as much of the remaining variability as possible.

The *eigenface* method is an example of a PCA application. It is a holistic face recognition method, which takes grayscale photographs of people, normalized with respect to size and resolution, and interprets the images as vectors. The method was introduced by M. Turk and A. Pentland in 1991 [12].

*Linear discriminant analysis* (LDA), introduced by Ronald Aylmer Fisher, is an example of supervised learning: class membership (the identity of the data subject) is taken into account during learning. LDA seeks vectors that provide the best discrimination between classes after the projection.

The *Fisherface* method is a combination of principal component analysis and linear discriminant analysis. PCA is used to compute the face subspace in which the variance is maximized, while LDA takes advantage of intra-class information. The method was introduced by Belhumeur et al. [13].

Another data projection method is *independent component analysis* (ICA). Contrary to PCA, which seeks the dimensions in which the data vary the most, ICA looks for the transformation of the input data that maximizes non-gaussianity. A frequently used algorithm that computes independent components is the FastICA algorithm [14].

#### **Using projection methods for a 3D face**

The adaptation of projection methods for a 3D face is usually based on the transformation of input 3D scans into range images [15]. Each vertex of the 3D model is projected onto a plane, where the brightness of a pixel corresponds to the *z*-coordinate of the corresponding point in the input scan. An example of an input range image, and its decomposition in a PCA subspace consisting of 5 eigenvectors, is shown in Figure 5. The projection coefficients directly form the resulting feature vector.

**Figure 5.** An input range image and its decomposition in PCA subspace consisting of 5 eigenvectors.
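A range image of this kind can be produced with a simple z-buffer over the vertex list. The sketch below is a minimal illustration under the assumption that the scan is an (n, 3) array with larger *z* meaning closer to the sensor; the helper names and the PCA quantities (`mean`, `eigvecs`) are hypothetical training artifacts, not part of [15].

```python
import numpy as np

def to_range_image(vertices, size=64):
    """Orthographically project a 3D scan onto the x-y plane; pixel
    brightness encodes the z-coordinate (a simple z-buffer)."""
    v = vertices - vertices.min(axis=0)            # shift into positive octant
    v[:, :2] /= v[:, :2].max()                     # normalize x and y
    cols = np.minimum((v[:, 0] * (size - 1)).astype(int), size - 1)
    rows = np.minimum((v[:, 1] * (size - 1)).astype(int), size - 1)
    img = np.zeros((size, size))
    np.maximum.at(img, (rows, cols), v[:, 2])      # keep closest point per pixel
    return img

def pca_features(img, mean, eigvecs):
    """Project a vectorized range image into a learned PCA subspace;
    `mean` and `eigvecs` (columns = eigenvectors) come from training."""
    return eigvecs.T @ (img.ravel() - mean)        # projection coefficients
```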

The face recognition method proposed by Pan et al. [9] maps the face surface onto a planar circle. First, the nose tip is located and a region of interest (ROI) is chosen: the sphere centered at the nose tip. After that, the face surface within the ROI is selected and mapped onto the planar circle. An error function *E* measures the distortion between the original surface and the plane, and the transformation to the planar circle is performed so that *E* is minimal. Heseltine [15] shows that the application of certain image processing techniques to the range image has a positive impact on recognition performance.

#### *2.3.2. Recognition methods specific to 3D face*

So far, methods that emerged as extensions of classical 2D face recognition have been mentioned. In this section, an overview of some purely 3D face recognition methods is provided.


#### **Direct comparison using the hybrid ICP algorithm**

Lu et al. [7] proposed a method that compares a face scan to a 3D model stored in a database. The method consists of three stages. First, landmarks are located; Lu uses the nose tip, the inside of one eye, and the outside of the same eye, and the localization is based on curvature analysis of the scanned face. In the second step, the three points obtained in the previous step are used for coarse alignment with the 3D model stored in the database: a rigid transformation is computed from the three pairs of corresponding points.

A fine registration process, the final step, uses the Iterative Closest Point (ICP) algorithm. The root-mean-square distance minimized by the ICP algorithm is used as the comparison score.
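The two registration stages map onto two small routines: a closed-form rigid transform for the three landmark pairs, and a nearest-neighbor iteration for the fine step. This is a generic sketch (Kabsch algorithm plus one ICP iteration), not Lu's exact hybrid pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping point set
    src onto dst (Kabsch algorithm); used here for coarse alignment
    from the three corresponding landmarks."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp_step(scan, model):
    """One fine-registration iteration: match each scan point to its
    nearest model point, re-estimate the rigid transform, and report
    the RMS distance that serves as the comparison score."""
    dists, idx = cKDTree(model).query(scan)
    R, t = rigid_transform(scan, model[idx])
    return scan @ R.T + t, np.sqrt(np.mean(dists ** 2))
```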

#### **Recognition using histogram-based features**

The algorithm introduced by Zhou et al. [16] is able to deal with small variations caused by facial expressions, noisy data, and spikes on three-dimensional scans. After the localization of the nose, the face is aligned such that the nose tip is situated at the origin of the coordinates, and the surface is converted to a range image. Afterwards, a rectangular area around the nose is selected (the *region of interest*, ROI). The rectangle is divided into *N* equal stripes, where each stripe *n* contains *S*<sub>*n*</sub> points. The maximal *Z*<sub>*n*,max</sub> and minimal *Z*<sub>*n*,min</sub> *z*-coordinates within each stripe are calculated, and the *z*-coordinate space is divided into *K* equal-width bins. With the use of the *K* bins, a histogram of the *z*-coordinates of the points forming the scan is calculated in each stripe. This yields a feature vector consisting of *N* ⋅ *K* components. An example of an input range image, and a graphical representation of the corresponding feature vector, is shown in Figure 6.

**Figure 6.** An input range image and its corresponding histogram template, using 9 stripes and 5 bins in each stripe.
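A minimal sketch of the stripe-histogram template follows. It assumes the ROI is already a range image with NaN marking pixels that hold no surface point; normalizing each histogram by the stripe's point count is an assumption, as [16] does not fix that detail here.

```python
import numpy as np

def stripe_histogram_features(roi, n_stripes=9, n_bins=5):
    """Histogram template in the spirit of Zhou et al.: the depth ROI is
    cut into horizontal stripes, and a z-histogram spanning each
    stripe's own z-range is computed, giving N*K features."""
    feats = []
    for stripe in np.array_split(roi, n_stripes, axis=0):
        z = stripe[~np.isnan(stripe)]
        if z.size == 0:                        # empty stripe: all-zero bins
            feats.extend([0.0] * n_bins)
            continue
        hist, _ = np.histogram(z, bins=n_bins, range=(z.min(), z.max()))
        feats.extend(hist / z.size)            # normalize by points per stripe
    return np.asarray(feats)                   # length n_stripes * n_bins
```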

#### **Recognition based on facial curves**


In recent years, a family of 3D face recognition methods based on the comparison of facial curves has emerged. In these methods, the nose tip is located first; after that, a set of closed curves around the nose is created and the features are extracted.

**Figure 7.** Iso-depth (a) and iso-geodesic (b) curves on the face surface [15].

In [18], recognition based on iso-depth and iso-geodesic curves is proposed. An iso-depth curve is extracted from the intersection of the face surface with a plane perpendicular to the *z*-axis (see Figure 7(a)). An iso-geodesic curve is the set of all points on the surface that have the same geodesic distance from a given point (see Figure 7(b)); the geodesic distance between two points on the surface is the generalization of distance to a curved surface.

There is one very important attribute of iso-geodesic curves: contrary to iso-depth curves, the iso-geodesic curves around a given point are invariant to translation and rotation. This means that no pose normalization of the face is needed in order to deploy a face recognition algorithm based strictly on iso-geodesic curves. However, precise localization of the nose tip is still a crucial part of the recognition pipeline.

Several shape descriptors are used for feature extraction in [18]. A set of 5 simple shape descriptors (convexity, ratio of principal axes, compactness, circular variance, and elliptical variance) is used. Moreover, the Euclidean distance between the curve center and points on the curve is sampled at 120 points and projected using LDA in order to reduce the dimensionality of the feature vector. Three curves are extracted for each face.
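Two of the listed descriptors are easy to make concrete for a closed curve sampled as an (n, 2) point array. The formulas below (shoelace area, 4πA/P² compactness, normalized variance of center distances) are common definitions and are assumptions rather than the exact ones fixed in [18].

```python
import numpy as np

def curve_descriptors(pts):
    """Circular variance and compactness of a sampled closed curve."""
    center = pts.mean(axis=0)
    r = np.linalg.norm(pts - center, axis=1)       # center-to-curve distances
    circular_variance = np.var(r) / np.mean(r) ** 2

    d = np.roll(pts, -1, axis=0) - pts             # consecutive edge vectors
    perimeter = np.linalg.norm(d, axis=1).sum()
    x, y = pts[:, 0], pts[:, 1]                    # shoelace formula for area
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    compactness = 4 * np.pi * area / perimeter ** 2  # 1.0 for a circle
    return circular_variance, compactness
```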

The 3D face recognition algorithm proposed in [19] uses iso-geodesic stripes, and the surface data are encoded in the form of a graph. The nodes of the graph are the extracted stripes, and the directed edges are labeled with *3D Weighted Walkthroughs*. The walkthrough from point *a* = (*x*<sub>*a*</sub>, *y*<sub>*a*</sub>) to *b* = (*x*<sub>*b*</sub>, *y*<sub>*b*</sub>) is illustrated in Figure 8. It is a pair (*i*, *j*) that describes the signs of the mutual positions projected on both axes. For example, if *x*<sub>*a*</sub> < *x*<sub>*b*</sub> ∧ *y*<sub>*a*</sub> > *y*<sub>*b*</sub> holds, then (*i*, *j*) = (1, −1). For more information about the generalization of walkthroughs from points to sets of points and to 3D space, see [19].
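For two single points, the walkthrough reduces to a pair of sign comparisons, as the following sketch shows.

```python
import numpy as np

def walkthrough(a, b):
    """Sign pair (i, j) of the walkthrough from point a to point b.
    E.g. a = (0, 2), b = (1, 1) gives (1, -1), since xa < xb and ya > yb."""
    return int(np.sign(b[0] - a[0])), int(np.sign(b[1] - a[1]))
```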


**Figure 8.** The walkthrough (*i*, *j*) = (1, −1) from *a* to *b*.

#### **Recognition based on the anatomical features**

Detection of important landmarks, described in section 2.2, may be extended to other points and curves on the face. The mutual positions of the detected points, the distances between curves, their mutual correlations, and the curvatures at specific points may be extracted. These numerical values directly form a feature vector; thus, two faces may be instantly compared using an arbitrary distance function between their feature vectors.

In [17], 8 facial landmarks and 4 curves were extracted from each input scan (see Figure 9), forming over sixty features. These features may be divided into the categories listed in Table 3.

**Figure 9.** Detected facial landmarks (marked with white circles) and four facial curves: the vertical *profile curve*, the horizontal *eye curve*, the horizontal *nose curve* (which intersects the tip of the nose), and the horizontal *middle curve*, lying directly between the eye and nose curves.


| Category | Description | Number of features |
|---|---|---|
| Basic | Distances between selected landmarks | 7 |
| Profile curve | Utilization of several distances between the profile curve, extracted from the input scan, and the corresponding curve from the mean (average) face | 4 |
| Eyes curve | Distances between the eyes curve from an input scan and the corresponding eyes curve from the average face model | 4 |
| Nose curve | Distances between the nose curve from an input scan and the corresponding nose curve from the average face model | 4 |
| Middle curve | Distances between the middle curve from an input scan and the corresponding middle curve from the average face model | 4 |
| Curvatures | Horizontal and vertical curvatures on selected facial landmarks | 6 |
| 1st derivation of curves | Distances between the 1st derivation of facial curves and the corresponding curves from the average face model | 16 |
| 2nd derivation of curves | Distances between the 2nd derivation of facial curves and the corresponding curves from the average face model | 16 |
| Σ | | 61 |

**Table 3.** Categories of anatomical 3D face features.


A fundamental part of recognition based on anatomical features is the selection of feature vector components. This subset selection boosts components with good discriminative ability and decreases the influence of features with low discriminative ability. There are several possibilities for how to fulfill this task:

**•** Linear discriminant analysis. The input feature space, consisting of 61 dimensions, is linearly projected to a subspace with fewer dimensions, such that the intra-class variability is reduced and the inter-class variability is maximized.


**•** Subset selection and weighting. For the selection and weighting based on the discriminative potential, see section 4.2.

#### *2.3.3. State-of-the-art*

The developed face recognition system should be compared with other current face recognition systems available on the market. In 2006, the National Institute of Standards and Technology (NIST) in the USA conducted the Face Recognition Vendor Test (FRVT) [20]. It has been, thus far, the latest in a series of large-scale independent evaluations; previous evaluations in the series were FERET, FRVT 2000, and FRVT 2002. The primary goal of FRVT 2006 was to measure the progress of prototype systems/algorithms and commercial face recognition systems since FRVT 2002. FRVT 2006 evaluated performance on high-resolution still images (5 to 6 megapixels) and 3D facial scans.

A comprehensive report of the achieved results and the evaluation methodology used is given in [21]. The progress achieved over the preceding years is depicted in Figure 10. The results show the false rejection rate achieved at a false acceptance rate of 0.001 for the best face recognition algorithms. This means that, if we admit that 0.1% of impostors are falsely accepted as genuine persons, only 1% of genuine users are incorrectly rejected. The best 3D face recognition algorithm evaluated in FRVT 2006 was Viisage, from the commercial portion of the participating organizations [21].

**Figure 10.** Reduction in error rate for state-of-the-art face recognition algorithms as documented through FERET, FRVT 2002, and FRVT 2006 evaluations.

The upcoming Face Recognition Vendor Test 2012 continues the series of evaluations of face recognition systems. The primary goal of FRVT 2012 is to measure the advancement in the capabilities of prototype systems and algorithms from the commercial and academic communities.

## **3. Thermal face recognition**


Face recognition based on thermal images has minor importance in comparison to visible-light-spectrum recognition. Nevertheless, in applications such as liveness detection or fever scanning, thermal face recognition is used as a standalone module or as part of a multi-modal biometric system.

Thermal images are remarkably invariant to lighting conditions. On the other hand, intra-class variability is very high. Many aspects contribute to this negative property, such as different head poses, facial expressions, changes of hair or facial hair, the environment temperature, current health conditions, and even emotions.

### **3.1. Thermal-face acquisition**

Every object whose temperature is not absolute zero emits so-called "thermal radiation". Most of the thermal radiation is emitted in the range of 3 to 14 µm, not visible to the human eye. The radiation consists primarily of self-emitted radiation from vibrational and rotational quantum energy-level transitions in molecules and, secondarily, of radiation reflected from other sources [26]. The intensity and the wavelength of the energy emitted by an object are influenced by its temperature. If the object is colder than 50°C, which is the case for a human being, its radiation lies completely in the IR spectrum.

#### *3.1.1. Temperature measurements*

The radiation properties of objects are usually described in relation to a perfect blackbody (the perfect emitter) [27]. The resulting coefficient, the emissivity, lies between 0 and 1, where 0 means no emission and 1 means perfect emissivity. For instance, the emissivity of human skin is 0.92. The radiation reflected from an object is assumed to be much smaller than the emitted radiation; it is therefore neglected during imaging.

The atmosphere present between an object and a thermal camera influences the radiation due to absorption by gases and particles. The amount of attenuation depends heavily on the wavelength. The atmosphere usually transmits visible light very well; however, fog, clouds, rain, and snow can prevent the camera from seeing distant objects. The same principle applies to infrared radiation.

The so-called atmospheric windows (with only little attenuation), which lie between 2 and 5 µm (the mid-wave window) and 7.5–13.5 µm (the long-wave window), have to be used for thermographic measurement. Atmospheric attenuation prevents an object's total radiation from reaching the camera. The attenuation has to be corrected for in order to obtain the true temperature; otherwise, the measured temperature drops with increasing distance.


#### *3.1.2. Thermal detectors*

The majority of IR cameras have a microbolometer-type detector, mainly because of cost considerations. Microbolometers respond to radiant energy in a way that causes a change of state in the bulk material (the bolometer effect) [27]. Generally, they do not require cooling, which allows compact camera designs (see Figure 11) to be relatively low cost. Apart from a lower sensitivity to radiation, another substantial disadvantage of such cameras is their relatively slow reaction time, with a delay of dozens of milliseconds. Nevertheless, such parameters are sufficient for biometric purposes.

**Figure 11.** FLIR ThermaCAM EX300 [28].

For more demanding applications, quantum detectors can be used. They operate on the basis of an intrinsic photoelectric effect [27]. By cooling them to cryogenic temperatures, the detectors can be made very sensitive to the infrared radiation focused on them. They also react very quickly to changes in IR levels (i.e., temperatures), with a constant response time on the order of 1 µs. However, their cost disqualifies their usage in biometric applications these days.

### **3.2. Face and facial landmarks detection**

Head detection in the visible spectrum is a very challenging task. Many aspects make the detection difficult, such as a non-homogeneous background and various skin colors. Since detection is a necessary part of the recognition process, much effort has been invested in dealing with this problem. Nowadays, one of the most commonly used methods is based on the Viola-Jones detector, often combined with additional filtering based on a skin color model.

In contrast to the visible spectrum, detection of the skin in thermal images is easier. The skin temperature varies within a certain range; moreover, it differs remarkably from the temperature of the environment. That is why techniques based on background and foreground separation are widely used. The first step of skin detection is usually based on thresholding; a convenient threshold is in most scenarios computed using the Otsu algorithm [34]. The binary images usually need further correction, consisting of hole removal and contour smoothing [33]. Another approach detects the skin using Bayesian segmentation [29].
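With OpenCV, the thresholding-plus-cleanup pipeline can be sketched as follows. The morphological kernel size is an arbitrary placeholder, and the 8-bit rescaling is only needed because cv2.threshold with the Otsu flag expects uint8 input.

```python
import cv2
import numpy as np

def segment_skin(thermal):
    """Foreground/background separation of a single-channel thermal frame:
    Otsu threshold, then hole removal and contour smoothing."""
    img8 = cv2.normalize(thermal, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(img8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # smooth contours
    return mask
```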

The next step of detection is the localization of important facial landmarks such as the eyes, nose, mouth, and brows. The Viola-Jones detector can be used to detect some of them. Friedrich and Yeshurun propose eyebrow detection by analysis of local maxima in a vertical histogram (see Figure 12) [33].

**Figure 12.** Eyebrow detection by [33]: original image (left); edge enhancement, binarized image, and its vertical histogram – the sum of intensities in each row (right).

### **3.3. Head normalization**


If a comparison algorithm were performed on raw thermal face images, without any processing, we would get unacceptable results. This is because thermal faces belong to the biometrics with high intra-class variability, which makes thermal face recognition one of the most challenging modalities in terms of reducing intra-class variability while keeping or increasing the inter-class variability of the sensory data. The normalization phase of the recognition process tries to deal with all these aspects and decrease the intra-class variability as much as possible, while preserving the inter-class variability.

**Figure 13.** Thermal images of 3 different people. Output of all normalization methods is demonstrated by processing these raw images.

The proposed normalization consists of pose, intensity, and region-of-stability normalization. All normalization methods are described in the following sections, and their output is visualized on the sample thermal images in Figure 13.

#### *3.3.1. Pose normalization*

Biometric systems based on face recognition do not strictly demand that the head be positioned in front of the camera. The task of pose (geometric) normalization is to transform the captured face to a default position (front view without any rotation). Fulfilling this task is one of the biggest challenges for 2D face recognition technologies. It is obvious that a perfect solution cannot be achieved by 2D technology; however, the variance caused by different positions should be minimized as much as possible.

Geometric normalization often needs information about the position of some important points within the human face. These points are usually image coordinates of the eyes, nose and mouth. If they are located correctly, the image can be aligned to a default template.


#### **2D affine transformation**

Basic methods of geometric normalization are based on affine transformation, which is usually realized using a matrix *T*. Each point *p* = (*x*, *y*) of the original image *I* is converted to homogeneous coordinates *p*<sub>*h*</sub> = (*x*, *y*, 1). All these points are multiplied by the matrix *T* to get the new coordinates *p*′.

Methods of geometric normalization vary in the complexity of the transformation matrix computation. The *general affine transformation* maps three different facial landmarks of the original image *I* to their expected positions within the default template. The transformation matrix coefficients are computed by solving a set of linear algebraic equations [23].
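With three landmark correspondences the system has exactly one solution, so the matrix can be recovered with a single linear solve. A minimal sketch (equivalent in effect to OpenCV's cv2.getAffineTransform):

```python
import numpy as np

def affine_from_landmarks(src, dst):
    """Solve for the 2x3 affine matrix T mapping three facial landmarks
    `src` onto their template positions `dst` (both (3, 2) arrays), from
    the linear system  T @ [x, y, 1]^T = [x', y']^T."""
    A = np.hstack([src, np.ones((3, 1))])     # homogeneous coordinates
    return np.linalg.solve(A, dst).T          # (2, 3) transformation matrix

def warp_points(T, pts):
    """Apply the affine transformation to an (n, 2) array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return ph @ T.T
```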

#### **3D projection**

Human heads have an irregular, ellipsoid-like 3D shape; therefore, the 2D warping method works well when the head is scaled or rotated in the image plane. In the case of any other transformation, the normalized face is deformed (see Figure 15).

The proposed 3D projection method works with an average 3D model of the human head. A 3D affine transformation, consisting of translation, rotation, and scaling, can be applied to each vertex, and the transformed model can then be perspectively projected onto a 2D plane. This process is well known from 3D computer graphics and visualization.

Alignment of the model to the image *I* is required; the goal is to find a transformation of the model such that, after the transformation, its orientation matches each important facial landmark. The texture of the input image *I* is projected onto the aligned model. Then, the model is transformed (rotated and scaled) to its default position and, finally, the texture from the model is re-projected onto the resulting image (see Figure 14).

**Figure 14.** Visualization of the 3D projection method.

This kind of normalization considers the 3D shape of the human face. However, the static (unchangeable) model is the biggest drawback of this method. More advanced techniques solve this problem by using the *3D Morphable Face Model* [22].

**Figure 15.** Pose normalization methods overview: 2D affine transformation (first row) and the 3D projection method (second row).


#### *3.3.2. Intensity normalization*


A comparison of thermal images on an absolute scale does not usually lead to the best results. The absolute temperature of the human face varies with the environmental temperature and with the physical activity and emotional state of the person. Some testing databases do not even contain information on how to map pixel intensity to temperature. Therefore, intensity normalization is necessary. It can be accomplished via global or local histogram equalization of noticeable facial regions (see Figure 16).

**Figure 16.** Intensity normalization methods overview: min-max (first row), global equalization (second row), and local equalization (third row).
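The three variants shown in Figure 16 can be sketched as follows; using CLAHE for the local equalization is an assumption, chosen as one common realization of per-region equalization.

```python
import cv2
import numpy as np

def minmax_norm(face):
    """Stretch the face crop's intensities to the full [0, 1] range."""
    face = face.astype(np.float64)
    return (face - face.min()) / (face.max() - face.min())

def global_equalization(face8):
    """Global histogram equalization of an 8-bit face crop."""
    return cv2.equalizeHist(face8)

def local_equalization(face8, grid=8):
    """Local (adaptive) equalization of an 8-bit face crop via CLAHE."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(grid, grid))
    return clahe.apply(face8)
```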

#### *3.3.3. Region of stability normalization*

The region-of-stability normalization takes into account the shape of the face and the variability of temperature emission within different face parts. The output image of the previous normalizations has a rectangular shape.


**Figure 17.** Overview of methods used for region-of-stability normalization. The masks (first column) and normalization responses are displayed for the following methods: Elliptical (top row), Smooth-Elliptical, Weighted-Smooth-Elliptical, Discriminative potential (bottom row).

The main purpose of this normalization is to mark the area where the most important face data are located, in terms of unique characteristics. The normalization is done by multiplying the original image by some mask (see Figure 17):

**•** Elliptical mask: The human face has an approximately elliptical shape. A binary elliptical mask is therefore the simplest and most practical solution.

**•** Smooth-elliptical mask: The weighted mask does not have a step change on the edge between the expected face points and background points.

**•** Smooth-weighted-elliptical mask: Practical experiments show that the human nose is the most unstable feature, in terms of temperature emissivity. Therefore, the final mask has a lower weight within the expected nasal area position.

**•** Discriminative potential mask: Another possibility to mark regions of stability is training on part of a face database. The mask is obtained by the discriminative potential method described in section 4.2.
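A minimal sketch of the first two masks follows; the linear edge ramp of the smooth variant is an assumption, since the text only requires that there be no step change at the boundary.

```python
import numpy as np

def elliptical_mask(h, w, smooth=0.0):
    """Binary (or, with smooth > 0, soft-edged) elliptical region-of-
    stability mask for an h x w normalized face image."""
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx, ry, rx = (h - 1) / 2, (w - 1) / 2, h / 2, w / 2
    d = np.sqrt(((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2)
    if smooth == 0.0:
        return (d <= 1.0).astype(np.float64)       # hard binary ellipse
    return np.clip((1.0 + smooth - d) / smooth, 0.0, 1.0)

# applying the normalization: masked = face * elliptical_mask(*face.shape)
```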


### **3.4. Feature extraction on thermal images**

Several comparative studies of thermal face recognition approaches have been published in recent years. The first recognition algorithms were appearance-based. These methods treat the normalized image as a vector of numbers, which is projected into a low-dimensional subspace where the separation between impostors and genuine users can be computed efficiently and with higher accuracy. The commonly used methods are PCA, LDA, and ICA, which are described in more detail in section 4.2.

While appearance-based methods belong to the global-matching methods, there are also local-matching methods, which compare only certain parts of an input image to achieve better performance. The LBP (*Local Binary Pattern*) method was primarily developed for texture description and recognition; nevertheless, it has been successfully used in visible and thermal face recognition [24]. It encodes the neighbors of a pixel according to their relative differences and calculates a histogram of these codes in small areas. These histograms are then combined into a feature vector. Another comparative study [25] describes other local-matching methods such as Gabor jets, SIFT (*Scale Invariant Feature Transform*), SURF (*Speeded-Up Robust Features*), and WLD (*Weber Linear Descriptor*).
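A minimal sketch of the LBP feature pipeline, using scikit-image's local_binary_pattern; the grid size and the uniform-pattern variant are assumptions, not fixed by [24].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(face, regions=8, neighbors=8, radius=1):
    """Uniform LBP codes, histogrammed over a grid of small areas and
    concatenated into one feature vector."""
    codes = local_binary_pattern(face, neighbors, radius, method="uniform")
    n_codes = neighbors + 2                  # uniform patterns + "other" bin
    feats = []
    for block_row in np.array_split(codes, regions, axis=0):
        for block in np.array_split(block_row, regions, axis=1):
            hist, _ = np.histogram(block, bins=n_codes, range=(0, n_codes))
            feats.extend(hist / block.size)  # per-block normalized histogram
    return np.asarray(feats)
```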

A different approach extracts the vascular network from thermal images, which is assumed to be unique to each individual. One of the prominent methods [29] extracts thermal minutia points (similar to fingerprint minutiae) and subsequently compares two vascular networks (see Figure 18). Another approach proposes a feature set for the thermo-face representation: the bifurcation points of the thermal pattern and the geographical and gravitational centers of the thermal face [30].

**Figure 18.** A thermal minutia point being extracted from a thinned vascular network [29].


## **4. Common processing parts for normalized, 3D, and thermal images**

For both thermal and 3D face recognition, it is very difficult to select the one method, out of numerous possibilities, that gives the best performance in every scenario. Choosing the best method is always tied to a specific database (input data). In order to address this problem, multi-algorithmic biometric fusion can be used.

In the following sections, a general multi-algorithmic biometric system is described. Both 3D and thermal face recognition require a normalized image, and since many of their characteristics are similar, we do not distinguish between the origins of the normalized image. This section describes generic algorithms for feature extraction, feature projection, and comparison, which are evaluated on normalized 3D as well as thermal images.

### **4.1. Feature extraction of the normalized image**

The feature extraction part takes the normalized image as input, and produces a feature vector as output. This feature vector is then processed in the feature projection part of the process.


Vectorization is the simplest method of feature extraction. The intensity values of the image *I* = *w* × *h* are concatenated into a single column vector. The performance of this extraction depends on the normalization method.

Since normalized images are not always convenient for direct vectorization, feature extraction using a **bank of filters** has been presented in several works. The normalized image is convolved with a bank of 2D filters, which are generated using some kernel function with different parameters (see Figure 19). The responses of the convolutions together form the final feature vector.

The **Gabor filter bank** is one of the most popular filter banks [32]. We employed the **Laguerre-Gaussian filter bank** as well, due to its good performance in the facial recognition field [31].

**Figure 19.** The Gabor (first row) and Laguere-Gaussian (second row) filter banks.
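
As an illustration of the filter-bank approach, the following sketch builds a small Gabor bank with NumPy and convolves a normalized image with each kernel. The kernel size and parameter grid (four orientations, three wavelengths, giving 12 kernels as in the example below) are illustrative assumptions, not the exact settings of our experiments.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, sigma, theta, lambd, gamma=0.5):
    """Real part of a 2D Gabor kernel with orientation theta and wavelength lambd."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / lambd)

def filter_bank_features(image, n_orientations=4, wavelengths=(4, 8, 16)):
    """Concatenate the responses of a 12-kernel Gabor bank into one large vector."""
    responses = []
    for lambd in wavelengths:
        for k in range(n_orientations):
            kernel = gabor_kernel(15, sigma=lambd / 2,
                                  theta=k * np.pi / n_orientations, lambd=lambd)
            responses.append(convolve2d(image, kernel, mode="same").ravel())
    return np.concatenate(responses)  # later reduced by PCA / LDA
```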

#### **4.2. Feature vector projection and further processing**

Statistical projection methods linearly transform the input feature vector from an input *m*-dimensional space into an *n*-dimensional space, where *n* < *m*. We utilize the following methods:

**•** Principal component analysis (PCA, Eigenfaces)

**•** PCA followed by linear discriminant analysis (LDA of PCA, Fisherfaces)

**•** PCA followed by independent component analysis (ICA of PCA)

Every projection method has a common learning parameter, which defines how much variability of the input space is captured by the PCA. This parameter controls the dimensionality of the output projection space. Let the *k* eigenvalues computed during the PCA calculation be denoted as *e*1, *e*2, …, *ek* (*e*1 > *e*2 > … > *ek*). These eigenvalues directly represent the variability in each output dimension. If we want to preserve only 98% of the variability, then only the first *l* eigenvalues and their corresponding eigenvectors are selected, such that their sum forms 98% of $\sum_{j=1}^{k} e_j$.
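
A minimal sketch of this selection rule, assuming the eigenvalues are already sorted in descending order (the 0.98 threshold mirrors the example above):

```python
import numpy as np

def select_components(eigenvalues, keep=0.98):
    """Return l, the number of leading eigenvalues whose sum reaches
    the fraction `keep` of the total variability."""
    e = np.asarray(eigenvalues, dtype=float)   # assumed sorted: e1 > e2 > ... > ek
    cumulative = np.cumsum(e) / e.sum()        # fraction of variability captured so far
    return int(np.searchsorted(cumulative, keep) + 1)
```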

There is an optional step to perform a per-feature *z*-score normalization after the projection, so that each vector *fv* is transformed into

$$fv' = \frac{fv - \overline{fv}}{\sigma}, \tag{4}$$

where $\overline{fv}$ is the mean vector and *σ* is the vector of standard deviations.
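
A direct realization of Eq. (4), estimating the mean vector and per-feature standard deviations from a training set of projected vectors (a sketch; the small epsilon guarding against zero variance is our addition):

```python
import numpy as np

def zscore_fit(train_vectors):
    """Estimate the mean vector and per-feature standard deviations from training data."""
    fv = np.asarray(train_vectors, dtype=float)   # shape: (n_samples, n_features)
    return fv.mean(axis=0), fv.std(axis=0) + 1e-12

def zscore_apply(fv, mean, sigma):
    """Transform a projected feature vector according to Eq. (4)."""
    return (np.asarray(fv, dtype=float) - mean) / sigma
```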

Optional processing, after the application of statistical projection methods, is feature weighting. Suppose that we have a set *FV* of all pairs of feature vectors, *f v<sub>j</sub>*, and their corresponding class (subject) labels, *id<sub>j</sub>*:

$$FV = \{ (id_1, fv_1), (id_2, fv_2), \dots, (id_n, fv_n) \} \tag{5}$$

The individual feature vector components, *f v<sub>j1</sub>*, *f v<sub>j2</sub>*, …, *f v<sub>jm</sub>*, of the vector *f v<sub>j</sub>* do not have the same discriminative ability. While some components may contribute positively to the overall recognition performance, others may not. We have implemented and evaluated two possible feature evaluation techniques.

The first possible solution is the application of LDA. The second option is to make the assumption that a good feature vector component has stable values across different scans of the same subject, while the mean value of the component across different subjects differs to the greatest possible extent. Let the intra-class variability of feature component *i* be denoted as *intra<sub>i</sub>*; it expresses the mean of the standard deviations of all values measured for the same subject. The inter-class variability of component *i* is denoted as *inter<sub>i</sub>*, and expresses the standard deviation of the per-subject means across different subjects. The resulting discriminative potential can therefore be expressed as follows:

$$\text{discriminative potential}_i = inter_i - intra_i \tag{6}$$
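
A sketch of this second technique under the definitions above: per-component *intra* is the mean over subjects of the within-subject standard deviations, *inter* is the standard deviation of the per-subject means, and Eq. (6) gives the raw weights. Normalizing the weights at the end is an assumption on our part.

```python
import numpy as np

def discriminative_potential(FV):
    """FV: iterable of (subject_id, feature_vector) pairs, as in Eq. (5)."""
    by_subject = {}
    for subject_id, fv in FV:
        by_subject.setdefault(subject_id, []).append(np.asarray(fv, dtype=float))
    means = np.array([np.mean(v, axis=0) for v in by_subject.values()])
    stds = np.array([np.std(v, axis=0) for v in by_subject.values()])
    intra = stds.mean(axis=0)            # mean of within-subject standard deviations
    inter = means.std(axis=0)            # standard deviation of per-subject means
    potential = inter - intra            # Eq. (6), one value per component
    return potential / np.abs(potential).sum()   # normalized weights (our assumption)
```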


The number of combinations of common feature extraction techniques and optional feature vector processing yields a large set of possible recognition methods. For example, the Gabor filter bank, consisting of 12 kernels, may be convolved with an input image; the results of the convolution are concatenated into one large column vector, which is then processed with PCA, followed by LDA. Another example is an input image processed by PCA, where the individual features in the resulting feature vector are multiplied by their corresponding normalized discriminative potential weights.

#### **4.3. Fusion using binary classifiers**

After the features are extracted, the feature vector is compared with the template from a biometric database, using some arbitrary distance function. If the distance is below a certain threshold, the person whose features were extracted is accepted as a genuine user. If we are using several different recognition algorithms, the simple threshold becomes a binary classification problem: the biometric system has to decide whether the resulting score vector *s* = (*s*1, *s*2, …, *sn*) belongs to a genuine user or an impostor. An example of a general multimodal biometric system employing score-level fusion is in Figure 20, but the same approach may be applied to a multi-algorithmic system, where the input is just one sample and more than one feature extraction and comparison method is applied.


**Figure 20.** A generic multimodal biometric system using score-level fusion.

In order to compare and fuse scores that come from different methods, normalization to a certain range has to be performed. We use the following score normalization: the score values are linearly transformed so that the genuine mean (the score obtained from comparing the same subjects) is 0, and the impostor mean (the score obtained from comparing different subjects) is 1. Note that individual scores may have negative values. This does not matter in the context of score-level fusion, since these values represent positions within the classification space, rather than distances between two feature vectors.
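
A sketch of this normalization, assuming genuine and impostor score samples are available from the training portion of the data (the function and variable names are ours):

```python
import numpy as np

def fit_score_norm(genuine_scores, impostor_scores):
    """Linear map sending the genuine mean to 0 and the impostor mean to 1."""
    g = float(np.mean(genuine_scores))
    i = float(np.mean(impostor_scores))
    return lambda s: (np.asarray(s, dtype=float) - g) / (i - g)

# Usage: norm = fit_score_norm(train_genuine, train_impostor)
#        s_normalized = norm(raw_scores)   # may contain negative values
```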

The theoretical background of general multimodal biometric fusion, especially the link between the correlation and variance of the impostor and genuine distributions of the employed recognition methods, is described in [35]. The advantage provided by score fusion relative to monomodal biometric systems is described in detail in [36].

In our fusion approach, we have implemented classification using logistic regression [37], support vector machines (SVM) with linear and sigmoidal kernels [37], and linear discriminant analysis (LDA).
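
As one concrete instance, the sketch below fuses a per-method score vector with scikit-learn's logistic regression, treating genuine/impostor as a binary label. It is a minimal sketch; the actual training protocol follows the three-way data split described in the next section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(score_vectors, labels):
    """score_vectors: (n_trials, n_methods) normalized scores; labels: 1 = genuine, 0 = impostor."""
    clf = LogisticRegression()
    clf.fit(np.asarray(score_vectors), np.asarray(labels))
    return clf

def fused_score(clf, s):
    """Posterior probability of the genuine class for a score vector s = (s1, ..., sn)."""
    return clf.predict_proba(np.asarray(s).reshape(1, -1))[0, 1]
```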

## **4.4. Experimental results**

For evaluation of our fusion approach on thermal images, we used the Equinox [38] and Notre Dame databases [39][40]. Equinox contains 243 scans of 74 subjects, while the Notre Dame database consists of 2,292 scans. Evaluation for 3D face recognition was performed on the "Spring 2004" part of the FRGC database [20], from which we selected only subjects with more than 5 scans. This provided 1,830 3D scans in total.

The evaluation scenario was as follows. We divided each database into three equal parts, with different subjects present in each part. The first portion of the data was used for training the projection methods. The second portion was intended for the optional calculation of the *z*-score normalization parameters, the feature weighting, and the training of the fusion classifier. The final part was used for evaluation.

To ensure that the particular results of the employed methods are stable and reflect the real performance, the following cross-validation process was selected. The database was randomly divided into three parts, where all parts had an equal number of subjects. This random division and subsequent evaluation was repeated *n* times, where *n* depends on the size of the database. The Equinox database was cross-validated 10 times, while the Notre Dame and FRGC databases were cross-validated 3 times. The performance of a particular method was reported as the mean value of the achieved equal error rates (EERs).
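
For reference, a common way to compute the EER reported here is to sweep the decision threshold until the false acceptance and false rejection rates cross (a generic sketch, not the exact evaluation code used in the experiments):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER for distance-like scores: genuine comparisons are expected to score lower."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine > t).mean() for t in thresholds])    # genuines rejected
    i = np.argmin(np.abs(far - frr))     # threshold where FAR and FRR cross
    return (far[i] + frr[i]) / 2
```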

#### *4.4.1. Evaluation on thermal images*


For thermal face recognition, the following techniques were fused:

**•** Local and global contrast enhancement,

**•** The Gabor and Laguerre filter banks,

**•** PCA and ICA projection,

**•** Weighting based on discriminative potential,

**•** Comparison using the cosine distance function.

From these techniques, the 10 best distinct recognition methods were selected for the final score-level fusion. Logistic regression was used for score fusion. The results are given in Table 4.

#### *4.4.2. Evaluation on 3D face scans*

For performance evaluation of the 3D face scans, the following recognition methods were used:

**•** Recognition using anatomical features. The discriminative potential weighting was applied on the resulting feature vector, consisting of 61 features. The city-block (Manhattan, *L*1) metric was employed.

**•** Recognition using histogram-based features. We use a division of the face into 10 rows and 6 columns. The individual feature vector components were weighted by their discriminative potential, and the cosine distance function was used.

**•** Shape-index images projected by the PCA, weighting based on discriminative potential, and the cosine distance function.

**•** Shape-index images projected by the PCA followed by the ICA, weighting based on discriminative potential. The cosine metric is used for the distance measurement.

**•** Cross correlation of shape-index images [7], see Figure 21.
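
To make the histogram-based variant concrete, here is a sketch that divides a shape-index (or range) image into the 10 × 6 grid mentioned above and concatenates a per-cell histogram into the feature vector; the bin count and the assumption that values are normalized to [0, 1] are illustrative choices of ours:

```python
import numpy as np

def block_histogram_features(image, rows=10, cols=6, bins=16):
    """Divide the image into rows x cols cells and concatenate per-cell histograms."""
    h, w = image.shape
    features = []
    for r in range(rows):
        for c in range(cols):
            cell = image[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            # assumes pixel values normalized to [0, 1]
            hist, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0), density=True)
            features.append(hist)
    return np.concatenate(features)
```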

**Figure 21.** An original range image and shape index visualization.

The results of the algorithm evaluation using the FRGC database are given in Table 4.


| Database | Best single method name | Best single method EER | Fusion EER | Improvement |
|---|---|---|---|---|
| Equinox | Global contrast enhancement, no filter bank, ICA, cosine distance | 2.28 | 1.06 | 53.51% |
| Notre Dame | Global contrast enhancement, no filter bank, PCA, cosine distance | 6.70 | 5.99 | 10.60% |
| FRGC | Shape index, PCA, weighting using discriminative potential, cosine distance | 4.06 | 3.88 | 4.43% |

**Table 4.** Evaluation of fusion based on logistic regression. For every fusion test, all individual components of the resulting fusion method were evaluated separately. The best component was compared with the overall fusion, and the improvement was also reported. The numbers represent the achieved EER in %.

## **5. Conclusion**

This chapter addressed a novel approach to biometric-based system design, viewed as the design of a distributed network of multiple biometric modules, or assistants. The advantages of multi-source biometric data are demonstrated using 3D and infrared facial biometrics. A particular task of such a system, namely identification using a fusion of these biometrics, is demonstrated. It is shown that the reliability of the fusion-based decision increases. Specifically, 3D face models carry additional topological information and are thus more robust compared to 2D models. Thermal data brings additional information. By deploying the best face recognition techniques for both the 3D and thermal domains, we showed, through experiments, that fusion improves the overall performance by up to 50%.

It should be noted that the important components of the processing pipeline, image processing and correct feature selection, greatly influence the decision-making (comparison). Also, the choice of a different fusion scheme may influence the results.

The identification task, combined with advanced discriminative analysis of biometric data, such as temperature and its derivatives (blood flow rate, pressure, etc.), constitutes the basis for the higher-level decision-making support called semantic biometrics [3]. Decision-making in semantic form is the basis for implementation in distributed security systems, the PASS of the next generation. In this approach, the properties of linguistic averaging are efficiently utilized for smoothing temporal errors, including errors caused by insufficiency of information, at the local and global levels of biometric systems. The concept of semantics in biometrics is linked to various disciplines; in particular, to dialogue support systems, as well as to recommender systems.

Another extension of the concept of the PASS is the Training PASS (T-PASS), which provides a training environment for the users of the system [41]. Such a system makes use of synthetic biometric data [42], automatically generated to "imitate" real data. For example, models can be generated from real acquired data, and can simulate age, accessories, and other attributes of the human face. Generation of synthetic faces, using 3D models that provide convincing facial expressions, and thermal models which have a given emotional coloring, is a function of both the PASS (to support identification by analysis through synthesis, for instance, modeling of head rotation to improve recognition of faces acquired from video) and the T-PASS (to provide virtual reality modeling for trainees).

## **Acknowledgement**


This research has been realized under the support of the following grants: "Security-Oriented Research in Information Technology" – MSM0021630528 (CZ), "Information Technology in Biomedical Engineering" – GD102/09/H083 (CZ), "Advanced secured, reliable and adaptive IT" – FIT-S-11-1 (CZ), "The IT4Innovations Centre of Excellence" – IT4I-CZ 1.05/1.1.00/02.0070 (CZ), and NATO Collaborative Linkage Grant CBP.EAP.CLG 984 "Intelligent assistance systems: multisensor processing and reliability analysis".

## **Author details**

Štěpán Mráček<sup>1</sup>, Jan Váňa<sup>1</sup>, Radim Dvořák<sup>1</sup>, Martin Drahanský<sup>1</sup> and Svetlana Yanushkevich<sup>2</sup>

1 Faculty of Information Technology, Brno University of Technology, Czech Republic

2 University of Calgary, Canada

## **References**

[1] Jain A. K., Nandakumar K., Uludag U., Lu X. Multimodal biometrics: augmenting face with other cues, In: Face Processing: Advanced Modeling and Methods, Elsevier, 2006.

[2] Yanushkevich S. N., Stoica A., Shmerko V. P. Experience of design and prototyping of a multi-biometric early warning physical access control security system (PASS) and a training system (T-PASS), In: Proc. 32nd Annual IEEE Industrial Electronics Society Conference, Paris, pp. 2347-2352, 2006.

[3] Yanushkevich S. N., Shmerko V. P., Boulanov O. R., Stoica A. Decision making support in biometric-based physical access control systems: design concept, architecture, and applications, In: N. Bourgoulis, L. Micheli-Tsanakou, K. Platantionis (Eds.), Biometrics: Theory, Methods and Applications, IEEE/Wiley Press, pp. 599-632, 2009.

[4] Peng T. "Algorithms and Models for 3-D Shape Measurement Using Digital Fringe Projections", Ph.D. thesis, University of Maryland, 2006.

[5] Fechteler P., Eisert P., Rurainsky J. Fast and High Resolution 3D Face Scanning, In: Proceedings of the 14th International Conference on Image Processing (ICIP 2007), San Antonio, Texas, USA, 2007.

[6] Artec M™ 3D Scanner. ARTEC GROUP. Artec M™ — Artec 3D Scanners, http://www.artec3d.com/3d_scanners/artec-m (accessed 21 April 2012).

[7] Lu X., Colbry D. & Jain A. Three-Dimensional Model Based Face Recognition, In: ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition, pp. 362-366, 2004.

[8] Mahoor M. H. & Abdel-Mottaleb M. Face recognition based on 3D ridge images obtained from range data, Pattern Recognition, vol. 42, no. 3, pp. 445-451, 2009.

[9] Pan G., Han S., Wu Z. & Wang Y. 3D Face Recognition using Mapped Depth Images, In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops, vol. 3, p. 175, 2005.

[10] Segundo M., Queirolo C., Bellon O. & Silva L. Automatic 3D facial segmentation and landmark detection, In: ICIAP '07: Proceedings of the 14th International Conference on Image Analysis and Processing, pp. 431-436, 2007.

[11] Gray A. The Gaussian and Mean Curvatures, In: Modern Differential Geometry of Curves and Surfaces with Mathematica, pp. 373-380, 1997.

[12] Turk M. & Pentland A. Face recognition using eigenfaces, In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 591, no. 1, pp. 586-591, 1991.

[13] Belhumeur P., Hespanha J. & Kriegman D. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, 1997.

[14] Hyvärinen A. Fast and robust fixed-point algorithms for independent component analysis, IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626-634, 1999.

[15] Heseltine T., Pears N. & Austin J. Three-dimensional face recognition using combinations of surface feature map subspace components, Image and Vision Computing, vol. 26, no. 3, pp. 382-396, 2006.

[16] Zhou X., Seibert H. & Busch C. A 3D face recognition algorithm using histogram-based features, In: Eurographics 2008 Workshop on 3D Object Retrieval, pp. 65-71, 2008.

[17] Mráček Š., Busch C., Dvořák R. & Drahanský M. Inspired by Bertillon – Recognition Based on Anatomical Features from 3D Face Scans, In: Proceedings of the 3rd International Workshop on Security and Communication Networks, pp. 53-58, 2011.

[18] Jahanbin S., Choi H., Liu Y. & Bovik A. C. Three Dimensional Face Recognition Using Iso-Geodesic and Iso-Depth Curves, In: 2nd IEEE International Conference on Biometrics: Theory, Applications and Systems, 2008.

[19] Berretti S., Del Bimbo A. & Pala P. 3D Face Recognition Using iso-Geodesic Stripes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2162-2177, 2010.

[20] Phillips P. J. et al. Overview of the Face Recognition Grand Challenge, In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, pp. 947-954, 2005.

[21] Phillips P. J. et al. FRVT 2006 and ICE 2006 large-scale experimental results, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 831-846, 2010.

[22] Blanz V. Face Recognition based on a 3D Morphable Model, In: Proc. of the 7th Int. Conference on Automatic Face and Gesture Recognition, pp. 617-622, 2006.

[23] Raton B. Affine Transformations, In: Standard Mathematical Tables and Formulae, D. Zwillinger, pp. 265-266, 1995.

[24] Mendez H., San Martín C., Kittler J., Plasencia Y., García E. Face recognition with LWIR imagery using local binary patterns, LNCS 5558, pp. 327-336, 2009.

[25] Hermosilla G., Ruiz-del-Solar J., Verschae R., Correa M. A comparative study of thermal face recognition methods in unconstrained environments, Pattern Recognition, vol. 45, no. 7, pp. 2445-2459, 2012.

[26] Lloyd J. M. Thermal Imaging Systems, New York: Plenum Press, ISBN 0-306-30848-7, 1975.

[27] FLIR Systems. The Ultimate Infrared Handbook for R&D Professionals: A Resource Guide for Using Infrared in the Research and Development Industry, United Kingdom, 2012.

[28] FLIR Systems. ThermaCAM Reporter – user's manual, Professional Edition, Version 8.1, publ. No. 1558567, 2007.

[29] Buddharaju P., Pavlidis I. Multi-spectral face recognition – fusion of visual imagery with physiological information, In: R. I. Hammoud, B. R. Abidi, M. A. Abidi (Eds.), Face Biometrics for Personal Identification: Multi-Sensory Multi-Modal Systems, Springer, pp. 91-108, 2007.

[30] Cho S. Y., Wang L., Ong W. L. Thermal imprint feature analysis for face recognition, In: IEEE International Symposium on Industrial Electronics (ISIE 2009), pp. 1875-1880, 2009.

[31] Jacovitti G. & Neri A. Multiscale image features analysis with circular harmonic wavelets, In: Proceedings of SPIE, vol. 2569, pp. 363-374, 1995.

[32] Lee T. Image representation using 2D Gabor wavelets, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 959-971, 1996.

[33] Friedrich G. & Yeshurun Y. Seeing People in the Dark: Face Recognition in Infrared Images, In: BMCV '02: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision, pp. 348-359, Springer-Verlag, London, UK, 2002.

[34] Otsu N. A Threshold Selection Method from Gray-Level Histograms, IEEE Transactions on Systems, Man and Cybernetics, pp. 62-66, 1979.

[35] Poh N. & Bengio S. How Do Correlation and Variance of Base-Experts Affect Fusion in Biometric Authentication Tasks? IEEE Transactions on Signal Processing, vol. 53, no. 11, pp. 4384-4396, 2005.

[36] Puente L., Poza M. J., Ruíz B. & Carrero D. Biometrical Fusion – Input Statistical Distribution, In: Advanced Biometric Technologies, InTech, pp. 87-110, 2011.

[37] Bishop C. Pattern Recognition and Machine Learning, Springer, p. 738, 2006.

[38] Equinox. Multimodal face database, http://www.equinoxsensors.com/products/HID.html (accessed 3 May 2012).

[39] Chen X., Flynn P. & Bowyer K. Visible-light and infrared face recognition, In: ACM Workshop on Multimodal User Authentication, pp. 48-55, 2003.

[40] Flynn P., Bowyer K. & Phillips P. Assessment of time dependency in face recognition: An initial study, Lecture Notes in Computer Science, vol. 2688, pp. 44-51, 2003.

[41] Yanushkevich S. N., Stoica A. & Shmerko V. P. Fundamentals of biometric-based training system design, In: S. N. Yanushkevich, P. Wang, S. Srihari, and M. Gavrilova (Eds.), M. S. Nixon (Consulting Ed.), Image Pattern Recognition: Synthesis and Analysis in Biometrics, World Scientific, 2006.

[42] Yanushkevich S. N., Stoica A., Shmerko V. Synthetic Biometrics, IEEE Computational Intelligence Magazine, vol. 2, no. 2, pp. 60-69, 2007.

[43] Wu S., Lin W., Xie S. Skin heat transfer model of facial thermograms and its application in face recognition, Pattern Recognition, vol. 41, no. 8, pp. 2718-2729, 2008.

[44] Lu Y., Yang J., Wu S., Fang Z. Normalization of Infrared Facial Images under Variant Ambient Temperatures, In: Advanced Biometrics Technologies, InTech, Vienna, Austria, ISBN 978-953-307-487-0, 2011.

**Chapter 3**

## **Finger-Vein Image Restoration Based on a Biological Optical Model**

Jinfeng Yang, Yihua Shi and Jucheng Yang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/52104

## **1. Introduction**

Finger-vein recognition, as a highly secure and convenient technique of personal identification, has attracted much attention for years. In contrast to conventional appearance-based biometric traits such as the face, fingerprint, and palmprint, finger-vein patterns are hidden beneath the human skin and are unnoticeable without the help of specific viewing or imaging devices. This makes the finger-vein trait resistant to theft or forgery, and thereby highly reliable for identity authentication.

Generally, in order to visualize finger-vein vessels inside the finger tissues, near-infrared (NIR) transillumination is often adopted in the image acquisition system [1], as shown in Fig. 1. In this imaging manner, an image sensor placed under the finger is used to record the transmitted NIR light, as shown in Fig. 1(a) and (b); here, Fig. 1(b) is a disassembled homemade imaging device. Then, with the help of imaging software, the finger-vein images can be captured by the computer. Due to the interaction between NIR light and biological tissues, the captured images inevitably carry some important inner information of the finger tissue. In blood vessels, hemoglobin absorbs more NIR radiation than other substances in finger tissues [2]; the intensity distribution of the transmitted NIR rays therefore varies spatially with the vessel distribution, and venous regions cast darker "shadows" on the imaging plane while the other tissues present a brighter background, as shown in Fig. 1(d). From Fig. 1(d), we can clearly see that not all regions of a captured image are useful for accurate finger-vein recognition, so to eliminate the unwanted regions, a simple but effective method of region of interest (ROI) localization has been proposed in our previous work [5]. In Fig. 1(e), we list some ROI extraction results for illustration.

**Figure 1.** Finger-vein image acquisition. (a) NIR light transillumination. (b) A homemade finger-vein imaging device. (c) Our finger-vein image acquisition system. (d) Some captured finger-vein images.

**Figure 2.** Finger-vein ROI extraction. Here the used ROI extraction method is proposed in [5].

Unfortunately, the captured finger-vein images are usually poor in quality due to low contrast, such that the venous regions are not salient. This certainly makes finger-vein feature representation unreliable, and further impairs the accuracy of finger-vein recognition in practice. According to tissue optics, multiple light scattering predominates in light that penetrates through a biological tissue layer [3], as biological tissue is a highly heterogeneous optical medium in imaging. Thus, the quality of finger-vein images is often poor, because the scattering effects can greatly reduce the contrast between the venous and non-venous regions [4]. The basic concept of image degradation due to light scattering is illustrated in Fig. 3. If the incident light is not scattered in the optical medium, a sharp shadow of an object is cast on the imaging plane, as shown in Fig. 3(a), where the dark circular region represents the true shadow of the object. However, the object shadow is always blurred to a certain extent, since light scattering is inevitable in real situations, as shown in Fig. 3(b). Hence, in practical scenarios, the inherent advantage of finger veins cannot always be exploited effectively and reliably for finger-vein recognition, due to the low contrast of the venous regions. Therefore, to exploit the genuine characteristics in finger-vein images, the visibility of finger-vein patterns should first be reliably improved.

**Figure 3.** Image contrast reduction due to light scattering. (a) A real shadow with no light scattering. (b) The shadow is blurred due to light scattering.

In this chapter, we first give an analysis of the intrinsic factors causing finger-vein image degradation, and then propose a simple but effective image restoration method based on scattering removal. To give a proper description of finger-vein image degradation, a biological optical model (BOM) specific to finger-vein imaging is proposed, according to the principles of light propagation in biological tissues. Finally, based on the BOM, the light scattering component is properly estimated and removed for finger-vein image restoration.

In the following sections, we first give a brief description of the related work in Section 2. Then, in Section 3, the traditional image dehazing model is presented, and the optical model used in this chapter is derived after discussing the difference and relationship between our model and the image dehazing model. In Section 4, the steps of the scattering removal algorithm are detailed. For finger-vein image matching, the Phase-Only Correlation measure is used, as described in Section 5. The experimental results are reported in Section 6. Finally, in Section 7, we give some conclusions.

## **2. Related work**

Traditionally, many image enhancement methods have been proposed to improve the quality of vein images. Histogram equalization based algorithms were used to enhance the contrast between the venous and background regions in [6, 7]. Considering the variations of vein-coursing directions, different oriented filtering strategies were used to highlight the finger-vein texture [8–11]. The retinex theory, combined with fuzzy techniques, was adopted to enhance near-infrared vein images [12]. Pi *et al.* [13] used an edge-preserving filter and an elliptic high-pass filter together to denoise and enhance small, blurred finger veins. Gao *et al.* [14] combined the traditional high-frequency emphasis filtering algorithm and histogram equalization to sharpen the image contrast. Oh *et al.* [15] proposed a homomorphic filter incorporating morphological subband decomposition to enhance the dark blood vessels. Although each of these methods can enhance vein images to some extent, their performance is considerably undesirable in practice, since none of them treats the key issue of light scattering in degrading finger-vein images.

Strong scattering occurring in the biological tissue during vein imaging is the main reason for contrast deterioration in finger-vein images [16]. Considering light transport in skin tissue, Lee and Park used a depth-dependent point spread function (D-PSF) to address the blurring issue in finger-vein imaging [29, 30]. This method is encouraging for improving finger-vein visibility; however, the D-PSF is derived for handling degradation in the transcutaneous fluorescent imaging manner, not in the transillumination manner [31]. Hence, the performance of the D-PSF in light scattering suppression is still unsatisfying for finger-vein images since, in transillumination, light attenuation (absorption and scattering) arises not only from the skin but also from other tissues of the finger, such as bone, muscles, and blood vessels [33]. Moreover, properly estimating the biological parameters is also a difficult task for D-PSF based image deblurring techniques in practice. Therefore, for reliable finger-vein image contrast improvement, this chapter aims to find a proper way of scattering removal according to tissue optics, especially skin optics.

**Figure 4.** Light scattering in the atmospheric medium. Here, the environmental illumination is redundant for object imaging.

In computer vision, scattering removal has been a hot topic for reducing the atmospheric scattering effects on images of outdoor scenes [17–21]. This technique is often termed dehazing or de-weathering, and is based on a physical model that describes the formation of a hazy image. Inspired by image dehazing, we here propose an optical-model-based scattering removal algorithm for finger-vein image enhancement. The proposed optical model accounts for the light propagation in the finger-skin layer, so it is powerful in describing the effects of skin scattering on finger-vein images.

## **3. The optical model of atmospheric scattering**

Light scattering is a physical phenomenon that occurs as light transports through a turbid medium. In daily life, we are very familiar with light scattering, as in the blue sky, fog, and smoke. The irradiance received by a camera is often attenuated due to medium absorption and scattering, which degrades the captured images and makes them lose contrast. Removing the scattering effect is therefore necessary for improving scene visibility. In computer vision, the physical model widely used for image dehazing, also named the Koschmieder model, is expressed as [22]

$$I_d = e^{-Kd} I_0 + (1 - e^{-Kd}) I_{\infty}. \tag{1}$$


This model provides a very simple but elegant description of the two main effects of atmospheric scattering on the observed intensity *I<sub>d</sub>* of an object at a distance *d* on a hazy or foggy day. Here, the intensity at close range (distance *d* = 0), *I*<sub>0</sub>, is called the intrinsic intensity of the object; *I*<sub>∞</sub> is the intensity of the environmental illumination (equivalent to an object at infinite distance), which is generally assumed to be globally constant; and *K* is the extinction coefficient of the atmosphere.

As illustrated in Fig. 4, the first effect of atmospheric scattering is called direct attenuation, and can be described by the Beer–Lambert law, which results in an exponential attenuation of the object intensity with the transmission distance through the scattering medium, i.e., the first term *e*<sup>−*Kd*</sup>*I*<sub>0</sub> on the right side of Eq. (1). The second effect, referred to as airlight in the Koschmieder theory of horizontal visibility, is caused by the suspended particles in haze or fog that scatter the environmental illumination toward the observer. The airlight acts as an additional radiation superimposed on the image of the object, whose intensity is related to the environmental illumination *I*<sub>∞</sub> and increases with the pathlength *d* from the observer to the object, as described by the term (1 − *e*<sup>−*Kd*</sup>)*I*<sub>∞</sub>.
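
To make Eq. (1) concrete, the sketch below synthesizes the observed intensity from an intrinsic image and then inverts the model to recover it when *K*, *d*, and *I*<sub>∞</sub> are known. This is a toy illustration of the Koschmieder model only, not the restoration algorithm of this chapter; the function names are ours.

```python
import numpy as np

def koschmieder_forward(I0, K, d, I_inf):
    """Eq. (1): direct attenuation plus airlight."""
    t = np.exp(-K * d)            # transmission along the path
    return t * I0 + (1.0 - t) * I_inf

def koschmieder_inverse(Id, K, d, I_inf):
    """Recover the intrinsic intensity I0 when K, d and the airlight I_inf are known."""
    t = np.exp(-K * d)
    return (Id - (1.0 - t) * I_inf) / t
```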

**Figure 5.** Light propagation through biological tissue. Here, multiple scattering is mainly caused by diffuse photons.


It is noticeable that, despite not taking multiple scattering into account, the Koschmieder model is practicable for vision applications. In the atmosphere, the distances between particles are usually large enough that the particles can be viewed as independent scatterers, whose scattered intensities do not significantly interfere with each other, and thus the effect of multiple scattering is negligible [23]. In biological tissue, by contrast, light propagation undergoes a more complex process due to the complexity of the tissue structure. In particular, the scattering particles in biological tissue are so dense that the interaction of scattered intensities between neighboring particles cannot be ignored [24]. Hence, multiple scattering is prevalent in biological optical media [1, 32–35].

From the biophotonic point of view, as light propagates through a tissue, the transmitted light is composed of three components—the ballistic, the snake, and the diffuse photons [25], as shown in Fig. 5. Ballistic photons travel a straight, undeviated path in the medium. Snake photons experience some slight scattering events but still propagate in the forward or near-forward direction. Diffuse photons undergo multiple scattering and emerge in random directions. Obviously, in transillumination imaging of objects embedded in biological tissue, the ballistic photons, which preserve their propagation direction, can form sharp shadows of the objects on the imaging plane, whereas the multiply scattered diffuse photons inevitably reduce the contrast of the shadows and give rise to an unwanted, incoherent imaging background [26]. That is to say, multiple scattering is the most unfavorable factor: it produces diffuse photons and thus leads to image blurring in optical transillumination imaging.

**Figure 5.** Light propagation through biological tissue. Here, multiple scattering is mainly caused by diffuse photons.

Based on the preceding analysis of the image dehazing model and the associated knowledge about light propagation through biological tissue, we propose a simplified skin scattering model to characterize the effects of skin scattering on finger-vein imaging, as shown in Fig. 6. Before giving a mathematical description of the proposed model, several points with respect to the optical model should be stated:



• The finger palm-side skin layer can be viewed as a medium with a random but homogeneous distribution of scattering particles over its thickness [27], as shown in Fig. 7(b), and the scattering coefficient of the skin tissue can therefore be assumed to be locally constant for a given finger subject.

• Different from image dehazing techniques, we need not consider the effect of environmental illumination or the airlight. Nevertheless, because light interaction occurs among biological scatterers, the scattered radiation from both the object and the background is partially re-scattered towards the observer, which approximately amounts to environmental illumination for finger-vein imaging.

**Figure 7.** Skin layer modeling. (a) Cross-sectional view of human skin. (b) Simplified model of finger palm-side skin layer.

In view of these points, the radiant intensity observed at the skin surface, corresponding to an object at a certain depth in the skin, can be simply decomposed into a direct attenuation component and a scattering component, as shown in Fig. 6. The former, representing the unscattered ballistic photons, is a reduction of the original radiation over the traversed medium, which obeys the Beer–Lambert law [16, 36], while the latter represents the effect of snake and diffuse photons, which emerge randomly from the tissue surface. In particular, a proportion of the scattered radiation enters the direction of the observer and interferes with the direct radiation of the object; its intensity increases with depth, because a deeper object tends to suffer more influence from the scattered radiation.

**Figure 6.** The simplified scattering model in the human skin layer. Here, the light photons are divided into scattered and un-scattered groups.

For the direct attenuation component, its intensity on the imaging plane is mainly determined by the non-scattered light. Assuming that *µ*<sup>1</sup> and *µ*<sup>2</sup> denote the optical absorption and scattering coefficients, respectively, the Beer–Lambert law gives

$$I_{tr} = I_0 e^{-\mu D}, \tag{2}$$


where *I*<sup>0</sup> represents a finger-vein image free of degradation, *D* is the object depth in the biological medium, *µ* = *µ*<sup>1</sup> + *µ*<sup>2</sup> is called the transport attenuation coefficient, and *Itr* denotes the transmitted intensity after absorption and scattering. Notably, due to the heterogeneity of skin tissue and the spatial randomness of the vein distribution, both the transport attenuation coefficient *µ* and the depth *D* vary spatially in the tissue medium, that is, *µ* = *µ*(*x*, *y*) and *D* = *D*(*x*, *y*). So, for a given biological tissue, we can define


$$T(x, y) = e^{-\mu(x,y)D(x,y)}. \tag{3}$$

*T*(*x*, *y*) is often called the non-scattered transmission map [16], which describes the optical transmissivity of the given tissue medium.

For the scattering component, due to the randomness of the scattered light, it can be regarded as background illumination on the whole, and only a part of this background illumination can arrive at the imaging plane. To understand this point intuitively, Fig. 8 gives a schematic illustration. In Fig. 8, *s* represents an original source in x-y coordinates, *p* is the observation of *s* on the imaging plane, and *H* denotes a small column in the skin tissue corresponding to a beam from the object point *s* to the point *p* on the image plane (each pixel corresponds to a small column). The neighboring points ($s'_i$, $i = 1, 2, \cdots, n$) around *s* are viewed as local background radiation sources, which emit radiation and produce a scattering component along *H*.

**Figure 8.** Schematic representation of the effect of scattered radiation and the finger-vein image degradation process. Here, *s* and *p* each correspond to a point in the x-y coordinates.

Accordingly, in a manner similar to the Koschmieder model, the proposed biological optical model (BOM) is defined as

$$I(p) = I\_0(s)T(s) + (1 - T(s))I\_r(s),\tag{4}$$

where *I*0(*s*) still represents the intrinsic intensity of the object to be visualized (that is, the veins), *Ir*(*s*) denotes the intensity of scattered radiation, and *I*(*p*) is the observation of the vein object on the image plane. A key point to note is that, different from the environmental illumination in the atmosphere, *Ir*(*s*) varies spatially, because its value is associated with the intensities of the imaging background.

Let the original intensity of a neighboring point $s'_i$ be $I_0(s'_i)$; then the directly transmitted radiation of this point, that is, the unscattered radiation, should be $I_0(s'_i)T(s'_i)$. So, according to the energy conservation principle, the scattered radiation of this point should be $(1 - T(s'_i))\,I_0(s'_i) = (1 - e^{-\mu(s'_i)D(s'_i)})\,I_0(s'_i)$, where $D(s'_i)$ is the depth of point $s'_i$ in the skin layer. Thus, we can obtain the scattered radiation *Ir*(*s*) in *H*. Since the scattering directions are random, *Ir*(*s*) here is considered as an average of the total radiation from all neighboring points and can be written as

$$I_r(s) = \frac{1}{Z_{\Omega(s)}} \sum_{s'_i \in \Omega(s)} \left(1 - T(s'_i)\right) I_0(s'_i). \tag{5}$$


where Ω(*s*) denotes the 2D neighborhood centered at point *s*, and $Z_{\Omega(s)}$ indicates the number of points in Ω(*s*). Given *Ir*(*s*), *µ*(*s*) and *D*(*s*), we can obtain *I*0(*s*), which represents the intrinsic intensity of a finger-vein image without scattering corruption.

However, solving *I*0(*s*) from a single observed image *I*(*p*) with Eq. (4) is a severely ill-posed problem. Not only is the extinction coefficient *µ*(*s*) of human skin tissue inconsistent, but the thickness *D*(*s*) also varies between individuals. The values of *Ir*(*s*), *µ*(*s*) and *D*(*s*) therefore cannot be accurately evaluated in practice, because the light scattering phenomenon in tissue is very complex. Hence, we have to use the observed (captured) image *I*(*p*) to estimate the scattering component in order to implement scattering removal.

## **4. The proposed scattering removal algorithm for restoration**

In the observation *I*(*p*), veins appear as shadows due to light absorption, which makes the vein information sensitive to illumination modification. Hence, finger-vein images should be transformed into their negative versions. In the negative versions, the venous regions become brighter than their surroundings, so the veins can be regarded as luminous objects. Moreover, in this situation, the skin tissue can be approximately treated as the only opaque layer that blurs the vein objects during imaging. This is beneficial for scattering illumination estimation. Thus, we can rewrite the proposed BOM as

$$
\hat{I}(p) = \hat{I}\_0(s)T(s) + (1 - T(s))\hat{I}\_r(s), \tag{6}
$$

where ˆ*I*(*p*), ˆ*I*0(*s*) and ˆ*Ir*(*s*) represent the negative versions of *I*(*p*), *I*0(*s*) and *Ir*(*s*), respectively. Referring to the image dehazing technique, we here introduce

$$V(s) = (1 - T(s))\hat{I}\_r(s). \tag{7}$$

*V*(*s*) can be regarded as the total scattering component. Moreover, we can obtain the transmission map,

$$T(s) = 1 - \frac{V(s)}{\hat{I}\_r(s)}.\tag{8}$$

*T*(*s*) describes the relative portion of light radiation surviving through a medium. Thus, the optical model can be rewritten as

$$
\hat{I}(p) = T(s)\hat{I}\_0(s) + V(s). \tag{9}
$$

Instead of directly computing ˆ*I*0(*s*), we first estimate the scattering component *V*(*s*), and then estimate the intensity of scattered radiation ˆ*Ir*(*s*). Thus, the restored image ˆ*I*0(*s*) can be obtained based on Eqs. (8) and (9).

## **4.1. Scattering component estimation**


Generally, the distribution of scattering energy is not uniform in a local block Ω(*s*), since the skin medium is inhomogeneous. However, two points can be affirmed: (1) the directions of the scattered light rays are random due to the high density of biological cells, and (2) the nearer $s'_i$ is to *s*, the higher the probability that its scattered light enters column *H* [40, 41]. Although the direction of scattered light is highly random due to the heterogeneity of skin tissue, multiple scattering is dominated by near-forward scattering events in biological tissues [16, 42]. Hence, it is reasonable to use the local observation to estimate the scattering component.

Here, unlike the solution for scattering component estimation described in [18], *V*(*s*) varies locally and spatially on the finger-vein imaging plane due to the heterogeneity of human skin tissue. In this sense, three practical constraints should be introduced for the estimation of *V*(*s*):

• For each point *s*, the intensity *V*(*s*) is positive and cannot be higher than the final observed intensity $\hat{I}(p)$, that is, $0 \le V(s) \le \hat{I}(p)$;

• *V*(*s*) is smooth except at the edges of venous regions, since the points in Ω(*s*) are approximately the same in depth;

• $\hat{I}_r(s)$ tends to be constant in Ω(*s*), and $V(s) \le \hat{I}_r(s) \le \hat{I}(p)$.


Based on these constraints, a fast algorithm described in [20] is modified to estimate *V*(*s*) as

$$V(s) = \max\left(\min\left(w\_1 B(s), \hat{I}(p)\right), 0\right),\tag{10}$$

where $B(s) = A(p) - \mathrm{median}_{\Omega(p)}\big(|\hat{I}(p) - A(p)|\big)$, $A(p) = \mathrm{median}_{\Omega(p)}\big(\hat{I}(p)\big)$, Ω(*p*) denotes the 2D neighborhood centered at point *p*, and *w*<sup>1</sup> (∈ [0, 1]) is a factor controlling the strength of the estimated scattering component. Next, to remove the estimated scattering effect, we also have to estimate *T*(*s*) from the observation.
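To make Eq. (10) concrete, the following is a minimal sketch of the estimator, assuming a grayscale negative image normalized to [0, 1], a square window standing in for Ω(*p*), and SciPy's median filter; the window size `win` and the factor `w1` are illustrative choices, not values fixed by the chapter.

```python
import numpy as np
from scipy.ndimage import median_filter

def estimate_scattering_component(I_hat, win=15, w1=0.9):
    """Eq. (10): V(s) = max(min(w1 * B(s), I_hat(p)), 0).

    I_hat : negative finger-vein image as a float array in [0, 1].
    win   : side length of the square neighborhood Omega(p) (assumed shape).
    w1    : strength factor in [0, 1].
    """
    A = median_filter(I_hat, size=win)                   # A(p): local median of the observation
    B = A - median_filter(np.abs(I_hat - A), size=win)   # B(s): A(p) minus local median deviation
    return np.maximum(np.minimum(w1 * B, I_hat), 0.0)    # clamp into [0, I_hat(p)]
```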

## **4.2. Scattering radiation estimation**

To obtain the transmission map *T*(*s*), we should first compute *Ir*(*s*) according to Eq. (8). Intuitively, we could obtain $\hat{I}_r(s)$ via Eq. (5) directly. However, this is difficult in practice, since the intrinsic intensity $\hat{I}_0(s'_i)$ is unavailable. Hence, considering the physical meaning that the scattered radiation $\hat{I}_r(s)$ depends on the interaction among the neighboring points in Ω(*s*), we simply use a local statistic of Ω(*p*) to represent $\hat{I}_r(s)$, that is,

$$\hat{I}_r(s) = \frac{w_2}{Z_{\Omega(p)}} \sum_{i=1}^{Z_{\Omega(p)}} \hat{I}(p_i), \tag{11}$$

where $p_i \in \Omega(p)$, $Z_{\Omega(p)}$ indicates the number of points in Ω(*p*), and *w*<sup>2</sup> (∈ [0, 1]) is a factor that makes the constraint $V(s) \le \hat{I}_r(s) \le \hat{I}(p)$ satisfied. So, based on Eq. (8), we can estimate *T*(*s*) accordingly.
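Continuing the sketch, Eq. (11) reduces to a scaled local mean, which a box filter computes directly. The explicit clamp enforcing $V(s) \le \hat{I}_r(s)$ is our own safeguard (the chapter obtains it through the choice of *w*<sup>2</sup>), and `eps` keeps Eq. (8) away from division by zero.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def estimate_radiation_and_transmission(I_hat, V, win=15, w2=0.95, eps=1e-6):
    """Eq. (11) and Eq. (8): local-mean scattered radiation and transmission map."""
    I_r = w2 * uniform_filter(I_hat, size=win)   # Eq. (11): scaled mean of I_hat over Omega(p)
    I_r = np.maximum(I_r, V + eps)               # safeguard so that V(s) <= I_r(s), hence T > 0
    T = 1.0 - V / I_r                            # Eq. (8)
    return I_r, T
```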

## **4.3. Finger-vein image restoration**

Given the estimations of *V*(*s*) and *T*(*s*), we can approximately restore an original finger-vein image with scattering removal. That is, by solving Eq. (6) with respect to ˆ*I*0(*s*), we can obtain

$$\begin{split} I\_0(s) &= 1 - \hat{I}\_0(s) \\ &= 1 - \frac{\hat{I}(p) - V(s)}{T(s)}. \end{split} \tag{12}$$


Thus, computing *I*0(*s*) pixelwise using Eq. (12) can generate an image *I*0(*x*, *y*) automatically and effectively. Here, *I*0(*x*, *y*) represents the restored finger-vein image which appears free of multiple light scattering.
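Assembling the two estimators above, a hedged end-to-end sketch of the restoration in Eq. (12) might look as follows; the final clipping of $\hat{I}_0$ to [0, 1] is our own normalization choice, not something specified in the chapter.

```python
import numpy as np

def restore_finger_vein(I, win=15, w1=0.9, w2=0.95, eps=1e-6):
    """End-to-end restoration, Eqs. (6)-(12), for a grayscale image I in [0, 1]."""
    I_hat = 1.0 - I                                             # negative version: veins become bright
    V = estimate_scattering_component(I_hat, win, w1)           # Eq. (10)
    _, T = estimate_radiation_and_transmission(I_hat, V, win, w2, eps)
    I0_hat = (I_hat - V) / np.maximum(T, eps)                   # invert Eq. (9)
    return 1.0 - np.clip(I0_hat, 0.0, 1.0)                      # Eq. (12): back to positive polarity
```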

## **5. Finger-vein image matching**

In this section, the Phase-Only-Correlation (POC) measure proposed in [28] is used for simply handling the finger-vein matching problem based on the restored finger-vein images.

Assume that $I_{0_i}(x, y)$ and $I_{0_j}(x, y)$ are two restored images, and that $F_i(u, v)$ and $F_j(u, v)$ are their 2D DFTs, respectively. According to the correlation property of the Fourier transform,

$$I_{0_i}(x, y) \circ I_{0_j}(x, y) \Longleftrightarrow F_i(u, v)\overline{F_j(u, v)}, \tag{13}$$

where " ◦ " denotes a 2D correlation operator, we can compute the cross phase spectrum as

$$R(u, v) = \frac{F_i(u, v)\overline{F_j(u, v)}}{\left|F_i(u, v)\overline{F_j(u, v)}\right|} = e^{j\theta(u, v)}. \tag{14}$$

Let *r*(*x*, *y*) = IDFT(*R*(*u*, *v*)); *r*(*x*, *y*) is then called the POC measure. The POC measure has a sharp peak when two restored finger-vein images are similar, whereas it is near zero for images from different classes, as shown in Fig. 9. Moreover, the POC function is somewhat insensitive to image shifts and noise, which helps to measure similarities accurately in finger-vein image matching.
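For illustration, the POC surface of Eqs. (13) and (14) can be computed with a pair of FFTs; this is a generic sketch of the measure, not the authors' implementation.

```python
import numpy as np

def poc_surface(img1, img2, eps=1e-12):
    """Eqs. (13)-(14): phase-only correlation surface r(x, y) of two images."""
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    cross = F1 * np.conj(F2)                     # cross spectrum F_i(u,v) * conj(F_j(u,v))
    R = cross / np.maximum(np.abs(cross), eps)   # keep phase only, Eq. (14)
    return np.real(np.fft.ifft2(R))              # r(x, y); its peak is the matching score
```

As a sanity check, `poc_surface(img, img).max()` is exactly 1 for any image, while the peak for images of different fingers stays near zero, matching Fig. 9.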

It is worth pointing out that, to robustly handle accurate image matching, the band-limited phase-only correlation (BLPOC) function has also been proposed in [28] and is widely used for image matching in practice [37–39]. Compared with POC, BLPOC is more reliable for measuring the similarity between two images. However, traditional POC is more telling than BLPOC for investigating image quality, because the matching result based on POC is more sensitive to image quality than that of BLPOC. Hence, the POC function can still be used as a simple and effective measure to objectively evaluate the performance of the proposed method in scattering removal and venous region enhancement.

**Figure 9.** POC measure. Left: *r*(*x*, *y*) of two identical finger-vein images (peak value 1). Right: *r*(*x*, *y*) of two finger-vein images from different classes (peak value 0.0317).

## **6. Experimental results**


In this section, the finger-vein images used are captured by a homemade transillumination imaging system with a 760 nm NIR LED array source, and are extracted from the raw images by the ROI localization and segmentation method proposed in [5]. The finger-vein image database contains 700 individual finger-vein images from 70 individuals. Each individual contributes 10 forefinger-vein images of the right hand. All cropped finger-vein images are 8-bit gray images with a resolution of 180 × 100.

## **6.1. Finger-vein image restoration**

Here, some captured finger-vein image samples are collected to demonstrate the validity of the proposed method for finger-vein image restoration. Fig. 10 shows some examples of the estimated *V*(*x*, *y*), *Ir*(*x*, *y*), *T*(*x*, *y*) and the restored finger-vein images *I*0(*x*, *y*). After scattering removal, the contrast of the finger-vein images is improved significantly, and the vein networks in the restored images can be clearly distinguished from the background. This shows that the proposed optical model, which allows for the effects of light scattering in the skin layer, particularly multiple scattering, is suitable for describing the mechanism of finger-vein image degradation.

Nevertheless, the proposed method is somewhat sensitive to image noise, as shown in Fig. 10(e). In fact, before lighting the palm-side veins, the NIR rays have already been randomly diffused by the finger dorsal tissues, such as the finger-back skin, bone, tendon, and fatty tissue. This inevitably gives rise to irregular shadows and noise in the captured finger-vein images, whereas the proposed optical model does not take into account the effects of the finger dorsal tissues other than the palm-side skin. As a result, the spatially varying background noise is also strengthened when estimating the scattering components.

In Fig. 11, we compare our method with several common approaches for finger-vein image enhancement. Additionally, we treat the degraded finger-vein images as hazy images and directly use a dehazing method to restore them, regardless of the mismatch between the Koschmieder model and the proposed model. Here, the method proposed in [20] is adopted to implement finger-vein image "dehazing", and the results are also shown in Fig. 11.


**Figure 10.** Scattering removal experiments. (a) Some captured finger-vein images *I*(*x*, *y*). (b) The estimated scattering components *V*(*x*, *y*). (c) The estimated scattering radiations *Ir*(*x*, *y*). (d) The estimated transmission maps *T*(*x*, *y*). (e) The restored images *I*0(*x*, *y*).

In order to evaluate the performance of the proposed method in terms of contrast improvement for finger-vein images, the mean structural similarity index (MSSIM) [13] and the contrast improvement index (CII) [15] are used as two common evaluation criteria. We first randomly choose 50 individual finger-vein images from the database as samples and use the enhancement methods in Fig. 11 to process them. Then, we obtain the average MSSIM and the average CII of every enhancement method.

In general, MSSIM is often used to measure the similarity between a processed image and a standard image of perfect quality (*i.e.*, a distortion-free image). The larger the value of MSSIM, the better an image is improved, in the sense that the processed image better approximates the standard quality. However, it is impossible for us to have standard or perfect finger-vein images, since the captured images are all degraded by light scattering. Therefore, we regard the degraded finger-vein images as the standard references. Thus, the more dissimilar a processed finger-vein image is to its original version, the better the finger-vein image is improved; that is, the lower the value of MSSIM, the better the quality of a restored image. CII is often used to measure the improvement of contrast between a processed image and its original version; the larger the value of CII, the better the contrast of an improved image.

Hence, the quality and visibility of the restored finger-vein images can be quantitatively evaluated using MSSIM and CII. In Table 1, we list the two values corresponding to the different finger-vein enhancement methods. From Table 1, we can clearly see that the proposed method provides the lowest MSSIM value and the highest CII value. This means the proposed method has the best performance in finger-vein image enhancement.
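For reproducibility, a rough evaluation sketch is given below. The MSSIM half relies on scikit-image's `structural_similarity`; the CII half uses a global RMS-contrast ratio as a stand-in, since the exact region-based contrast definition of [15] is not restated in the chapter, so its numbers should not be expected to reproduce Table 1 exactly.

```python
import numpy as np
from skimage.metrics import structural_similarity

def average_mssim(originals, enhanced):
    """Average MSSIM of enhanced images against their degraded originals (lower is better here)."""
    return float(np.mean([structural_similarity(o, e, data_range=1.0)
                          for o, e in zip(originals, enhanced)]))

def average_cii(originals, enhanced, eps=1e-12):
    """Contrast improvement index as a ratio of global RMS contrasts (stand-in definition)."""
    return float(np.mean([e.std() / max(o.std(), eps) for o, e in zip(originals, enhanced)]))
```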


**Figure 11.** Comparisons with other methods. (a) Some captured finger-vein images. (b) The results from histogram template equalization (HTE) [6]. (c) The results from high frequency emphasis filtering (HFEF) [14]. (d) The results from circular Gabor filtering (CGF) [8]. (e) The results from image dehazing (ImD) [20]. (f) The results from the proposed method.


| Methods | Average MSSIM | Average CII |
|---|---|---|
| The captured images | 1 | 1 |
| Histogram Template Equalization (HTE) | 0.4076 | 4.4941 |
| High Frequency Emphasis Filtering (HFEF) | 0.4239 | 3.7571 |
| Circular Gabor Filtering (CGF) | 0.4141 | 3.7386 |
| Image Dehazing (ImD) | 0.4932 | 3.3967 |
| The Proposed Method | 0.3358 | 4.6210 |

**Table 1.** Quantitative evaluation of different enhancement methods.

## **6.2. Finger-vein image matching**


For finger-vein matching on this database, the number of genuine attempts is 3,150 (70 × *C*(10, 2)), and the number of impostor attempts is 241,500 (10 × 10 × *C*(70, 2)). By respectively using the original images, the HTE-based images, the HFEF-based images, the CGF-based images, and the proposed restored images for finger-vein matching under the POC (phase-only correlation) measure, the ROC (receiver operating characteristic) curves are plotted in Fig. 12, where the false non-match rate (FNMR) and the false match rate (FMR) are shown in the same plot at different thresholds on the POC matching score, and the EER (equal error rate) is the error rate where FNMR and FMR are equal.
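Given the two score sets (the 3,150 genuine and 241,500 impostor POC peaks), the FNMR/FMR trade-off and the EER can be computed with a simple threshold sweep; this is a generic sketch of the standard procedure, not the authors' evaluation code.

```python
import numpy as np

def eer_from_scores(genuine, impostor):
    """Sweep thresholds over POC scores to find where FNMR and FMR cross (the EER)."""
    genuine = np.asarray(genuine, dtype=float)    # e.g., the 3,150 genuine attempts
    impostor = np.asarray(impostor, dtype=float)  # e.g., the 241,500 impostor attempts
    eer, best_gap = 1.0, np.inf
    for t in np.unique(np.concatenate([genuine, impostor])):
        fnmr = np.mean(genuine < t)    # genuine matches rejected at threshold t
        fmr = np.mean(impostor >= t)   # impostor matches accepted at threshold t
        if abs(fnmr - fmr) < best_gap:
            best_gap, eer = abs(fnmr - fmr), (fnmr + fmr) / 2.0
    return eer
```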

From Fig. 12, we can clearly see that the proposed method yields the best ROC curves and the lowest EER. This indicates that the finger-vein images with scattering removal are more discriminative between classes. Hence, the proposed method is desirable for improving the accuracy of finger-vein image matching in practice.


**Figure 12.** ROC curves (FMR vs. FNMR) of different finger-vein enhancement results: the original images, histogram template equalization, high frequency emphasis filtering, circular Gabor filtering, and the proposed method.

## **7. Conclusions**

In this chapter, a scattering removal method was introduced for finger-vein image restoration. The proposed method was based on a biological optical model that reasonably describes the effects of skin scattering. In this model, the degradation of finger-vein images is viewed as a joint function of direct light attenuation and multiple light scattering. By properly estimating the scattering components and transmission maps, the proposed method can effectively remove the effects of skin scattering from finger-vein images and obtain restored results. The comparative experiments and quantitative evaluations demonstrated that the proposed method provides better results than the common methods for finger-vein image enhancement and recognition.

Indeed, the proposed method also has its drawbacks. First, the simplified model in our work does not take into account the effects of background tissues, which makes the proposed method somewhat sensitive to image noise while enhancing the vein patterns. Besides, the rough estimations of the scattering components and the scattered radiation can also decrease the performance of the proposed method to some extent. All these shortcomings will be addressed in future work.

## **Acknowledgements**

This work was supported in part by the National Natural Science Foundation of China (Grant No.61073143 and 61063035).

## **Author details**


Jinfeng Yang1, Yihua Shi1 and Jucheng Yang<sup>2</sup>

1 Tianjin Key Lab for Advanced Signal Processing, Civil Aviation University of China, China

2 College of Computer Science and Information Engineering, Tianjin University of Science and Technology, China

## **References**


[1] Dhawan, A.P.; Alessandro, B.D.; Fu, X. Optical imaging modalities for biomedical applications. *IEEE Reviews in Biomedical Engineering*, 3, pp.69-92, 2010.

[2] Kono, M.; Ueki, H.; Umemura, S. Near-infrared finger vein patterns for personal identification. *Appl. Opt.*, 41, pp.7429-7436, 2002.

[3] Backman, V.; Wax, A. Classical light scattering models. *Biomedical Applications of Light Scattering*; Wax, A., Backman, V., Eds.; McGraw-Hill: New York, NY, USA, pp.3-29.

[4] Sprawls, P. Scattered Radiation and Contrast. *The Physical Principles of Medical Imaging*, 2nd ed.; Aspen Publishers: New York, NY, USA, 1993.

[5] Yang, J.F.; Li, X. Efficient Finger Vein Localization and Recognition. *Proceedings of the 20th International Conference on Pattern Recognition*, Istanbul, Turkey, 23-26 August 2010; pp. 1148-1151.

[6] Wen, X.B.; Zhao, J.W.; Liang, X.Z. Image enhancement of finger-vein patterns based on wavelet denoising and histogram template equalization (in Chinese). *J. Jilin University*, 46, pp.291-292, 2008.

[7] Zhao, J.J.; Xiong, X.; Zhang, L.; Fu, T.; Zhao, Y.X. Study on enhanced algorithm of hand vein image based on CLAHE and Top-hat transform (in Chinese). *Laser Infrared*, 39, pp.220-222, 2009.

[8] Yang, J.F.; Yang, J.L.; Shi, Y.H. Combination of Gabor Wavelets and Circular Gabor Filter for Finger-vein Extraction. *Proceedings of the 5th International Conference on Intelligent Computing*, Ulsan, South Korea, 16-19 September 2009.

[9] Yang, J.F.; Yang, J.L. Multi-Channel Gabor Filter Design for Finger-vein Image Enhancement. *Proceedings of the 5th International Conference on Image and Graphics*, Xi'an, China, 20-23 September 2009; pp. 87-91.

[10] Yang, J.F.; Yan, M.F. An Improved Method for Finger-vein Image Enhancement. *Proceedings of IEEE 10th International Conference on Signal Processing*, Beijing, China, 24-28 October 2010; pp. 1706-1709.

[11] Wang, K.J.; Ma, H.; Li, X.F.; Guan, F.X.; Liu, J.Y. Finger vein pattern extraction method using oriented filtering technology (in Chinese). *J. Image Graph.*, 16, pp.1206-1212, 2011.

[12] Wang, K.J.; Fu, B.; Xiong, X.Y. A novel adaptive vein image contrast enhancement method based on fuzzy and retinex theory (in Chinese). *Tech. Automation Appl.*, 28, pp.72-75, 2009.

[13] Pi, W.; Shin, J.O.; Park, D.S. An Effective Quality Improvement Approach for Low Quality Finger Vein Image. *Proceedings of International Conference on Electronics and Information Engineering*, Kyoto, Japan, 1-2 August 2010; pp. V1-424-427.

[14] Gao, X.Y.; Ma, J.S.; Wu, J.J. The research of finger-vein image enhancement algorithm (in Chinese). *Opt. Instrum.*, 32, pp.29-32, 2010.

[15] Oh, J.S.; Hwang, H.S. Feature enhancement of medical images using morphology-based homomorphic filter and differential evolution algorithm. *Int. J. Control Automation Syst.*, 8, pp.857-861, 2010.

[16] Cheong, W.F.; Prahl, S.A.; Welch, A.J. A review of the optical properties of biological tissues. *IEEE J. Quantum Electron.*, 26(12), pp.2166-2185, 1990.

[17] Narasimhan, S.G.; Nayar, S.K. Contrast restoration of weather degraded images. *IEEE Trans. Pattern Anal. Mach. Intell.*, 25, pp.713-724, 2003.

[18] Tan, R.T. Visibility in bad weather from a single image. *Proceedings of IEEE Conference on Computer Vision and Pattern Recognition*, Anchorage, AK, USA, 23-28 June 2008; pp.1-8.

[19] Fattal, R. Single image dehazing. *ACM Trans. Graph.*, 27, pp.1-9, 2008.

[20] Tarel, J.P.; Hautière, N. Fast Visibility Restoration from a Single Color or Gray Level Image. *Proceedings of IEEE 12th International Conference on Computer Vision*, Kyoto, Japan, 29 September-2 October 2009; pp. 2201-2208.

[21] He, K.M.; Sun, J.; Tang, X.O. Single image haze removal using dark channel prior. *IEEE Trans. Pattern Anal. Mach. Intell.*, 33, pp.2341-2353, 2011.

[22] Dumont, E.; Hautière, N.; Gallen, R. A semi-analytic model of fog effects on vision. *Atmospheric Turbulence, Meteorological Modeling and Aerodynamics*; Lang, P.R., Lombargo, F.S., Eds.; Nova Science Publishers: New York, NY, USA, 2011; pp.635-670.

[23] Narasimhan, S.G.; Nayar, S.K. Vision and the atmosphere. *Int. J. Comput. Vis.*, 48, pp.233-254, 2002.

[24] Hollis, V. Non-Invasive Monitoring of Brain Tissue Temperature by Near-Infrared Spectroscopy. Ph.D. Dissertation, University of London, London, UK, 2002.

[25] Prasad, P.N. Bioimaging: Principles and Techniques. *Introduction to Biophotonics*; John Wiley & Sons: New York, NY, USA, 2003; pp. 203-209.

[26] Ramachandran, H. Imaging through turbid media. *Current Sci.*, 76, pp.1334-1340, 1999.

[27] Van Gemert, M.J.C.; Jacques, S.L.; Sterenborg, H.J.C.M.; Star, W.M. Skin optics. *IEEE Trans. Biomed. Eng.*, 36, pp.1146-1154, 1989.

[28] Ito, K.; Nakajima, H.; Kobayashi, K.; Aoki, T.; Higuchi, T. A fingerprint matching algorithm using phase-only correlation. *IEICE Trans. Fundamentals*, E87-A(3), pp.682-691, 2004.

[29] Lee, E.C.; Park, K.R. Restoration method of skin scattering blurred vein image for finger vein recognition. *Electron. Lett.*, 45(21), pp.1074-1076, 2009.

[30] Lee, E.C.; Park, K.R. Image restoration of skin scattering and optical blurring for finger vein recognition. *Opt. Lasers Eng.*, 49, pp.816-828, 2011.

[31] Shimizu, K.; Tochio, K.; Kato, Y. Improvement of transcutaneous fluorescent images with a depth-dependent point-spread function. *Applied Optics*, 44(11), pp.2154-2161, 2005.

[32] Delpy, D.T.; Cope, M. Quantification in tissue near-infrared spectroscopy. *Phil. Trans. R. Soc. Lond. B*, 352, pp.649-659, 1997.

[33] Xu, J.; Wei, H.; Li, X.; Wu, G.; Li, D. Optical characteristics of human veins tissue in Kubelka-Munk model at He-Ne laser in vitro. *Journal of Optoelectronics Laser*, 13(3), pp.401-404, 2002.

[34] Cheng, R.; Huang, B.; Wang, Y.; Zeng, H.; Xie, S. The optical model of human skin. *Acta Laser Biology Sinica*, 14, pp.401-404, 2005.

[35] Bashkatov, A.N.; Genina, E.A.; Kochubey, V.I.; Tuchin, V.V. Optical properties of human skin, subcutaneous and mucous tissues in the wavelength range from 400 to 2000 nm. *J. Phys. D: Appl. Phys.*, 38, pp.2543-2555, 2005.

[36] Ingle, J.D.J.; Crouch, S.R. Spectrochemical Analysis. Prentice Hall, 1988.

[37] Ito, K.; Aoki, T.; Nakajima, H.; Kobayashi, K.; Higuchi, T. A palmprint recognition algorithm using phase-only correlation. *IEICE Trans. Fundamentals*, E91-A(4), pp.1023-1030, 2008.

[38] Miyazawa, K.; Ito, K.; Aoki, T.; Kobayashi, K.; Nakajima, H. An effective approach for iris recognition using phase-based image matching. *IEEE Trans. Pattern Anal. Mach. Intell.*, 30(10), pp.1741-1756, 2008.

[39] Zhang, L.; Zhang, L.; Zhang, D.; Zhu, H. Ensemble of local and global information for finger-knuckle-print recognition. *Pattern Recognition*, 44(9), pp.1990-1998, 2011.

[40] De Boer, J.F.; Van Rossum, M.C.W.; Van Albada, M.P.; Nieuwenhuizen, T.M.; Lagendijk, A. Probability distribution of multiple scattered light measured in total transmission. *Phys. Rev. Lett.*, 73(19), pp.2567-2570, 1994.

[41] Fruhwirth, R.; Liendl, M. Mixture models of multiple scattering: computation and simulation. *Computer Physics Communications*, 141, pp.230-246, 2001.

[42] Baranoski, G.V.G.; Krishnaswamy, A. An introduction to light interaction with human skin. *RITA*, 11(1), pp.33-62, 2004.

**Chapter 4**

**Basic Principles and Trends in Hand Geometry and Hand Shape Biometrics**

Miroslav Bača, Petra Grd and Tomislav Fotak

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51912

© 2012 Bača et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

Anthropologists believe that humans survived and developed to today's state (Homo sapiens) thanks to highly developed brains and separated thumbs. The easily moved and elastic human fist enables us to catch and throw various things, but also to make and use various kinds of tools in everyday life. Today, the human fist is not used just for that purpose, but also as a personal identifier, i.e., it can be used for personal identification.

Researchers in the field of biometrics found that the human hand, especially the human palm, contains characteristics that can be used for personal identification. These characteristics mainly include the thickness of the palm area and the width, thickness, and length of the fingers. A large number of commercial systems use these characteristics in various applications.

Hand geometry biometrics is not a new technique. It was first mentioned in the early 1970s, and it is older than the palm print, which is part of dactyloscopy. The first known use was for security checks in Wall Street.

Hand geometry is based on the palm and finger structure, including the width of the fingers in different places, the length of the fingers, the thickness of the palm area, etc. Although these measurements are not very distinctive among people, hand geometry can be very useful for identity verification, i.e., personal authentication. A special task is to combine some non-descriptive characteristics in order to achieve better identification results. This technique is widely accepted, and the verification includes simple data processing. The mentioned features make hand geometry an ideal candidate for research and development of new acquisition, preprocessing, and verification techniques.
