**Robust Density Comparison Using Eigenvalue Decomposition**

Omar Arif1 and Patricio A. Vela2

<sup>1</sup>*National University of Sciences and Technology, Pakistan* <sup>2</sup>*Georgia Institute of Technology, USA*

#### **1. Introduction**


Many problems in various fields require measuring the similarity between two distributions. Often, the distributions are represented through samples: no closed form exists for the distribution, or the best parametrization for it is unknown. Therefore, the traditional approach of first estimating the probability distribution from the samples and then computing the distance between the two estimated distributions is not feasible. In this chapter, a method to compute the similarity between two distributions, which is robust to noise and outliers, is presented. The method works directly on the samples without requiring the intermediate step of density estimation, although the approach is closely related to density estimation. The method is based on mapping the distributions into a reproducing kernel Hilbert space, where an eigenvalue decomposition is performed. Retaining only the top *M* eigenvectors minimizes the effect of noise on the density comparison.

The chapter is organized in two parts. First, we explain the procedure to obtain the robust density comparison method. The relation between the method and kernel principal component analysis (KPCA) is also explained, and the method is validated on synthetic examples. In the second part, we apply the method to the problem of visual tracking. In visual tracking, an initial target and its appearance are given, and the target must be located within future images. The target information is assumed to be characterized by a probability distribution. Tracking, in this scenario, is thus the problem of finding, within each image of a sequence, the distribution that best fits the given target distribution. Here, the object is tracked by minimizing the similarity measure between the model distribution and the candidate distribution, where the target position is the optimization variable.

#### **2. Mercer kernels**

Let $\{u_i\}_{i=1}^{n}$, $u_i \in \mathbb{R}^d$, be a set of $n$ observations. A Mercer kernel is a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ which satisfies:


**Theorem:** If k is a Mercer kernel, then there exists a high dimensional Hilbert space $\mathcal{H}$ with mapping $\phi : \mathbb{R}^d \to \mathcal{H}$ such that:

$$
\phi(u\_i) \cdot \phi(u\_j) = k(u\_i, u\_j). \tag{1}
$$

The Mercer kernel k implicitly maps the data to a Hilbert space H, where the dot product is given by the kernel k.

#### **3. Example**

Fig. 1. Toy example: the dot product in the mapped space can be computed using the kernel in the input space.

Figure 1 shows a simple binary classification example from Schölkopf & Smola (2001). The true decision boundary is given by the circle in the input space. The points in the input space, $u = [u_1, u_2]^T$, are mapped to $\mathbb{R}^3$ using the mapping $\phi(u) = [u_1^2, \sqrt{2}\, u_1 u_2, u_2^2]^T$. In $\mathbb{R}^3$, the decision boundary is transformed from a circle to a hyperplane, i.e. from a non-linear boundary to a linear one. There are many ways to carry out the mapping $\phi$, but the mapping defined above has the important property that the dot product in the mapped space is given by the square of the dot product in the input space. This means that the dot product in the mapped space can be obtained without explicitly computing the mapping $\phi$.

$$\begin{aligned} \phi(u) \cdot \phi(v) &= u\_1^2 v\_1^2 + 2u\_1 v\_1 u\_2 v\_2 + u\_2^2 v\_2^2 \\ &= \left( u\_1 v\_1 + u\_2 v\_2 \right)^2 = \left( u \cdot v \right)^2 = k(u, v). \end{aligned}$$
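As a quick numerical sanity check of this identity, the following sketch (using NumPy; the helper names are ours) evaluates both sides for random inputs:

```python
import numpy as np

def phi(u):
    """Explicit feature map phi(u) = [u1^2, sqrt(2)*u1*u2, u2^2] used in the example."""
    u1, u2 = u
    return np.array([u1 ** 2, np.sqrt(2.0) * u1 * u2, u2 ** 2])

def k_poly(u, v):
    """Polynomial Mercer kernel k(u, v) = (u . v)^2."""
    return float(np.dot(u, v)) ** 2

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

# The dot product in the mapped space equals the kernel evaluated in the input space.
assert np.isclose(np.dot(phi(u), phi(v)), k_poly(u, v))
```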

An example Mercer kernel is the Gaussian kernel:

$$k(u\_i, u\_j) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left(-\frac{1}{2}(u\_i - u\_j)^T \Sigma^{-1} (u\_i - u\_j)\right),\tag{2}$$

where Σ is a *d* × *d* covariance matrix.
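A minimal sketch of Equation (2) and of the corresponding Gram matrix, assuming the isotropic special case Σ = σ²I (the function names are illustrative):

```python
import numpy as np

def gaussian_kernel(ui, uj, sigma=1.0):
    """Gaussian Mercer kernel of Equation (2) with Sigma = sigma^2 * I."""
    ui, uj = np.asarray(ui, float), np.asarray(uj, float)
    diff = ui - uj
    norm = 1.0 / np.sqrt((2.0 * np.pi * sigma ** 2) ** ui.size)
    return norm * np.exp(-0.5 * np.dot(diff, diff) / sigma ** 2)

def gram_matrix(U, sigma=1.0):
    """Kernel (Gram) matrix K with K_ij = k(u_i, u_j) for a sample set U of shape (n, d)."""
    n = len(U)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = gaussian_kernel(U[i], U[j], sigma)
    return K
```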

#### **4. Maximum mean discrepancy**

Let $\{u_i\}_{i=1}^{n_u}$, with $u_i \in \mathbb{R}^d$, be a set of $n_u$ observations drawn from the distribution $P_u$. Define a mapping $\phi : \mathbb{R}^d \to \mathcal{H}$, such that $\langle \phi(u_i), \phi(u_j) \rangle = k(u_i, u_j)$, where k is a Mercer kernel function, such as the Gaussian kernel.

The mean of the mapping is defined as $\mu : P_u \to \mu[P_u]$, where $\mu[P_u] = E[\phi(u_i)]$. If the finite sample of points $\{u_i\}_{i=1}^{n_u}$ is drawn from the distribution $P_u$, then the unbiased numerical estimate of the mean mapping $\mu[P_u]$ is $\frac{1}{n_u} \sum_{i=1}^{n_u} \phi(u_i)$. Smola et al. (2007) showed that the mean mapping can be used to compute the probability at a test point $u \in \mathbb{R}^d$ as

$$p(u) = \langle \mu[P\_u], \phi(u) \rangle \approx \frac{1}{n\_u} \sum\_{i=1}^{n\_u} k(u, u\_i). \tag{3}$$

Equation (3) results in the familiar Parzen window density estimator. In terms of the Hilbert space embedding, the density function estimate results from the inner product of the mapped point *φ*(*u*) with the mean of the distribution *μ*[*Pu*]. The mean map *μ* : *Pu* → *μ*[*Pu*] is injective, Smola et al. (2007), and allows for the definition of a similarity measure between two sampled sets *Pu* and *Pv*, sampled from the same or two different distributions. The measure is defined to be *D*(*Pu*, *Pv*) := ||*μ*[*Pu*] − *μ*[*Pv*]||. This similarity measure is called the maximum mean discrepancy (MMD). MMD has been used to address the two sample problem, Gretton et al. (2007). The next section introduces Robust MMD (rMMD).
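The following sketch computes the Parzen estimate of Equation (3) and the (biased) empirical MMD between two sample sets, assuming a Gaussian kernel with the normalization constant dropped, since it only rescales the values; everything is expressed through the kernel in the input space:

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix between sample sets A (n x d) and B (m x d)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / sigma ** 2)

def parzen(u, U, sigma=1.0):
    """Equation (3): p(u) = <mu[P_u], phi(u)> = (1/n_u) sum_i k(u, u_i)."""
    return float(gram(np.atleast_2d(u), U, sigma).mean())

def mmd(U, V, sigma=1.0):
    """MMD D(P_u, P_v) = ||mu[P_u] - mu[P_v]||, expanded with the kernel trick."""
    d2 = gram(U, U, sigma).mean() - 2.0 * gram(U, V, sigma).mean() + gram(V, V, sigma).mean()
    return float(np.sqrt(max(d2, 0.0)))
```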

#### **5. Robust maximum mean discrepancy**

In the proposed method, principal component analysis is carried out in the Hilbert space H and the eigenvectors corresponding to the leading eigenvalues are retained. It is assumed that the lower eigenvectors capture the noise present in the data set. Mapped points in the Hilbert space are reconstructed by projecting them onto the eigenvectors. The reconstructed points are then used to compute the robust mean map. All the computations in the Hilbert space are performed through the Mercer kernel in the input space and no explicit mapping is carried out.

#### **5.1 Eigenvalue decomposition**

Let $\{u_i\}_{i=1}^{n_u}$, with $u_i \in \mathbb{R}^d$, be a set of $n_u$ observations. As mentioned before, if k is a Mercer kernel, then there exists a high dimensional Hilbert space $\mathcal{H}$ with mapping $\phi : \mathbb{R}^d \to \mathcal{H}$. The covariance matrix $C_{\mathcal{H}}$ in the Hilbert space $\mathcal{H}$ is given by

$$C\_{\mathcal{H}} = \frac{1}{n\_u} \sum\_{i=1}^{n\_u} \phi(u\_i)\phi(u\_i)^T.$$

Empirical computation of $C_{\mathcal{H}}$ requires one to know the mapping up front. A technique to avoid this requirement is to perform the eigenvalue decomposition of the covariance matrix $C_{\mathcal{H}}$ using the inner product matrix $K$, called the Gram/kernel matrix, with $K_{ij} = \phi(u_i)^T \phi(u_j) = k(u_i, u_j)$. The Gram matrix allows for an eigenvalue/eigenvector decomposition of the covariance matrix without explicitly computing the mapping $\phi$, since the Gram matrix itself can be computed using the Mercer kernel. If $a_i^k$, $i = 1, \ldots, n_u$, and $\lambda^k$ are the $k$-th eigenvector components and eigenvalue of the kernel matrix $K$, then the $k$-th eigenvector of the covariance matrix $C_{\mathcal{H}}$ is given by Leventon (2002)

$$V^k = \frac{1}{\sqrt{\lambda^k}} \sum\_{i=1}^{n\_u} a\_i^k \phi(u\_i).$$
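In practice the eigenpairs of *K* are obtained with a standard symmetric eigensolver; a sketch follows (centering of the Gram matrix, often used in KPCA, is omitted here as in the text):

```python
import numpy as np

def leading_eigenpairs(K, m):
    """Return the m leading eigenvalues lambda^k and eigenvector columns a^k of the Gram
    matrix K; the k-th eigenvector of C_H is then (1/sqrt(lambda^k)) * sum_i a_i^k phi(u_i)."""
    lam, A = np.linalg.eigh(K)            # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]     # keep the m leading eigenpairs
    return lam[order], A[:, order]
```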


Fig. 2. Eigenvalue decomposition in the Hilbert space <sup>H</sup>. Observations {*ui*}*nu <sup>i</sup>*=<sup>1</sup> are mapped implicitly to the Hilbert space where eigenvalue decomposition results in an *m*-dimensional reduced space.

#### **5.2 Robust density function**

Let <sup>V</sup> = [*V*1, ··· , *<sup>V</sup>m*] be the m leading eigenvectors of the covariance matrix *<sup>C</sup>*H, where the eigenvector *V<sup>k</sup>* is given by

$$V^k = \sum\_{i=1}^{n\_u} \alpha\_i^k \phi(u\_i) \qquad \text{with } \alpha\_i^k = \frac{a\_i^k}{\sqrt{\lambda^k}}.$$

where *λ<sup>k</sup>* and *a<sup>k</sup> <sup>i</sup>* are the *<sup>k</sup>th* eigenvalue and its associated eigenvector components, respectively, of the kernel matrix *K*. The reconstruction of the point *φ*(*u*) in the Hilbert space H using *m* eigenvectors V is

$$\phi\_r(u) = \mathcal{V} \cdot \mathbf{f}(u),\tag{4}$$

where $\mathbf{f}(u) = [f^1(u), \ldots, f^m(u)]^T$ is a vector whose components are the projections onto each of the *m* eigenvectors. The projections are given by

$$f^k(u) = V^k \cdot \phi(u) = \sum\_{i=1}^{n\_u} \alpha\_i^k k(u\_i, u). \tag{5}$$

This procedure is schematically described in Figure 2.

Kernel principal component analysis (**KPCA**), Schölkopf et al. (1998), is a non-linear extension of principal component analysis using a Mercer kernel k. The eigenvectors $\mathcal{V}$ are the principal components and the KPCA projections are given by Equation (5).

The reconstructed points, *φr*(*u*), are used to compute the numerical estimate of the robust mean mapping *μr*[*Pu*]:

$$\mu\_r[P\_u] = \frac{1}{n\_u} \sum\_{i=1}^{n\_u} \phi\_r(u\_i) = \frac{1}{n\_u} \sum\_{i=1}^{n\_u} \sum\_{k=1}^m V^k f^k(u\_i) = \sum\_{k=1}^m \omega^k V^k,$$

where

$$
\omega^k = \frac{1}{n\_u} \sum\_{i=1}^{n\_u} f^k(u\_i) \tag{6}
$$

The density at a point *u* is then estimated by the inner-product of the robust mean map *μr*[*Pu*] and the mapped point *φ*(*u*).

$$p(u) = \mu\_r[P\_u] \cdot \phi(u) = \sum\_{k=1}^{m} \omega^k f^k(u). \tag{7}$$
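Putting Equations (5)-(7) together, a sketch of the robust density estimate, assuming a Gaussian kernel (the helper names are ours):

```python
import numpy as np

def robust_density(U, m, sigma=1.0):
    """Build the robust density estimate p(u) of Equation (7) from the m leading eigenvectors."""
    def gram(A, B):
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-0.5 * sq / sigma ** 2)

    K = gram(U, U)                               # kernel matrix of the observations
    lam, A = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:m]
    alpha = A[:, order] / np.sqrt(lam[order])    # alpha_i^k = a_i^k / sqrt(lambda^k)
    F = K @ alpha                                # f^k(u_i) for every sample, Equation (5)
    omega = F.mean(axis=0)                       # omega^k, Equation (6)

    def p(u):
        f_u = (gram(np.atleast_2d(u), U) @ alpha)[0]   # projections f^k(u)
        return float(f_u @ omega)                      # Equation (7)

    return p
```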

Retention of only the leading eigenvectors in the procedure minimizes the effects of noise on the density estimate. Figure 3(d) shows density estimation of a multimodal Gaussian distribution in the presence of noise using the robust method. The effect of noise is less pronounced as compared to the kernel density estimation (Figure 3(b)). An alternate procedure that reaches the same result (Equation 7) from a different perspective is proposed by Girolami (2002). There, the probability density is estimated using orthogonal series of functions, which are then approximated using the KPCA eigenfunctions.

#### **5.2.1 Example:**

As mentioned before, a kernel density estimate, obtained as per Equation (3), is computable using the inner product of the mapped test point and the mean mapping. The sample mean can be influenced by outliers and noise. In Kim & Scott (2008), the sample mean is replaced with a robust estimate using M-estimation, Huber et al. (1981). The resulting density function is given by

$$p(u) = \left\langle \hat{\mu}[P\_u], \phi(u) \right\rangle,\tag{8}$$

where the sample mean *μ*[*Pu*] is replaced with a robust mean estimator *μ*ˆ[*Pu*]. The robust mean estimator is computed using the M-estimation criterion

$$\hat{\mu}[P\_u] = \underset{\mu[P\_u] \in \mathcal{H}}{\arg\min} \sum\_{i=1}^{n\_u} \rho(||\phi(u\_i) - \mu[P\_u]||),\tag{9}$$

where *ρ* is a robust loss function. Iterative re-weighted least squares (IRWLS) is used to compute the robust mean estimate. IRWLS depends only on the inner products and can be efficiently implemented using the Mercer kernel, Kim & Scott (2008). The resulting density estimation function is

$$p(u) = \frac{1}{n\_u} \sum\_{i=1}^{n\_u} \gamma\_i k(u, u\_i), \tag{10}$$

where $\gamma_i \ge 0$, $\sum_i \gamma_i = 1$, and the $\gamma_i$ are obtained through the IRWLS algorithm. The $\gamma_i$ values tend to be low for outlier data points.
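For comparison, a minimal sketch of the kernelized IRWLS iteration behind Equations (8)-(10), assuming a Huber loss with an illustrative threshold *c* (Kim & Scott (2008) discuss other choices):

```python
import numpy as np

def irwls_weights(K, c=1.0, iters=20):
    """Return weights gamma_i >= 0 with sum(gamma) = 1 for the robust mean of Equation (9)."""
    n = K.shape[0]
    gamma = np.full(n, 1.0 / n)                        # start from the ordinary sample mean
    for _ in range(iters):
        # r_i = ||phi(u_i) - sum_j gamma_j phi(u_j)||_H, computed via the kernel trick
        r2 = np.diag(K) - 2.0 * K @ gamma + gamma @ K @ gamma
        r = np.sqrt(np.maximum(r2, 1e-12))
        w = np.minimum(1.0, c / r)                     # Huber weight psi(r)/r
        gamma = w / w.sum()
    return gamma
```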

In this section, we compare the performance of kernel density estimation (Equation (3)), robust density estimation using M-estimation (Equation (10)) and robust density estimation using eigenvalue decomposition (Equation 7).

A sample set *X* of 150 points is generated from a 2-dimensional multimodal Gaussian distribution

$$X \sim \frac{1}{3} \mathcal{N}\_1(\mu\_1, \Sigma\_1) + \frac{1}{3} \mathcal{N}\_2(\mu\_2, \Sigma\_2) + \frac{1}{3} \mathcal{N}\_3(\mu\_3, \Sigma\_3), \tag{11}$$

where $\mu_1 = [3, 3]^T$, $\mu_2 = [-3, 3]^T$, $\mu_3 = [0, -3]^T$ and $\Sigma_1 = \Sigma_2 = \Sigma_3 = I$. Outliers are added from a uniform distribution over the domain $[-6, 6] \times [-6, 6]$. Figure 3 shows the true density and the densities estimated using the three methods; data points corresponding to the outliers are marked with +.
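A sketch of generating the sample set of Equation (11) and contaminating it with uniform outliers (the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n=150):
    """Draw n points from the three-component Gaussian mixture of Equation (11)."""
    means = np.array([[3.0, 3.0], [-3.0, 3.0], [0.0, -3.0]])
    comps = rng.integers(0, 3, size=n)               # equal 1/3 mixture weights
    return means[comps] + rng.normal(size=(n, 2))    # identity covariances

def add_outliers(X, n_out):
    """Append n_out outliers drawn uniformly over [-6, 6] x [-6, 6]."""
    return np.vstack([X, rng.uniform(-6.0, 6.0, size=(n_out, 2))])

X = add_outliers(sample_mixture(), n_out=60)         # e.g. 60 outliers
```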


Fig. 3. Density estimation comparisons. The effect of outliers is less pronounced in the robust density estimation using eigenvalue decomposition.

To measure the performance of the density estimates, the Bhattacharyya distance is used. The number of outliers used in the tests is varied over Γ = [20, 40, 60, 80, 100]. At each Γ the simulations are run 50 times and the average Bhattacharyya distance is recorded. The results are shown in Figure 4. The number of eigenvectors retained for the robust density estimation was 8.

#### **5.3 Robust maximum mean discrepancy**

The robust mean map $\mu_r : P_u \to \mu_r[P_u]$, with $\mu_r[P_u] := \sum_{k=1}^{m} \omega^k V^k$, is used to define the similarity measure between the two distributions $P_u$ and $P_v$. We call it the robust MMD (rMMD),

$$D\_r(P\_u, P\_v) := ||\mu\_r[P\_u] - \mu\_r[P\_v]||.$$

The mean map $\mu_r[P_v]$ for the samples $\{v_i\}_{i=1}^{n_v}$ is calculated by repeating the same procedure as for $P_u$. This may be computationally expensive as it requires eigenvalue decomposition of the kernel matrices. Further, the two eigenspaces may be unrelated. The proposed solution is to use the eigenvectors $V^k$ of the distribution $P_u$.

Fig. 4. Bhattacharyya distance measure between true and estimated densities. Red: Robust density estimation using eigenvalue decomposition, Green: robust density estimation using M-estimation, Blue: Kernel density estimation.

The similarity measure between the samples is then given by

$$D\_r(P\_u, P\_v) = ||\omega\_u - \omega\_v||, \tag{12}$$

where $\omega_u = [\omega_u^1, \ldots, \omega_u^m]^T$ and $\omega_v = [\omega_v^1, \ldots, \omega_v^m]^T$. Since both mean maps live in the same eigenspace, the eigenvectors $V^k$ have been dropped from Equation (12).
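A sketch of the rMMD computation of Equation (12), assuming a Gaussian kernel; both mean maps are expressed in the eigenspace learned from the samples of $P_u$:

```python
import numpy as np

def rmmd(U, V, m, sigma=1.0):
    """Robust MMD D_r(P_u, P_v) = ||omega_u - omega_v|| of Equation (12)."""
    def gram(A, B):
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-0.5 * sq / sigma ** 2)

    K = gram(U, U)
    lam, A = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:m]
    alpha = A[:, order] / np.sqrt(lam[order])       # alpha_i^k = a_i^k / sqrt(lambda^k)
    omega_u = (K @ alpha).mean(axis=0)              # Equation (6) for P_u
    omega_v = (gram(V, U) @ alpha).mean(axis=0)     # omega_v^k = (1/n_v) sum_i f^k(v_i)
    return float(np.linalg.norm(omega_u - omega_v))
```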

#### **5.4 Summary**

The procedure is summarized below.

- Given samples $\{u_i\}_{i=1}^{n_u}$ and $\{v_i\}_{i=1}^{n_v}$ from two distributions $P_u$ and $P_v$.
- Form the kernel matrix $K$ using the samples from the distribution $P_u$. Diagonalize the kernel matrix to get eigenvectors $\mathbf{a}^k = [a_1^k, \ldots, a_{n_u}^k]$ and eigenvalues $\lambda^k$ for $k = 1, \ldots, m$, where $m$ is the total number of eigenvectors retained.
- Calculate $\omega_u$ using Equation (6), and $\omega_v$ by $\omega_v^k = \frac{1}{n_v} \sum_{i=1}^{n_v} f^k(v_i)$.
- The similarity of $P_v$ to $P_u$ is given by Equation (12).


#### **5.5 Example 1**

As a simple synthetic example (visual tracking examples will be given in the next section), we compute MMD and robust MMD between two distributions. The first one is a multi-modal Gaussian distribution given by

$$X \sim \frac{1}{3} \mathcal{N}\_1(\mu\_1, \Sigma\_1) + \frac{1}{3} \mathcal{N}\_2(\mu\_2, \Sigma\_2) + \frac{1}{3} \mathcal{N}\_3(\mu\_3, \Sigma\_3),\tag{13}$$

where $\mu_1 = [0, 0]^T$, $\mu_2 = [5, 5]^T$, $\mu_3 = [5, -5]^T$ and $\Sigma_1 = \Sigma_2 = \Sigma_3 = 0.5 \times I$. The sample points for the other distribution are obtained from the first one by adding Gaussian noise to about 50% of the samples. Figure 5(c) shows the MMD and robust MMD measures as the standard deviation of the noise is increased. The slope of robust MMD is lower than that of MMD, showing that it is less sensitive to noise. In Figure 6, the absolute value of the difference between the two distributions is plotted for the MMD and rMMD measures. The samples from the two distributions are shown in red and blue. The effect of noise is more pronounced in the case of MMD.
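A sketch of this experiment, reusing the `mmd` and `rmmd` helpers from the earlier sketches; the noise levels and the number of retained eigenvectors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_eq13(n=150):
    """Samples from the mixture of Equation (13)."""
    means = np.array([[0.0, 0.0], [5.0, 5.0], [5.0, -5.0]])
    comps = rng.integers(0, 3, size=n)
    return means[comps] + np.sqrt(0.5) * rng.normal(size=(n, 2))

U = sample_eq13()
for noise_std in np.linspace(0.0, 2.5, 6):
    V = U.copy()
    idx = rng.random(len(V)) < 0.5                   # perturb about 50% of the samples
    V[idx] += noise_std * rng.normal(size=(int(idx.sum()), 2))
    print(noise_std, mmd(U, V), rmmd(U, V, m=20))
```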

Fig. 5. MMD vs robust MMD. (a) Distribution 1 samples. (b) Distribution 2 is obtained by adding noise to distribution 1. (c) Curves measure the similarity between the distributions as the noise level increases.

Fig. 6. Illustration of the effect of noise on the difference between the two distributions. (a) Difference function for MMD. (b) Difference function for rMMD. The samples from the two distributions are shown in red and blue.

#### **5.6 Example 2**

Consider another example of a 2-dimensional swiss roll. The data set is generated by the following function.

$$\begin{aligned} t &= \frac{3}{2} \pi (1 + 2r), \quad \text{where } r \ge 0, \\ x &= t \cos(t), \\ y &= t \sin(t), \end{aligned}$$

where *x* and *y* are the coordinates of the data points. 300 points are uniformly sampled and are shown in Figure 7(a). The noisy data sets are obtained by adding Gaussian noise to the original data set at standard deviations $\sigma \in [0, 1.5]$. For example, Figure 7(b) shows a noisy data set at $\sigma = 1$. We measure the similarity between the two data sets using MMD and rMMD; 20 eigenvectors are retained for the rMMD computation. Figure 8 shows the MMD and rMMD measures as the standard deviation of the noise is increased. The slope of rMMD is lower than that of MMD, showing that it is less sensitive to noise.
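A sketch of sampling the swiss roll data set and a noisy copy of it, as used in this comparison (the range of *r* is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def swiss_roll_2d(n=300):
    """Uniformly sample n points from the 2-D swiss roll defined above."""
    r = rng.uniform(0.0, 1.0, size=n)               # r >= 0; [0, 1] is an assumed range
    t = 1.5 * np.pi * (1.0 + 2.0 * r)
    return np.column_stack([t * np.cos(t), t * np.sin(t)])

X = swiss_roll_2d()
X_noisy = X + 1.0 * rng.normal(size=X.shape)        # noisy data set at sigma = 1
```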

Fig. 7. Swiss roll example. (a) Swiss roll data set. (b) Noisy swiss roll data set at σ = 1.

Fig. 8. Curves measure the similarity between the two data sets as the noise level is increased.

As mentioned earlier, the eigenvectors corresponding to the lower eigenvalues capture the noise present in the data set. The rMMD measure uses the reconstructed points $\phi_r$ (Equation 4) to compute the robust mean map. The reconstructed points are obtained by using only the leading eigenvectors; therefore, the effect of noise on the reconstructed points is reduced. We use the method described in Rathi, Dambreville & Tannenbaum (2006) to visualize the reconstructed points in the input space. Figure 9 shows the reconstructed points using 10, 20 and 30 eigenvectors. The blue dots are the noisy data set and the red dots are the reconstructed points. It is clear from the figure that the reconstructed data points obtained using a few leading eigenvectors match the original data set faithfully.

Fig. 9. Reconstruction of the noisy points using 10, 20 and 30 eigenvectors, Rathi, Dambreville & Tannenbaum (2006). Blue: noisy data set, Red: reconstructed points.

of density estimation. Also, the model density function is designed to capture the appearance and spatial characteristics of the target object.

#### **6.1 Extracting target feature vectors**

The feature vector associated to a given pixel is a *d*-dimensional concatenation of a *p*-dimensional appearance vector and a 2-dimensional spatial vector $u = [F(x), x]$, where $F(x)$ is the *p*-dimensional appearance vector extracted from I at the spatial location *x*,

$$F(x) = \Gamma(I, x),$$

where Γ can be any mapping such as color I(*x*), image gradient, edge, texture, etc., any combination of these, or the output from a filter bank (Gabor filter, wavelet, etc.).

The feature vectors are extracted from the segmented target template image(s). The set of all feature vectors defines the target input space **D**,

$$\mathbf{D} = \{u\_1, u\_2, \ldots, u\_n\},$$

where *n* is the total number of feature vectors extracted from the template image(s). The set of all pixel vectors, $\{u_i\}_{i=1}^{n_u}$, extracted from the template region *R*, are observations from an underlying density function $P_u$. To locate the object in an image, a region $\tilde{R}$ (with samples $\{v_i\}_{i=1}^{n_v}$) with density $P_v$ is sought which minimizes the rMMD measure given by Equation (12). The kernel in this case is

$$k(u\_i, u\_j) = \exp\left(-\frac{1}{2}(u\_i - u\_j)^T \Sigma^{-1} (u\_i - u\_j)\right), \tag{14}$$

where Σ is a *d* × *d* diagonal matrix with bandwidths for each appearance-spatial coordinate, $\{\sigma_{F_1}, \ldots, \sigma_{F_p}, \sigma_{s_1}, \sigma_{s_2}\}$.

An exhaustive search can be performed to find the region or, starting from an initial guess, gradient based methods can be used to find the local minimum. For the latter approach, we provide a variational localization procedure below.

#### **6.2 Variational target localization**

Assume that the target object undergoes a geometric transformation from region *R* to a region $\tilde{R}$, such that $R = T(\tilde{R}, a)$, where $a = [a_1, \ldots, a_g]$ is a vector containing the parameters of the transformation and *g* is the total number of transformation parameters. Let $\{u_i\}_{i=1}^{n_u}$ and $\{v_i\}_{i=1}^{n_v}$ be the samples extracted from regions *R* and $\tilde{R}$, and let $v_i = [F(\tilde{x}_i), T(\tilde{x}_i, a)]^T = [F(\tilde{x}_i), x_i]^T$. The rMMD measure between the distributions of the regions *R* and $\tilde{R}$, given by Equation (12) with the $L_2$ norm, is

$$D\_r = \sum\_{k=1}^{m} \left( \omega\_u^k - \omega\_v^k \right)^2, \tag{15}$$

where the *m*-dimensional robust mean maps for the two regions are $\omega_u^k = \frac{1}{n_u}\sum_{i=1}^{n_u} f^k(u_i)$ and $\omega_v^k = \frac{1}{n_v}\sum_{i=1}^{n_v} f^k(v_i)$. Gradient descent can be used to minimize the rMMD measure with
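To make the tracking formulation concrete, the sketch below extracts joint appearance-spatial feature vectors (using raw RGB color as one possible choice of Γ) and evaluates the localization cost of Equation (15) for a candidate sample set; for brevity a single isotropic bandwidth replaces the diagonal Σ, and the candidate is scored directly rather than through the variational update:

```python
import numpy as np

def extract_features(image, region):
    """Feature vectors u = [F(x), x] for all pixels x inside a rectangular region.
    image is an (H, W, 3) array; region is ((r0, r1), (c0, c1))."""
    (r0, r1), (c0, c1) = region
    feats = [np.concatenate([image[r, c], [r, c]])
             for r in range(r0, r1) for c in range(c0, c1)]
    return np.asarray(feats, dtype=float)

def localization_cost(omega_u, alpha, U_model, V_cand, sigma=1.0):
    """Equation (15): squared difference between the reduced mean maps, with the candidate
    samples projected onto the eigenvectors learned from the model region."""
    sq = np.sum(V_cand ** 2, 1)[:, None] + np.sum(U_model ** 2, 1)[None, :] - 2.0 * V_cand @ U_model.T
    K_vu = np.exp(-0.5 * sq / sigma ** 2)          # k(v_i, u_j)
    omega_v = (K_vu @ alpha).mean(axis=0)          # omega_v^k = (1/n_v) sum_i f^k(v_i)
    return float(np.sum((omega_u - omega_v) ** 2))
```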
