**6. Visual tracking through density comparison**

In the first part of the chapter, we presented a technique to robustly compare two distributions represented by their samples. One application of the technique is visual target tracking. A template, or a set of templates, of the target object defines a model distribution, and the object is tracked by finding the corresponding region in consecutive images: each image of the sequence is searched for the region whose candidate distribution most closely matches the model distribution. A key requirement here is that the similarity measure be robust to noise and outliers, which arise for a number of reasons such as noise in the imaging process, background clutter, and partial occlusions.

One popular algorithm, the mean shift tracker Comaniciu et al. (2003), uses a spatially kernel-weighted histogram as the probability density function of the object region. The correspondence of the target object between sequential frames is established at the region level by maximizing the Bhattacharyya coefficient between the target and candidate distributions using mean shift Cheng (1995). Instead of the Bhattacharyya coefficient, Hager et al. (2004) use the Matusita distance between kernel-modulated histograms as the distance measure between the two distributions. The Matusita distance is optimized using Newton-style iterations, which converge faster than mean shift. Histograms discard spatial information, which becomes problematic when faced with occlusions and/or the presence of target features in the background. In Birchfield & Rangarajan (2005), histograms were generalized to include spatial information, leading to spatiograms. A spatiogram augments each histogram bin with the spatial means and covariances of the pixels comprising the bin, and captures the probability density function of the image values; the similarity between two such density functions was computed using the Bhattacharyya coefficient. Elgammal et al. (2003) employ a joint appearance-spatial density estimate and measure the similarity of the model and candidate distributions using the Kullback-Leibler information distance.

Similarly, measuring the similarity/distance between two distributions is also required in image segmentation. For example, in some contour-based segmentation algorithms (Freedman & Zhang (2004); Rathi, Malcolm & Tannenbaum (2006)), the contour is evolved either to separate the distributions of the pixels inside and outside the contour, or so that the distribution of the pixels inside matches a prior distribution of the target object. In both cases, the distance between the distributions is calculated using the Bhattacharyya coefficient or the Kullback-Leibler information distance.

The algorithms described above require computing probability density functions from the samples, which becomes computationally expensive in higher dimensions. Another problem associated with computing probability density functions is the sparseness of the observations within the *d*-dimensional feature space, especially when the sample set size is small. This makes similarity measures such as the Kullback-Leibler divergence and the Bhattacharyya coefficient computationally unstable Yang & Duraiswami (2005). Additionally, these techniques require sophisticated space partitioning and/or bias correction strategies Smola et al. (2007).

This section describes how the robust maximum mean discrepancy (rMMD) measure, introduced in the first part of this chapter, can be used for visual tracking. The similarity between two distributions is computed directly from the samples, without the intermediate step of density estimation. In addition, the model distribution is designed to capture both the appearance and the spatial characteristics of the target object.

#### **6.1 Extracting target feature vectors**



The feature vector associated with a given pixel is a *d*-dimensional concatenation of a *p*-dimensional appearance vector and a 2-dimensional spatial vector, *u* = [F(*x*), *x*], where F(*x*) is the *p*-dimensional appearance vector extracted from the image I at spatial location *x*,

$$F(x) = \Gamma(I, x), \tag{13}$$

where Γ can be any mapping such as color I(*x*), image gradient, edge, texture, etc., any combination of these, or the output from a filter bank (Gabor filter, wavelet, etc.).

The feature vectors are extracted from the segmented target template image(s). The set of all feature vectors defines the target input space **D**,

$$\mathbb{D} = \{ u_1, u_2, \dots, u_n \},$$

where *n* is the total number of feature vectors extracted from the template image(s). The set of all pixel vectors, $\{u_i\}_{i=1}^{n_u}$, extracted from the template region *R*, are observations from an underlying density function $P_u$. To locate the object in an image, a region $\tilde{R}$ (with samples $\{v_i\}_{i=1}^{n_v}$) with density $P_v$ is sought which minimizes the rMMD measure given by Equation (12). The kernel in this case is

$$\mathbf{k}(u\_i, u\_j) = \exp\left(-\frac{1}{2}(u\_i - u\_j)^T \Sigma^{-1} (u\_i - u\_j)\right),\tag{14}$$

where Σ is a *d* × *d* diagonal matrix with bandwidths for each appearance-spatial coordinate, $\{\sigma_{F_1}, \dots, \sigma_{F_p}, \sigma_{s_1}, \sigma_{s_2}\}$.
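As a minimal illustrative sketch, and not the implementation used in the experiments, the following NumPy code builds appearance-spatial pixel vectors and evaluates the kernel of Equation (14). The helper names are ours, Σ is taken here as the diagonal matrix of squared bandwidths, and the bandwidth values mirror those reported in Section 6.3.

```python
import numpy as np

def extract_features(image, region):
    """Stack appearance and spatial values into pixel vectors u = [F(x), x].

    `image` is an (H, W, p) array (e.g. p = 3 colour channels) and `region`
    is a boolean mask selecting the template pixels.  Here Gamma is simply
    the colour value I(x); gradients, texture or filter-bank responses
    could be concatenated in the same way.
    """
    ys, xs = np.nonzero(region)
    appearance = image[ys, xs].reshape(len(ys), -1)      # F(x), shape (n, p)
    spatial = np.stack([xs, ys], 1).astype(float)        # x,    shape (n, 2)
    return np.hstack([appearance, spatial])              # u,    shape (n, p + 2)

def gaussian_kernel(U, V, sigmas):
    """Equation (14) with a diagonal Sigma whose entries are the squared
    bandwidths of each appearance-spatial coordinate."""
    diff = U[:, None, :] - V[None, :, :]                 # (n_u, n_v, d)
    return np.exp(-0.5 * np.sum((diff / sigmas) ** 2, axis=2))

# Bandwidths as in Section 6.3: sigma_F = 60 per colour channel, sigma_s = 4.
sigmas = np.array([60.0, 60.0, 60.0, 4.0, 4.0])
```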

An exhaustive search can be performed to find the region, or, starting from an initial guess, gradient-based methods can be used to find the local minimum. For the latter approach, we provide a variational localization procedure below.
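For the former approach, a sketch of a brute-force scan is given below. It assumes a hypothetical `rmmd` callable that evaluates Equation (12) for a candidate sample set against the stored model, and reuses `extract_features` from the previous sketch.

```python
import numpy as np

def exhaustive_search(image, rmmd, template_shape, step=2):
    """Slide a window over the image and keep the location whose samples
    minimise the rMMD measure.  `rmmd(candidate_features)` is a hypothetical
    wrapper around Equation (12) for the model distribution."""
    h, w = template_shape
    best_loc, best_score = None, np.inf
    for y in range(0, image.shape[0] - h + 1, step):
        for x in range(0, image.shape[1] - w + 1, step):
            mask = np.zeros(image.shape[:2], dtype=bool)
            mask[y:y + h, x:x + w] = True
            score = rmmd(extract_features(image, mask))
            if score < best_score:
                best_loc, best_score = (y, x), score
    return best_loc, best_score
```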

#### **6.2 Variational target localization**

Assume that the target object undergoes a geometric transformation from region *R* to a region $\tilde{R}$, such that $R = T(\tilde{R}, a)$, where $a = [a_1, \dots, a_g]$ is a vector containing the parameters of the transformation and *g* is the total number of transformation parameters. Let $\{u_i\}_{i=1}^{n_u}$ and $\{v_i\}_{i=1}^{n_v}$ be the samples extracted from regions *R* and $\tilde{R}$, with $v_i = [F(\tilde{x}_i), T(\tilde{x}_i, a)]^T = [F(\tilde{x}_i), x_i]^T$. The rMMD measure between the distributions of the regions *R* and $\tilde{R}$ is given by Equation (12), which, written as an $L_2$ norm between the robust mean maps, becomes

$$D\_r = \sum\_{k=1}^{m} \left(\omega\_u^k - \omega\_v^k\right)^2 \tag{15}$$

where the *m*-dimensional robust mean maps for the two regions are $\omega_u^k = \frac{1}{n_u}\sum_{i=1}^{n_u} f^k(u_i)$ and $\omega_v^k = \frac{1}{n_v}\sum_{i=1}^{n_v} f^k(v_i)$. Gradient descent can be used to minimize the rMMD measure with



respect to the transformation parameter *a*. The gradient of Equation (15) with respect to the transformation parameters *a* is

$$\nabla\_{\boldsymbol{a}} D\_{\boldsymbol{r}} = -2 \sum\_{k=1}^{m} \left( \omega\_{\boldsymbol{u}}^{k} - \omega\_{\boldsymbol{v}}^{k} \right) \nabla\_{\boldsymbol{a}} \omega\_{\boldsymbol{v}}^{k}$$

where $\nabla_a \omega_v^k = \frac{1}{n_v}\sum_{i=1}^{n_v} \nabla_a f^k(v_i)$. The gradient of $f^k(v_i)$ with respect to $a$ is

$$\nabla_a f^k(v_i) = \nabla_x f^k(v_i) \cdot \nabla_a T(\tilde{x}, a),$$

where $\nabla_a T(\tilde{x}, a)$ is a $g \times 2$ Jacobian matrix of $T$, given by $\nabla_a T = [\frac{\partial T}{\partial a_1}, \dots, \frac{\partial T}{\partial a_g}]^T$. The gradient $\nabla_x f^k(v_i)$ is computed as

$$\nabla_x f^k(v_i) = \frac{1}{\sigma_s^2} \sum_{j=1}^{n_u} w_j^k\, \mathbf{k}(u_j, v_i)\left(\pi_s(u_j) - x_i\right),$$

where $\pi_s$ is the projection of the *d*-dimensional pixel vector onto its spatial coordinates, such that $\pi_s(u) = x$, and $\sigma_s$ is the spatial bandwidth parameter used in the kernel k. The transformation parameters are updated using the following equation,

$$a(t+1) = a(t) - \delta t\, \nabla_a D_r,$$

where *δt* is the time step.
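A sketch of this update for a pure translation (the motion model used in the experiments) is given below. It assumes, as in the first part of the chapter, that each eigenfunction is evaluated as $f^k(v) = \sum_j w_j^k\, \mathbf{k}(u_j, v)$, with the weight matrix `W` obtained from the eigendecomposition of the model kernel matrix; for a translation, $\nabla_a T$ is the identity, so $\nabla_a f^k = \nabla_x f^k$. The `gaussian_kernel` helper is the one from the Section 6.1 sketch, and the names and defaults are ours, not the original implementation.

```python
import numpy as np

def localize_translation(U, W, omega_u, app, coords, a0, sigmas, sigma_s,
                         step=0.1, iters=50):
    """Minimise D_r of Equation (15) over a 2-D translation a by gradient descent.

    U       : (n_u, d) model pixel vectors, spatial part in the last 2 columns
    W       : (m, n_u) eigenvector weights, so f^k(v) = sum_j W[k, j] k(u_j, v)
    omega_u : (m,) robust mean map of the model region
    app     : (n_v, p) appearance part F(x~) of the candidate samples
    coords  : (n_v, 2) template-frame coordinates x~ of the candidate samples
    """
    a = np.array(a0, dtype=float)
    for _ in range(iters):
        x = coords + a                                # T(x~, a) for a translation
        V = np.hstack([app, x])                       # candidate samples v_i
        K = gaussian_kernel(U, V, sigmas)             # (n_u, n_v), Equation (14)
        F = W @ K                                     # f^k(v_i), shape (m, n_v)
        omega_v = F.mean(axis=1)                      # robust mean map omega_v^k
        # nabla_x f^k(v_i) = (1/sigma_s^2) sum_j W[k, j] k(u_j, v_i)(pi_s(u_j) - x_i)
        diff = (U[:, None, -2:] - x[None, :, :]) / sigma_s ** 2   # (n_u, n_v, 2)
        grad_f = np.einsum('kj,ji,jic->kic', W, K, diff)          # (m, n_v, 2)
        grad_omega_v = grad_f.mean(axis=1)                        # (m, 2)
        grad_a = -2.0 * (omega_u - omega_v) @ grad_omega_v        # (2,)
        a -= step * grad_a                            # a(t+1) = a(t) - dt * grad
    return a
```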

Fig. 10. Construction sequence. Sample panels show Frame 1 for (a) the original sequence and with added noise of (b) *σ* = .1, (c) *σ* = .2 and (d) *σ* = .3; trajectories of the track points are also shown. Red: no noise added, Green: *σ* = .1, Blue: *σ* = .2, Black: *σ* = .3.


| Sequence | Resolution | Object size | Total frames |
|---|---|---|---|
| Construction 1 | 320 × 240 | 15 × 15 | 240 |
| Construction 2 | 320 × 240 | 10 × 15 | 240 |
| Pool player | 352 × 240 | 40 × 40 | 90 |
| Fish | 320 × 240 | 30 × 30 | 309 |
| Jogging (1st row) | 352 × 288 | 25 × 60 | 303 |
| Jogging (2nd row) | 352 × 288 | 30 × 70 | 111 |

Table 1. Tracking sequences.

#### **6.3 Results**


The tracker was applied to a collection of video sequences (Table 1). The pixel vectors are constructed from the color values and the spatial values. The bandwidths used in the Gaussian kernel are $\sigma_F = 60$ for the color channels and $\sigma_s = 4$ for the spatial coordinates. The number of eigenvectors, *m*, retained for the density estimation was chosen following Girolami (2002). In particular, given that the error associated with eigenvector *k* is

$$\epsilon^k = (\omega^k)^2 = \left\{ \frac{1}{n} \sum_{i=1}^n f^k(u_i) \right\}^2, \tag{16}$$

the eigenvectors satisfying the following inequality were retained,

$$\left\{\frac{1}{n}\sum\_{i=1}^{n}f^{k}(u\_{i})\right\}^{2} > \frac{1}{1+n}\left\{\frac{1}{n}\sum\_{i=1}^{n}(f^{k}(u\_{i}))^{2}\right\}.\tag{17}$$

In practice, about 25 of the top eigenvectors were kept, i.e., *m* = 25. The tracker was implemented in Matlab on an Intel Core2 1.86 GHz processor with 2 GB RAM. The run time of the proposed tracker was about 0.5-1 frames/sec, depending upon the object size.
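As a minimal illustration of this retention rule (Equations (16)-(17)), assuming the eigenfunction responses $f^k(u_i)$ have already been evaluated on the model samples:

```python
import numpy as np

def retained_eigenvectors(F_model):
    """Return the indices k for which the inequality of Equation (17) holds.

    F_model[k, i] holds f^k(u_i) on the n model samples: an eigenvector is
    kept when its squared mean response exceeds 1/(1 + n) times its mean
    squared response.
    """
    n = F_model.shape[1]
    lhs = F_model.mean(axis=1) ** 2                 # {1/n sum_i f^k(u_i)}^2
    rhs = (F_model ** 2).mean(axis=1) / (1.0 + n)   # (1/(1+n)) 1/n sum_i f^k(u_i)^2
    return np.flatnonzero(lhs > rhs)
```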

In all the experiments, we consider translational motion, and the initial size and location of the target objects are chosen manually. Figure 10 shows results of tracking two people under different levels of Gaussian noise. The Matlab command imnoise was used to add zero-mean Gaussian noise of *σ* = .1, .2 and .3. Sample frames are shown in Figures 10(b), 10(c) and 10(e), together with the trajectories of the track points. The tracker was able to maintain track in all cases, whereas the mean shift tracker (Comaniciu et al. (2003)) lost track within a few frames at noise level *σ* = .1.

Figure 11 shows the result of tracking the face of a pool player. The proposed method was able to track the face in 100% of the frames at the different noise levels. The covariance tracker Porikli et al. (2006) detected the face correctly in 47.7% of the frames in the case of no model update (no-noise case), and the mean shift tracker Comaniciu et al. (2003) lost track at noise level *σ* = .1.

Figure 12 shows tracking results for a fish sequence, which contains noise, background clutter and changes in fish size. The jogging sequence (Figure 13) was tracked in conjunction with Kalman filtering (Kalman (1960)), which allowed the tracker to handle short-term total occlusions.


Fig. 11. Face sequence. Montages of extracted results from 90 consecutive frames for different noise levels. (a) Sample frame. (b) No noise. (c) Noise *σ* = .1. (d) Noise *σ* = .2. In (c) and (d), noise is shown in only two columns for better visualization.

Fig. 12. Fish sequence.

Fig. 13. Jogging sequence.

**7. Conclusion**

This chapter presented a novel density comparison method, given two sets of points sampled from two distributions. The method does not require explicit density estimation as an intermediate step. Instead, it works directly on the data points to compute the similarity measure. The proposed similarity measure is robust to noise and outliers. Possible applications of the proposed density comparison method in computer vision are visual tracking, segmentation, image registration, and stereo registration. We used the technique for visual tracking and provided a variational localization procedure.
