**2. Classical visual tracking**

The purpose of this section is to give an overview of classical visual tracking. Since it is a popular component of many methods, background subtraction will be surveyed first. Next, the focus will shift to the probabilistic frameworks underlying the Kalman and particle filters. This will be followed by a presentation of an effective application-specific method: the mean shift tracker.


#### **2.1 Background subtraction**

An important first step in many visual tracking systems is the extraction of regions of interest (e.g., those containing objects) from the rest of the scene. These regions are collectively termed the *foreground*, and the technique of *background subtraction* aims to segment it from the background (i.e., the rest of the frame). Once the foreground has been identified, the task of feature extraction becomes much easier due to the resulting decrease in data.

#### **2.1.1 Hypothesis testing formulation**

When dealing with digital images, one can pose the problem of background subtraction as a hypothesis test (Poor, 1994; Sankaranarayanan et al., 2008) for each pixel in the image. The null hypothesis (*H*0) is that a pixel belongs to the background, while the alternate hypothesis (*H*1) is that it belongs to the foreground. Let *p* denote the measurement observed at an arbitrary pixel. The form of *p* varies with the sensing modality; however, its most common forms are that of a scalar (e.g., light intensity in a grayscale image) or a three-vector (e.g., a color triple in a color image). Whatever they physically represent, let *FB* denote the probability distribution over the possible values of *p* when the pixel belongs to the background, and *FT* the distribution for pixels in the foreground. The hypothesis test formulation of background subtraction can then be written as:

$$\begin{aligned} H_0: \quad &p \sim F_B \\ H_1: \quad &p \sim F_T \end{aligned} \tag{2.1}$$

The optimal Bayes decision rule for (2.1) is given by:

$$\frac{f\_B(p)}{f\_T(p)} \overset{H\_0}{\underset{H\_1}{\gtrless}} \tau \tag{2.2}$$

where *fB*(*p*) and *fT*(*p*) denote the densities corresponding to *FB* and *FT* respectively, and *τ* is a threshold determined by the Bayes risk. It is often the case, however, that very little is known about the foreground, and thus the form of *FT*. One way of handling this is to assume *FT* to be the uniform distribution over the possible values of *p*. In this case, the above reduces to:

$$f_B(p) \overset{H_0}{\underset{H_1}{\gtrless}} \theta \tag{2.3}$$

where *θ* is dependent on *τ* and the range of *p*.

In practice, the optimum value of *θ* is typically unknown. Therefore, *θ* is often chosen in an *ad-hoc* fashion such that the decision rule gives pleasing results for the data of interest.

#### **2.1.2 A simple background model**

It will now be useful to introduce some notation to handle the temporal and spatial dimensions intrinsic to video data. Let $p_i^t$ denote the value of the $i$th pixel in the $t$th frame. Further, let $B_i^t$ parametrize the corresponding background distribution, denoted $F_{B,i,t}$, which may vary with respect to both time and space. In order to select a good hypothesis test, the focus of the background subtraction problem is on how to determine $B_i^t$ from the available data.


An intuitive, albeit naive, approach to this problem is to presume a background model that is static with respect to time. A common form of this assumption is that $F_{B,i,t}$ is Gaussian with the same covariance for all $i$. Such a distribution is parametrized only by its mean, and let $\mu_i$ specify this value. Substituting the Gaussian density function for $f_B(p)$ in (2.3) yields the following decision rule:

$$\|p_i^t - \mu_i\|_2 \overset{H_0}{\underset{H_1}{\lessgtr}} \eta \tag{2.4}$$

for some threshold $\eta$ dependent on $\theta$ and the covariance. In essence, the above rule amounts to a simple thresholding of the background likelihood function evaluated at the pixel value of interest. This is an intuitive way to perform background subtraction in that if the difference between the background $\mu_i$ and the observation $p_i^t$ is high enough, the pixel is classified as belonging to the foreground. Further, this method is computationally advantageous in that it simply requires storing a background image, $\mu_i$ for all $i$, and thresholding the difference between it and a test image. An example of this method is shown in Figure 1.
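
To see where $\eta$ comes from, take the isotropic case $\Sigma = \sigma^2 \mathbf{I}$ (an assumption made here purely for concreteness), with $d$ the dimension of $p$; taking logarithms in (2.3) gives

$$\frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|p - \mu_i\|_2^2}{2\sigma^2}\right) \overset{H_0}{\underset{H_1}{\gtrless}} \theta \iff \|p - \mu_i\|_2 \overset{H_0}{\underset{H_1}{\lessgtr}} \sigma \sqrt{-2\ln\left((2\pi\sigma^2)^{d/2}\,\theta\right)} = \eta \ ,$$

which is exactly (2.4) with the dependence of $\eta$ on $\theta$ and the covariance made explicit.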

Fig. 1. Background subtraction results for the static unimodal Gaussian model. Left: static background image. Middle: image with human. Right: background subtraction results using the method in (2.4)
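
In code, the rule in (2.4) is a one-line threshold on an image difference. The following is a minimal NumPy sketch under the assumptions above; the function name, the choice of `eta`, and the commented usage are illustrative, not from the original.

```python
import numpy as np

def static_background_mask(frame, background, eta):
    """Per-pixel decision rule (2.4): foreground where ||p - mu|| > eta.

    frame, background: 2-D grayscale arrays (test image and mu).
    eta: ad-hoc threshold on the absolute difference.
    Returns a boolean mask, True where H1 (foreground) is accepted.
    """
    diff = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    return diff > eta

# Hypothetical usage: estimate mu as the mean of N object-free frames.
# mu = np.stack(empty_frames).mean(axis=0)
# mask = static_background_mask(test_frame, mu, eta=30.0)
```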

#### **2.1.3 Dynamic background modeling**

The static approach outlined above is simple, but suffers from the inability to cope with a dynamic background. Such a background is common in video due to illumination shifts, camera and object motion, and other changes in the environment. For example, a tree in the background may sway in the breeze, causing pixel measurements to change significantly from one frame to the next (e.g., tree to sky). However, each such shift should not cause the pixel to be classified as foreground, which is what will occur under the unimodal Gaussian model. A solution to this problem is to use *kernel density estimation (KDE)* (Elgammal et al., 2002; Stauffer & Grimson, 1999) to estimate $f_{B,i,t}$ from past data, i.e.,

$$f\_{B,i,t}(p) = \frac{1}{N} \sum\_{j=t-N}^{t-1} K\_j(p) \tag{2.5}$$

where $K_j$ is a kernel density function dependent on the observation $p_i^j$. For example, $K_j$ may be defined as a Gaussian with fixed covariance and mean $p_i^j$. Using this definition, $B_i^t$ can be thought of as the pixel history $\{p_i^j\}_{j=t-N}^{t-1}$, and $F_{B,i,t}$ becomes a mixture of Gaussians. This method is also adaptive to temporally recent changes in the background, as only the previous $N$ observations are used in the density estimate.
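
As a sketch of how (2.5) is used in practice, the snippet below evaluates the KDE at a new pixel value with Gaussian kernels of fixed bandwidth and thresholds it as in (2.3). Scalar (grayscale) pixels, the bandwidth, and all names are assumptions for illustration; a real system would vectorize over all pixels.

```python
import numpy as np

def kde_background_likelihood(p, history, sigma):
    """Evaluate f_{B,i,t}(p) as in (2.5) with Gaussian kernels K_j.

    p: scalar value of pixel i in the current frame.
    history: the previous N values {p_i^j} at this pixel.
    sigma: fixed kernel bandwidth (standard deviation).
    """
    z = (p - np.asarray(history, dtype=float)) / sigma
    kernels = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels.mean()  # (1/N) * sum_j K_j(p)

# Decision rule (2.3): keep the pixel in the background if the
# likelihood exceeds an ad-hoc threshold theta.
# is_background = kde_background_likelihood(p, history, sigma=10.0) > theta
```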

#### **2.2 Tracking**

In general, *tracking* is the sequential estimation of a random variable based on observations over which it exerts influence. In the field of video surveillance, this random variable represents certain physical qualities belonging to objects of interest. For example, Broida and Chellappa (Broida & Chellappa, 1986) characterize a two-dimensional object in the image plane via its center of mass and translational velocity. They also incorporate other quantities to capture shape, global scale, and rotational motion. The time sequential estimates of such quantities are referred to as *tracks*.

To facilitate subsequent discussion, it is useful to consider the discrete time state space representation of the overall system that encompasses object motion and observation. The *state* of the system represents the unknown values of interest (e.g., object position), and in this section it will be denoted by a *state vector*, **x***t*, whose components correspond to these quantities. Observations of the system will be denoted by **y***t*, and are obtained via a mapping from the image to the observation space. This process is referred to as *feature extraction*, which will not be the focus of this chapter. Instead, it is assumed that observations are provided to the tracker with some specified probabilistic relationship between observation and state. Given the complicated nature of feature extraction, it is often the case that this relationship is heuristically selected based on some intuition regarding the feature extraction process.

In the context of the above discussion, the goal of a tracker is to provide sequential estimates of $\mathbf{x}_t$ using the observations $(\mathbf{y}_0, \dots, \mathbf{y}_t)$. In the following sections, a few prominent methods by which this is done will be considered.

#### **2.2.1 Kalman filtering**

The Kalman filter is a recursive tracking technique that is widely popular due to its computational efficiency and ease of implementation. Under specific system assumptions, it is able to provide a state estimate that is optimal according to a few popular metrics. This section will outline these assumptions and detail the Kalman filtering method that is used to compute the sequential state estimates.

Specifically, the assumptions that yield optimality are that the physical process governing the behavior of the state should be linear and affected by additive white Gaussian *process noise*, **w***t*, i.e. (Anderson & Moore, 1979),

$$\mathbf{x}\_{t+1} = \mathbf{F}\_t \mathbf{x}\_t + \mathbf{w}\_t \tag{2.6}$$

$$\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t), \quad \mathbb{E}\left[\mathbf{w}_k \mathbf{w}_l^T\right] = \mathbf{Q}_k \delta_{kl} \ ,$$

where *δkl* is equal to one when *k* = *l*, and is zero otherwise. The process noise allows for the model to remain valid even when the relationship between **x***t*+<sup>1</sup> and **x***<sup>t</sup>* is not completely captured by **F***t*.
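
As a concrete instance, visual trackers often use a constant-velocity model in which the state stacks image-plane position and velocity. The sketch below builds $\mathbf{F}_t$ and $\mathbf{Q}_t$ for that case; the fixed frame interval `dt` and the white-noise-acceleration form of $\mathbf{Q}_t$ are assumptions chosen here for illustration.

```python
import numpy as np

def constant_velocity_model(dt, q):
    """F and Q for (2.6) with state x = [px, py, vx, vy].

    dt: time between frames; q: process-noise intensity.
    Q uses the white-noise-acceleration model, where random
    accelerations integrate into position and velocity.
    """
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    G = np.array([[0.5 * dt**2, 0.0],
                  [0.0, 0.5 * dt**2],
                  [dt, 0.0],
                  [0.0, dt]])
    Q = q * (G @ G.T)
    return F, Q
```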


The required relationship between $\mathbf{y}_t$ and $\mathbf{x}_t$ is specified by:

$$\mathbf{y}\_t = \mathbf{H}\_t^T \mathbf{x}\_t + \mathbf{v}\_t \tag{2.7}$$

$$\mathbf{v}\_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R}\_t), \; \mathbb{E}\left[\mathbf{v}\_k \mathbf{v}\_l^T\right] = \mathbf{R}\_k \delta\_{kl} \quad .$$

Notice that, just as in the state model, the relationship between the observation and the state is assumed to be linear and affected by white Gaussian noise $\mathbf{v}_t$. This is referred to as *measurement noise*, and is assumed to be independent of $\{\mathbf{w}_t\}_{t=0}^{\infty}$.

With the above assumptions, the goal of the Kalman filter is to compute the best estimate of $\mathbf{x}_t$ from the observations $(\mathbf{y}_0, \dots, \mathbf{y}_t)$. What is meant by "best" can vary from application to application, but common criteria yield the *maximum a posteriori (MAP)* and *minimum mean squared error (MMSE)* estimators. Regardless of the estimator chosen, the value it yields can be computed using the posterior density $p(\mathbf{x}_t|\mathbf{y}_0, \dots, \mathbf{y}_t)$. For example, the MMSE estimate is the mean of this density and the MAP estimate is the value of $\mathbf{x}_t$ that maximizes it.

Under the assumptions made when specifying the state and observation equations, the MMSE and MAP estimates are identical. Since successive estimates can be calculated recursively, the Kalman filter provides this estimate without having to re-compute $p(\mathbf{x}_t|\mathbf{y}_0, \dots, \mathbf{y}_t)$ each time a new observation is received. This benefit requires the additional assumption that $\mathbf{x}_0 \sim \mathcal{N}(\bar{\mathbf{x}}_0, \mathbf{P}_0)$, which is equivalent to assuming $\mathbf{x}_0$ and $\mathbf{y}_0$ to be jointly Gaussian, i.e.,

$$
\begin{bmatrix} \mathbf{x}_0 \\ \mathbf{y}_0 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \bar{\mathbf{x}}_0 \\ \mathbf{H}_0^T \bar{\mathbf{x}}_0 \end{bmatrix}, \begin{bmatrix} \mathbf{P}_0 & \mathbf{P}_0 \mathbf{H}_0 \\ \mathbf{H}_0^T \mathbf{P}_0 & \mathbf{H}_0^T \mathbf{P}_0 \mathbf{H}_0 + \mathbf{R}_0 \end{bmatrix} \right) \ , \tag{2.8}
$$

which yields

$$\mathbf{x}\_{0}|\mathbf{y}\_{0} \sim \mathcal{N}\left(\hat{\mathbf{x}}\_{0|0}, \boldsymbol{\Sigma}\_{0|0}\right) \tag{2.9}$$

$$\hat{\mathbf{x}}_{0|0} = \bar{\mathbf{x}}_0 + \mathbf{P}_0 \mathbf{H}_0 (\mathbf{H}_0^T \mathbf{P}_0 \mathbf{H}_0 + \mathbf{R}_0)^{-1} (\mathbf{y}_0 - \mathbf{H}_0^T \bar{\mathbf{x}}_0)$$

$$\boldsymbol{\Sigma}\_{0|0} = \mathbf{P}\_{0} - \mathbf{P}\_{0}\mathbf{H}\_{0}(\mathbf{H}\_{0}^{T}\mathbf{P}\_{0}\mathbf{H}\_{0} + \mathbf{R}\_{0})^{-1}\mathbf{H}\_{0}^{T}\mathbf{P}\_{0} \ . \tag{2.10}$$

Since $\mathbf{x}_0|\mathbf{y}_0$ is Gaussian, both its MMSE and MAP estimates are given by the mean of this distribution, i.e., $\hat{\mathbf{x}}_{0|0}$. The subscript indicates that this is the estimate of $\mathbf{x}_0$ given observations up to time 0.

From this starting point, the Kalman filter calculates subsequent estimates ($\hat{\mathbf{x}}_{t|t}$ in general) using a two-step procedure. First, it can be seen that $\mathbf{x}_{t+1}|\mathbf{y}_{0:t}$ is also Gaussian, with mean and covariance given by

$$\begin{aligned} \hat{\mathbf{x}}_{t+1|t} &= \mathbf{F}_t \hat{\mathbf{x}}_{t|t} \\ \boldsymbol{\Sigma}_{t+1|t} &= \mathbf{F}_t \boldsymbol{\Sigma}_{t|t} \mathbf{F}_t^T + \mathbf{Q}_t \ . \end{aligned} \tag{2.11}$$

The above are known as the *time update equations*. Once $\mathbf{y}_{t+1}$ is observed, the second step of the Kalman filter is to adjust the prediction $\hat{\mathbf{x}}_{t+1|t}$ to one that incorporates the information provided by the new observation. This is done via the *measurement update equations*:


$$\hat{\mathbf{x}}\_{t+1|t+1} = \hat{\mathbf{x}}\_{t+1|t} + \Sigma\_{t+1|t} \mathbf{H}\_{t+1} (\mathbf{H}\_{t+1}^T \Sigma\_{t+1|t} \mathbf{H}\_{t+1} + \mathbf{R}\_{t+1})^{-1} (\mathbf{y}\_{t+1} - \mathbf{H}\_{t+1}^T \hat{\mathbf{x}}\_{t+1|t}) \tag{2.12}$$

$$\boldsymbol{\Sigma}\_{t+1|t+1} = \boldsymbol{\Sigma}\_{t+1|t} - \boldsymbol{\Sigma}\_{t+1|t} \mathbf{H}\_{t+1} (\mathbf{H}\_{t+1}^T \boldsymbol{\Sigma}\_{t+1|t} \mathbf{H}\_{t+1} + \mathbf{R}\_{t+1})^{-1} \mathbf{H}\_{t+1}^T \boldsymbol{\Sigma}\_{t+1|t} \ . \tag{2.13}$$

Using the above steps at each time instant, the Kalman filter provides optimal tracks $\{\hat{\mathbf{x}}_{t|t}\}_{t=0}^{\infty}$ that are calculated in a recursive and efficient manner. The optimality of the estimates comes at the cost of requiring the assumptions of linearity and Gaussianity in the state space formulation of the system. Even without the Gaussian assumptions, the filter is optimal among the class of linear filters.
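
The recursion in (2.11)-(2.13) maps directly to a few lines of linear algebra. Below is a minimal NumPy sketch of one predict/update cycle under the stated linear-Gaussian assumptions, following the chapter's convention that the observation matrix enters transposed as in (2.7); the function and variable names are illustrative.

```python
import numpy as np

def kalman_step(x_hat, Sigma, y_next, F, Q, H, R):
    """One Kalman cycle: time update (2.11), measurement update (2.12)-(2.13)."""
    # Time update: propagate the estimate and covariance through (2.6).
    x_pred = F @ x_hat
    S_pred = F @ Sigma @ F.T + Q
    # Measurement update: correct the prediction using y_{t+1}.
    gain = S_pred @ H @ np.linalg.inv(H.T @ S_pred @ H + R)
    x_new = x_pred + gain @ (y_next - H.T @ x_pred)
    S_new = S_pred - gain @ H.T @ S_pred
    return x_new, S_new

# Hypothetical usage with the constant-velocity model above, observing
# position only (H.T picks out the first two state components):
# H = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
```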

#### **2.2.2 Particle filtering**

Since it is able to operate in an unconstrained setting, the *particle filter* (Doucet et al., 2001; Isard & Blake, 1996) is a more general approach to sequential estimation. However, this expanded utility comes at the cost of high computational complexity. The particle filter is a *sequential Monte Carlo method*, using samples of the conditional distribution in order to approximate it and thus the desired estimates. There are many variations of the particle filter, but the focus of this section shall be on the so-called *bootstrap filter*.

Assume the system of interest behaves according to the following known densities:

$$p(\mathbf{x}_0) \ , \tag{2.14}$$

$$p(\mathbf{x}_t|\mathbf{x}_{t-1}), \quad t \ge 1 \ , \ \text{and} \tag{2.15}$$

$$p(\mathbf{y}_t|\mathbf{x}_t), \quad t \ge 1 \ . \tag{2.16}$$

Note that the more general specifications $p(\mathbf{x}_t|\mathbf{x}_{t-1})$ and $p(\mathbf{y}_t|\mathbf{x}_t)$ replace the linear, Gaussian descriptions of the system and observation behaviors necessary for the Kalman filter. In order to achieve the goal of tracking, it is necessary to have some information regarding $p(\mathbf{x}_{0:t}|\mathbf{y}_{1:t})$ (from which $p(\mathbf{x}_t|\mathbf{y}_{1:t})$ is apparent), where $\mathbf{x}_{0:t} = (\mathbf{x}_0, \dots, \mathbf{x}_t)$, and similarly for $\mathbf{y}_{1:t}$. Here, we depart from the previous notation and assume that the first observation is available at $t = 1$.

In a purely Bayesian sense, one could compute the conditional density as

$$p(\mathbf{x}\_{0:t}|\mathbf{y}\_{1:t}) = \frac{p(\mathbf{y}\_{1:t}|\mathbf{x}\_{0:t})p(\mathbf{x}\_{0:t})}{\int p(\mathbf{y}\_{1:t}|\mathbf{x}\_{0:t})p(\mathbf{x}\_{0:t})d\mathbf{x}\_{0:t}}\tag{2.17}$$

which leads to a recursive formula

$$p(\mathbf{x}\_{0:t}|\mathbf{y}\_{1:t}) = p(\mathbf{x}\_{0:t-1}|\mathbf{y}\_{1:t-1}) \frac{p(\mathbf{y}\_t|\mathbf{x}\_t)p(\mathbf{x}\_t|\mathbf{x}\_{t-1})}{p(\mathbf{y}\_t|\mathbf{y}\_{t-1})} \quad . \tag{2.18}$$

A similar type of recursion can be shown to exist for the marginal density $p(\mathbf{x}_t|\mathbf{y}_{1:t})$. While the above expressions seem simple, for general distributions in (2.14), (2.15), and (2.16), they often become prohibitively difficult to evaluate due to analytic and computational complexity.

The particle filter avoids the analytic difficulties above using Monte Carlo sampling. If $N$ i.i.d. *particles* (samples), $\{\mathbf{x}_{0:t}^{(i)}\}_{i=1}^N$, drawn from $p(\mathbf{x}_{0:t}|\mathbf{y}_{1:t})$ were available, one could approximate the density by placing a Dirac delta mass at the location of each sample, i.e.,

$$p(\mathbf{x}\_{0:t}|\mathbf{y}\_{1:t}) \approx P\_N(\mathbf{x}\_{0:t}|\mathbf{y}\_{1:t}) = \frac{1}{N} \sum\_{i=1}^N \delta(\mathbf{x}\_{0:t} - \mathbf{x}\_{0:t}^{(i)}) \quad . \tag{2.19}$$

It would then be straightforward to use $P_N$ to calculate an estimate of the random variable (i.e., a track). However, this method presents its own difficulty in that it is usually impractical to obtain the samples $\{\mathbf{x}_{0:t}^{(i)}\}_{i=1}^N$.

The bootstrap filter is based on a technique called *sequential importance sampling*, which is used to overcome the issue above. Samples are initially drawn from the known prior distribution $p(\mathbf{x}_0)$, from which it is straightforward to generate samples $\{\mathbf{x}_0^{(i)}\}_{i=1}^N$. Next, importance sampling occurs. First, a prediction step takes place, generating candidate samples $\{\tilde{\mathbf{x}}_1^{(i)}\}_{i=1}^N$ by drawing $\tilde{\mathbf{x}}_1^{(i)}$ from $p(\mathbf{x}_1|\mathbf{x}_0^{(i)})$ for each $i$. From here, *importance weights* $\tilde{w}_1^{(i)} = p(\mathbf{y}_1|\tilde{\mathbf{x}}_1^{(i)})$ are calculated based on the observation $\mathbf{y}_1$ and adjusted such that they are normalized (i.e., such that $\sum_i \tilde{w}_1^{(i)} = 1$). The filter then enters the selection step, where samples $\{\mathbf{x}_1^{(i)}\}_{i=1}^N$ are generated via draws from a discrete distribution over $\{\tilde{\mathbf{x}}_1^{(i)}\}_{i=1}^N$ with the probability for the $i$th element given by $\tilde{w}_1^{(i)}$. This process is then repeated to obtain $\{\mathbf{x}_2^{(i)}\}_{i=1}^N$ from $\{\mathbf{x}_1^{(i)}\}_{i=1}^N$ and $\mathbf{y}_2$, and so forth.

Due to the selection step, those candidate particles $\tilde{\mathbf{x}}_t^{(i)}$ for which $p(\mathbf{y}_t|\tilde{\mathbf{x}}_t^{(i)})$ is low will not propagate to the next stage. The samples that survive are those that explain the data well, and are thus concentrated in the most dense areas of $p(\mathbf{x}_t|\mathbf{y}_{1:t})$. Therefore, the computed value for common estimators such as the mean and mode will be good approximations of their actual values. Further, note that the candidate particles are drawn from $p(\mathbf{x}_t|\mathbf{x}_{t-1})$, which introduces process noise to prevent the particles from becoming too short-sighted.

Using the estimate calculated from the density approximation yielded by the particles $\{\mathbf{x}_t^{(i)}\}_{i=1}^N$, the particle filter is able to provide tracks that are optimal for a wide variety of criteria in a more general setting than that required by the Kalman filter. However, the validity of the track depends on the ability of the particles to sufficiently characterize the underlying density. Often, this may require a large number of particles, which can lead to a high computational cost.
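
The prediction, weighting, and selection steps above condense into a short loop. The sketch below is a minimal bootstrap-filter step; `propagate` and `likelihood` are hypothetical user-supplied functions standing in for the problem-specific densities (2.15) and (2.16).

```python
import numpy as np

def bootstrap_step(particles, y, propagate, likelihood, rng):
    """One bootstrap-filter cycle over an (N, d) array of particles.

    propagate(particles, rng): draws candidates from p(x_t | x_{t-1}).
    likelihood(y, candidates): evaluates p(y_t | x_t) per candidate.
    """
    # Prediction: candidate particles from the motion model.
    candidates = propagate(particles, rng)
    # Importance weighting: w ~ p(y_t | x~_t), normalized to sum to 1.
    w = likelihood(y, candidates)
    w = w / w.sum()
    # Selection: resample candidates with probabilities w.
    idx = rng.choice(len(candidates), size=len(candidates), p=w)
    survivors = candidates[idx]
    # A common track estimate: the sample mean (approximate MMSE).
    return survivors, survivors.mean(axis=0)

# rng = np.random.default_rng(0)  # hypothetical driver code
```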

#### **2.2.3 Mean shift tracking**

Unlike the Kalman and particle filters, the *mean shift* tracker (Comaniciu et al., 2003) is a procedure designed specifically for visual data. The feature employed, a spatially weighted color histogram, is computed directly from the input images. The estimate for the object position in the image plane is defined as the mode of a density over spatial locations, where this density is defined using a similarity measure between the histogram for an object model (i.e. a "template") and the histogram at a location of interest. The mean shift procedure (Comaniciu & Meer, 2002) is then used to find this mode.

In general, the mean shift procedure provides a way to perform gradient ascent on an unknown density using only samples generated by this density. It achieves this via selecting a specific method of density estimation and analytically deriving a data-dependent term that corresponds to the gradient of the estimate. This term is known as the mean shift, and it can be used as the step term in a mode-seeking gradient ascent procedure. Specifically, non-parametric KDE is employed, i.e.,

$$\hat{f}(\mathbf{x}) = \frac{1}{nh^d} \sum\_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right) \quad , \tag{2.20}$$

where the $d$-dimensional vector $\mathbf{x}$ represents the feature, $\hat{f}(\cdot)$ the estimated density, and $K(\cdot)$ a *kernel function*. The kernel function is assumed to be radially symmetric, i.e., $K(\mathbf{x}) = c_{k,d}\, k(\|\mathbf{x}\|^2)$ for some function $k(\cdot)$ and normalizing constant $c_{k,d}$. Using this in (2.20), $\hat{f}(\mathbf{x})$ becomes

$$\hat{f}_{h,K}(\mathbf{x}) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right) \ . \tag{2.21}$$

Ultimately, it is the gradient of this approximation, $\nabla \hat{f}_{h,K}$, that is of interest. Letting $g(\cdot) = -k'(\cdot)$, it is given by

$$\nabla \hat{f}_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{nh^{d+2}} \left[ \sum_{i=1}^{n} g\left( \left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^{n} \mathbf{x}_i\, g\left( \left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2 \right)}{\sum_{i=1}^{n} g\left( \left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2 \right)} - \mathbf{x} \right] \ . \tag{2.22}$$

Using $g(\cdot)$ to define a new kernel $G(\mathbf{x}) = c_{g,d}\, g(\|\mathbf{x}\|^2)$, (2.22) can be rewritten as

$$\nabla \hat{f}_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{h^2 c_{g,d}}\, \hat{f}_{h,G}(\mathbf{x})\, \mathbf{m}_{h,G}(\mathbf{x}) \ , \tag{2.23}$$

where $\mathbf{m}_{h,G}(\mathbf{x})$ denotes the mean shift:

$$\mathbf{m}\_{h,G}(\mathbf{x}) = \left[ \frac{\sum\_{i=1}^{n} \mathbf{x}\_{i} \mathbf{g}\left( ||\frac{\mathbf{x} - \mathbf{x}\_{i}}{h}||^{2} \right)}{\sum\_{i=1}^{n} \mathbf{g}\left( ||\frac{\mathbf{x} - \mathbf{x}\_{i}}{h}||^{2} \right)} - \mathbf{x} \right] \quad . \tag{2.24}$$

It can be seen from (2.23) that $\mathbf{m}_{h,G}(\mathbf{x})$ is proportional to $\nabla \hat{f}_{h,K}(\mathbf{x})$, and thus may be used as a step direction in a gradient ascent procedure to find a maximum of $\hat{f}_{h,K}(\mathbf{x})$ (i.e., a mode).
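
As a small illustration of the procedure, the sketch below iterates $\mathbf{x} \leftarrow \mathbf{x} + \mathbf{m}_{h,G}(\mathbf{x})$ using the Epanechnikov profile, for which $g = -k'$ is constant inside the bandwidth, so each step is just the average of the samples falling in the window. This is a generic mode seeker under those assumptions, not the tracker itself, and all names are illustrative.

```python
import numpy as np

def mean_shift_mode(samples, x0, h, n_iter=100, tol=1e-6):
    """Seek a mode of the KDE (2.21) via the mean shift step (2.24).

    samples: (n, d) array drawn from the unknown density.
    x0: starting location; h: bandwidth.
    Uses the Epanechnikov profile k(u) = 1 - u (u <= 1), so g is
    uniform on the window and the step is a plain sample average.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum(((samples - x) / h) ** 2, axis=1)
        window = samples[d2 <= 1.0]      # samples where g(.) > 0
        if len(window) == 0:
            break                        # empty window: no defined step
        shift = window.mean(axis=0) - x  # m_{h,G}(x) from (2.24)
        x = x + shift
        if np.linalg.norm(shift) < tol:  # converged to a mode
            break
    return x
```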

(Comaniciu et al., 2003) utilize the above procedure when tracking objects in the image plane. The selected feature is a spatially weighted color histogram computed over a normalized window of finite spatial support. The spatial weighting is defined by an isotropic kernel $k(\cdot)$, and the object model is given by an $m$-bin histogram $\hat{\mathbf{q}} = \{\hat{q}_u\}_{u=1}^m$, where

$$\hat{q}_u = C \sum_{i=1}^{n} k(\|\mathbf{x}_i^*\|^2)\, \delta\left[ b(\mathbf{x}_i^*) - u \right] \ . \tag{2.25}$$

$\mathbf{x}_i^*$ denotes the spatial location of the $i$th pixel in the $n$-pixel window containing the object model, assuming the center of the window to be located at $\mathbf{0}$. $\delta\left[b(\mathbf{x}_i^*) - u\right]$ is 1 when the pixel value at $\mathbf{x}_i^*$ falls into the $u$th bin of the histogram, and 0 otherwise. Finally, $C$ is a normalizing constant to ensure that $\hat{\mathbf{q}}$ is a true histogram.

An object candidate feature located at position $\mathbf{y}$ is denoted by $\hat{\mathbf{p}}(\mathbf{y})$, and is calculated in a manner similar to $\hat{\mathbf{q}}$, except $k(\|\mathbf{x}_i^*\|^2)$ is replaced by $k(\|\mathbf{y} - \mathbf{x}_i\|^2)$ to account for the new window location.

To capture a notion of similarity between $\hat{\mathbf{p}}(\mathbf{y})$ and $\hat{\mathbf{q}}$, the Bhattacharyya coefficient is used, i.e.,

$$d(\mathbf{y}) = \sqrt{1 - \hat{\rho}(\mathbf{y})} \ , \tag{2.26}$$

where $\hat{\rho}(\mathbf{y}) = \sum_{u=1}^{m} \sqrt{\hat{p}_u(\mathbf{y})\, \hat{q}_u}$ is the Bhattacharyya coefficient.

An approximation of $\hat{\rho}(\mathbf{y})$ is provided by

$$\hat{\rho}(\mathbf{y}) \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{\hat{p}_u(\mathbf{y}_0)\, \hat{q}_u} + \frac{C_h}{2} \sum_{i=1}^{n} w_i\, k\left( \left\|\frac{\mathbf{y} - \mathbf{x}_i}{h}\right\|^2 \right) \ . \tag{2.27}$$

Above, $\mathbf{y}_0$ represents an initial location provided by the track from the previous frame. The weights $\{w_i\}_{i=1}^n$ are calculated as a function of $\hat{\mathbf{q}}$, $\hat{\mathbf{p}}(\mathbf{y}_0)$, and $b(\mathbf{x}_i)$. To minimize the distance in (2.26), the second term of (2.27) should be maximized with respect to $\mathbf{y}$. This term can be interpreted as a nonparametric weighted KDE with kernel function $k(\cdot)$. Thus, the mean shift procedure can be used to iterate over $\mathbf{y}$ and find the value that minimizes $d(\mathbf{y})$. The result is then taken to be the location estimate (track) for the current frame.
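
To make the feature concrete, the sketch below computes a spatially weighted histogram in the spirit of (2.25) and the Bhattacharyya coefficient from (2.26). A single-channel patch, the Epanechnikov spatial profile, and a uniform quantizer for $b(\cdot)$ are all assumptions made for illustration; the published tracker uses color histograms.

```python
import numpy as np

def weighted_histogram(patch, m):
    """Spatially weighted m-bin histogram, as in (2.25), one channel.

    patch: square grayscale array (values in [0, 256)) covering the window.
    Uses the Epanechnikov profile k(u) = 1 - u for the spatial weights
    and a uniform quantizer b(.) mapping intensity to a bin index.
    """
    n = patch.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    # Normalized coordinates x_i^* with the window center at 0.
    coords = (np.stack([ys, xs], axis=-1) - (n - 1) / 2.0) / ((n - 1) / 2.0)
    k = np.clip(1.0 - np.sum(coords**2, axis=-1), 0.0, None)
    bins = (patch.astype(int) * m) // 256          # b(x_i^*)
    q = np.bincount(bins.ravel(), weights=k.ravel(), minlength=m)
    return q / q.sum()                             # C makes it a histogram

def bhattacharyya(p_hat, q_hat):
    """rho_hat used in (2.26), for two m-bin histograms."""
    return float(np.sum(np.sqrt(p_hat * q_hat)))
```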

#### **2.3 The data challenge**

Given the above background, it can be seen how large amounts of data can be of detriment to tracking. Background subtraction techniques may require complicated density estimates for each pixel, which become burdensome in the presence of high-resolution imagery. The filtering methods presented above are not specific to the amount of data, but more of it leads to greater computational complexity when performing the estimation. Likewise, higher data dimensionality burdens mean shift tracking, specifically during the required density estimation and mode search. This extra data could be due to higher sensor resolution or perhaps the presence of multiple sensors (Sankaranarayanan et al., 2008; Sankaranarayanan & Chellappa, 2008). Therefore, new tracking strategies must be developed. The hope for finding such strategies comes from the fact that there is a substantial difference in the amount of data collected by these systems compared to the quantity of information that is ultimately of use. Compressive sensing provides a new perspective that radically changes the sensing process with the above observation in mind.

#### **3. Compressive sensing**

Compressive sensing is an emerging theory that allows for a certain class of discrete signals to be adequately sensed using far fewer measurements than the dimension of the ambient space in which they reside. By "adequately sensed," it is meant that the signal of interest can be accurately inferred from the measurements collected during the sensing process. In the context of imaging, consider an unknown $n \times n$ grayscale image $\mathbf{F}$, i.e., $\mathbf{F} \in \mathbb{R}^{n \times n}$. A traditional camera measures $\mathbf{F}$ using an $n \times n$ array of photodetectors, where the measurement collected at each detector corresponds to a single pixel value in $\mathbf{F}$. If $\mathbf{F}$ is vectorized as $\mathbf{x} \in \mathbb{R}^N$ ($N = n^2$), then the imaging strategy described above amounts to (in the noiseless case) $\hat{\mathbf{x}} = \mathbf{y} = \mathbf{I}\mathbf{x}$ (Romberg, 2008), where $\hat{\mathbf{x}}$ is the inferred value of $\mathbf{x}$ using the measurements $\mathbf{y}$. Each component of $\mathbf{y}$ (i.e., a measurement) corresponds to a single component of $\mathbf{x}$, and this relationship is captured by representing the sensing process as the identity matrix $\mathbf{I}$. Since $\mathbf{x}$ is the quantity of interest, estimating it from $\mathbf{y}$ also amounts to a simple identity mapping, i.e., $\hat{\mathbf{x}}(\mathbf{y}) = \mathbf{y}$. However, both the measurement and estimation process can change, giving rise to interesting and useful signal acquisition methodologies.

For practical purposes, it is often the case that **x** is represented using far fewer measurements than the *N* collected above. For example, using *transform coding* methods (e.g., JPEG 2000), **x** can usually be closely approximated by specifying very few values compared to *N* (Bruckstein et al., 2009). This is accomplished by obtaining $\mathbf{b} = \mathbf{B}\mathbf{x}$ for some orthonormal basis **B** (e.g., the wavelet basis), and setting all but the *k* largest components of **b** to zero. If this new vector is denoted $\mathbf{b}_k$, then the transform coding approximation of **x** is given by $\hat{\mathbf{x}} = \mathbf{B}^{-1}\mathbf{b}_k$. If $\|\mathbf{x} - \hat{\mathbf{x}}\|_2$ is small, then this approximation is a good one. Since **B** is orthonormal, this condition also requires that $\|\mathbf{b} - \mathbf{b}_k\|_2$ be small as well. If such is the case, **b** is said to be *k-sparse* (and **x** *k-sparse in* **B**), i.e., most of the energy in **b** is distributed among very few of its components. Thus, if the value of **x** is known, and **x** is *k*-sparse in **B**, a good approximation of **x** can be obtained from $\mathbf{b}_k$. Compression comes about since $\mathbf{b}_k$ (and thus **x**) can be specified using just 2*k* quantities instead of *N*: the values and locations of the *k* largest coefficients in **b**. However, extracting such information requires full knowledge of **x**, which necessitates *N* measurements using the traditional imaging system above. Thus, *N* data points must be collected when in essence all but 2*k* are thrown away. This is not completely unjustified, as one cannot hope to form $\mathbf{b}_k$ without knowing **b**. On the other hand, such a large disparity between the amount of data collected and the amount that is truly useful seems wasteful.
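As a concrete illustration, the following is a minimal sketch of *k*-term transform coding, assuming SciPy is available; the orthonormal DCT stands in for the basis **B** above (a wavelet basis would be used in JPEG 2000).

```python
# Sketch of k-term transform coding; the DCT is an assumption standing in
# for the orthonormal basis B of the text.
import numpy as np
from scipy.fft import dct, idct

def k_term_approximation(x, k):
    """Keep the k largest-magnitude DCT coefficients of x and invert."""
    b = dct(x, norm="ortho")             # b = Bx
    b_k = np.zeros_like(b)
    keep = np.argsort(np.abs(b))[-k:]    # locations of the k largest coefficients
    b_k[keep] = b[keep]
    return idct(b_k, norm="ortho")       # x_hat = B^{-1} b_k

# A smooth signal is nearly sparse in the DCT basis, so few terms suffice.
t = np.linspace(0.0, 1.0, 256)
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
x_hat = k_term_approximation(x, k=10)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```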

This glaring disparity is what CS seeks to address. Instead of collecting *N* measurements of **x**, the CS strategy is to collect *M*, where *M* << *N* and depends on *k*. As long as **x** is *k*-sparse in some basis and an appropriate decoding procedure is employed, these *M* values yield a good approximation of **x**. For example, let $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$ be the *measurement matrix* by which these values, $\mathbf{y} \in \mathbb{R}^M$, are obtained as $\mathbf{y} = \mathbf{\Phi}\mathbf{x}$. Further, assume **x** is *k*-sparse. It is possible to recover **x** from **y** if **Φ** has the *restricted isometry property (RIP)* of order 2*k* (Candès & Wakin, 2008), i.e., the smallest *δ* for which

$$(1 - \delta) \le \frac{\|\Phi \mathbf{x}\|\_2^2}{\|\mathbf{x}\|\_2^2} \le (1 + \delta) \tag{3.1}$$

holds for all 2*k*-sparse vectors is not too close to 1. An intuitive interpretation of this property is that it ensures that no nonzero 2*k*-sparse vector lies in Null(**Φ**). Since the difference of two distinct *k*-sparse vectors is 2*k*-sparse, this guarantees that a unique measurement **y** is generated for each *k*-sparse **x** even though **Φ** is underdetermined.

An example **Φ** that satisfies the above conditions is one for which entries are drawn from the Bernoulli distribution over the discrete set $\{-1/\sqrt{N}, +1/\sqrt{N}\}$ and each realization is equally likely (Baraniuk, 2007). If, in addition, *M* is selected such that $M > Ck \log N$ for a specific constant *C*, it is overwhelmingly likely that **Φ** will be 2*k*-RIP. There are other constructions that provide similar guarantees given slightly different bounds on *M*, but the concept remains unchanged: if *M* is "large enough," **Φ** will exhibit the RIP with overwhelming probability.
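The sketch below constructs such a random sign matrix and empirically examines the ratio in (3.1) on randomly drawn 2*k*-sparse vectors. This is only a sanity check, not a certificate of the RIP (verifying the RIP exactly is combinatorial); the entries here are scaled by $1/\sqrt{M}$ so that the ratio concentrates near one, which differs from the $1/\sqrt{N}$ convention above only by a fixed scaling.

```python
# Empirical look at the near-isometry (3.1) for a random Bernoulli matrix.
# Samples random 2k-sparse vectors rather than certifying the RIP.
import numpy as np

rng = np.random.default_rng(0)
N, k = 512, 8
M = int(4 * k * np.log(N))             # M > C k log N with a stand-in C = 4
Phi = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)

ratios = []
for _ in range(1000):
    x = np.zeros(N)
    support = rng.choice(N, size=2 * k, replace=False)
    x[support] = rng.standard_normal(2 * k)
    ratios.append(np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2)

# If Phi is 2k-RIP with small delta, these ratios lie in [1 - delta, 1 + delta].
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```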

Given such a matrix, and considering that this implies a unique **y** for each *k*-sparse **x**, an estimate $\hat{\mathbf{x}}$ of **x** is ideally calculated from **y** as

$$\hat{\mathbf{x}} = \min\_{\mathbf{z} \in \mathbb{R}^N} \|\mathbf{z}\|\_0 \quad \text{subject to} \quad \Phi \mathbf{z} = \mathbf{y} \quad , \tag{3.2}$$

where $\|\cdot\|_0$, referred to as the $\ell_0$ "norm," counts the number of nonzero entries in **z**. Thus, (3.2) seeks the sparsest vector that explains the observation **y**. In practice, (3.2) is not very useful since the program it specifies has combinatorial complexity. However, this difficulty is also mitigated by the special construction of **Φ** and the fact that **x** is *k*-sparse. Under these conditions, the solution of the following program yields the same results as (3.2) with overwhelming probability:

$$\hat{\mathbf{x}} = \min_{\mathbf{z} \in \mathbb{R}^N} \|\mathbf{z}\|_1 \quad \text{subject to} \quad \mathbf{\Phi}\mathbf{z} = \mathbf{y} \quad . \tag{3.3}$$

Thus, by modifying the sensor to use **Φ** and the decoder to use (3.3), *M* << *N* measurements of a *k*-sparse **x** suffice to retain the ability to reconstruct it.
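Since (3.3) can be posed as a linear program by splitting **z** into nonnegative parts, a small decoder can be sketched with an off-the-shelf LP solver. The following is a minimal illustration assuming SciPy; production CS decoders typically use specialized solvers instead.

```python
# A minimal l1 decoder for (3.3), posed as a linear program by writing
# z = u - v with u, v >= 0.
import numpy as np
from scipy.optimize import linprog

def l1_decode(Phi, y):
    M, N = Phi.shape
    c = np.ones(2 * N)                 # sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([Phi, -Phi])      # Phi u - Phi v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v

# Recover a k-sparse x from M << N random measurements.
rng = np.random.default_rng(1)
N, M, k = 256, 80, 5
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x = np.zeros(N)
x[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
x_hat = l1_decode(Phi, Phi @ x)
print("recovery error:", np.linalg.norm(x - x_hat))
```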

Sensors based on the above theory are beginning to emerge (Willett et al., 2011). One of the most notable is the single pixel camera (Duarte et al., 2008), where measurements specified by each row of **Φ** are sequentially computed in the optical domain via a digital micromirror device and a single photodiode. Many of the strategies discussed in the following section assume that the tracking system is such that these compressive sensors replace more traditional cameras.

#### **4. Compressive sensing in video surveillance**

Compressive sensing can help alleviate some of the challenges associated with performing classical tracking in the presence of overwhelming amounts of data. By replacing traditional cameras with compressive sensors or by making use of CS techniques in other areas of the process, the amount of data that the system must handle can be drastically reduced. However, this capability should not come at the cost of a significant decrease in tracking performance. This section will present a few methods for performing various tracking tasks that take advantage of CS in order to reduce the quantity of data that must be processed. Specifically, recent methods using CS to perform background subtraction, more general signal tracking, multi-view visual tracking, and particle filtering will be discussed.

#### **4.1 Compressive sensing for background subtraction**

One of the most intuitive applications of compressive sensing in visual tracking is the modification of background subtraction such that it is able to operate on compressive measurements. As mentioned in Section 2.1, background subtraction aims to segment the object-containing foreground from the uninteresting background. This process not only helps to localize objects, but also reduces the amount of data that must be processed at later stages of tracking. However, traditional background subtraction techniques require that the full image be available before the process can begin. Such a scenario is reminiscent of the problem that CS aims to address. Noting that the foreground signal (image) is sparse in the spatial domain, (Cevher et al., 2008) have presented a technique via which background subtraction can be performed on compressive measurements of a scene, resulting in a reduced data rate while simultaneously retaining the ability to reconstruct the foreground. More recently, (Warnell et al., 2012) have proposed a modification to this technique which adaptively adjusts the number of compressive measurements collected to the dynamic foreground sparsity typical of surveillance data.

Denote the images comprising a video sequence as $\{\mathbf{x}_t\}_{t=0}^{\infty}$, where $\mathbf{x}_t \in \mathbb{R}^N$ is the vectorized image captured at time *t*. Cevher et al. model each image as the sum of foreground and background components $\mathbf{f}_t$ and $\mathbf{b}_t$, respectively. That is,

$$\mathbf{x}_t = \mathbf{f}_t + \mathbf{b}_t \quad . \tag{4.1}$$

Assume $\mathbf{x}_t$ is sensed using $\mathbf{\Phi} \in \mathbb{C}^{M \times N}$ to obtain compressive measurements $\mathbf{y}_t = \mathbf{\Phi}\mathbf{x}_t$. If $\Delta(\mathbf{\Phi}, \mathbf{y})$ represents a CS decoding procedure such as (3.3), then the proposed method for estimating $\mathbf{f}_t$ from $\mathbf{y}_t$ is

$$\hat{\mathbf{f}}_t = \Delta(\mathbf{\Phi}, \mathbf{y}_t - \mathbf{y}_t^b) \quad , \tag{4.2}$$

where it is assumed that $\mathbf{y}_t^b = \mathbf{\Phi}\mathbf{b}_t$ is known via an estimation and update procedure.

To begin, $\mathbf{y}_0^b$ is initialized using a sequence of *N* compressively sensed background-only frames $\{\mathbf{y}_j^b\}_{j=1}^N$ that appear before the sequence of interest begins. These measurements are assumed to be realizations of a multivariate Gaussian random variable, and the maximum likelihood (ML) procedure is used to estimate its mean as $\mathbf{y}_0^b = \frac{1}{N}\sum_{j=1}^N \mathbf{y}_j^b$. This estimate is used as the known background for *t* = 0 in (4.2). Since the background typically changes over time, a method is proposed for updating the background estimate based on previous observations. Specifically, the following is proposed:

$$\mathbf{y}_{t+1}^b = \alpha(\mathbf{y}_t - \mathbf{\Phi}\Delta(\mathbf{\Phi}, \mathbf{y}_{t+1}^{ma})) + (1 - \alpha)\mathbf{y}_t^b \tag{4.3}$$

$$\mathbf{y}\_{t+1}^{ma} = \gamma \mathbf{y}\_t + (1 - \gamma) \mathbf{y}\_t^{ma} \quad , \tag{4.4}$$

where *α*, *γ* ∈ (0, 1) are learning rate parameters and $\mathbf{y}_{t+1}^{ma}$ is a moving average term. This method compensates for both gradual and sudden changes to the background. A block diagram of the proposed system is shown in Figure 2.

Fig. 2. Block diagram of the compressive sensing for background subtraction technique. Figure originally appears in (Cevher et al., 2008).
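A minimal sketch of one step of this scheme, following (4.2)-(4.4), is given below. The function `decode`, the learning rates, and the variable names are illustrative assumptions; `decode` could be any decoder Δ, such as the ℓ1 LP decoder sketched in Section 3.

```python
# One step of compressive background subtraction in the spirit of (4.2)-(4.4).
# `decode(Phi, y)` stands in for the CS decoding procedure Delta.
import numpy as np

def cs_background_step(Phi, y_t, y_b, y_ma, decode, alpha=0.05, gamma=0.1):
    f_hat = decode(Phi, y_t - y_b)                    # (4.2): foreground estimate
    y_ma_next = gamma * y_t + (1.0 - gamma) * y_ma    # (4.4): moving average
    y_b_next = alpha * (y_t - Phi @ decode(Phi, y_ma_next)) \
               + (1.0 - alpha) * y_b                  # (4.3): background update
    return f_hat, y_b_next, y_ma_next
```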

The above procedure assumes a fixed $\mathbf{\Phi} \in \mathbb{C}^{M \times N}$. Therefore, *M* compressive measurements of $\mathbf{x}_t$ are collected at time *t* regardless of its content. It is not hard to imagine that the number of significant components of $\mathbf{f}_t$, denoted $k_t$, might vary widely with *t*. For example, consider a scenario in which the foreground consists of a single object at $t = t_0$, but many more at $t = t_1$. Then $k_{t_1} > k_{t_0}$, and $M > Ck_{t_1} \log N$ implies that $\mathbf{x}_{t_0}$ has been oversampled, since only $M > Ck_{t_0} \log N$ measurements are necessary to obtain a good approximation of $\mathbf{f}_{t_0}$. Foregoing the ability to update the background, (Warnell et al., 2012) propose a modification to the above method for which the number of compressive measurements at each frame, $M_t$, can vary.

Such a scheme requires a different measurement matrix for each time instant, i.e., $\mathbf{\Phi}_t \in \mathbb{C}^{M_t \times N}$. To form $\mathbf{\Phi}_t$, one first constructs $\mathbf{\Phi} \in \mathbb{C}^{N \times N}$ via standard CS measurement matrix construction techniques. $\mathbf{\Phi}_t$ is then formed by selecting only the first $M_t$ rows of $\mathbf{\Phi}$ and column-normalizing the result. The fixed background estimate, $\mathbf{y}^b$, is estimated from a set of measurements of the background only obtained via $\mathbf{\Phi}$. In order to use this estimate at each time instant *t*, $\mathbf{y}_t^b$ is formed by retaining only the first $M_t$ components of $\mathbf{y}^b$.
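This truncation step amounts to simple row and component selection, as in the following sketch; the function names are hypothetical.

```python
# Sketch of the ARCS truncation step described above.
import numpy as np

def truncate_matrix(Phi_full, M_t):
    """First M_t rows of the fixed N x N matrix, column-normalized."""
    Phi_t = Phi_full[:M_t, :].copy()
    return Phi_t / np.linalg.norm(Phi_t, axis=0, keepdims=True)

def truncate_background(y_b_full, M_t):
    """First M_t components of the full-rate background measurements."""
    return y_b_full[:M_t]
```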

In parallel to $\mathbf{\Phi}_t$, the method also requires an extra set of compressive measurements via which the quality of the foreground estimate, $\hat{\mathbf{f}}_t = \Delta(\mathbf{\Phi}_t, \mathbf{y}_t - \mathbf{y}_t^b)$, is determined. These are obtained via a *cross validation* matrix $\mathbf{\Psi} \in \mathbb{C}^{r \times N}$, which is constructed in a manner similar to $\mathbf{\Phi}$. Here, *r* depends on the desired accuracy of the cross validation error estimate (given below), is negligible compared to *N*, and is constant for all *t*. In order to use the measurements $\mathbf{z}_t = \mathbf{\Psi}\mathbf{x}_t$, it is necessary to perform background subtraction in this domain via an estimate of the background, $\mathbf{z}^b$, which is obtained in a manner similar to $\mathbf{y}^b$ above.

The quality of $\hat{\mathbf{f}}_t$ depends on the relationship between $k_t$ and $M_t$. Using a technique operationally similar to cross validation, an estimate of $\|\mathbf{f}_t - \hat{\mathbf{f}}_t\|_2$, i.e., the error between the true foreground and the reconstruction provided by Δ at time *t*, is provided by $\|(\mathbf{z}_t - \mathbf{z}^b) - \mathbf{\Psi}\hat{\mathbf{f}}_t\|_2$. $M_{t+1}$ is set to be greater or less than $M_t$ depending on the hypothesis test

$$\|(\mathbf{z}_t - \mathbf{z}^b) - \mathbf{\Psi}\hat{\mathbf{f}}_t\|_2 \gtrless \tau_t \quad . \tag{4.5}$$

Here, $\tau_t$ is a quantity set based on the expected value of $\|\mathbf{f}_t - \hat{\mathbf{f}}_t\|_2$ assuming $M_t$ to be large enough compared to $k_t$. The overall algorithm is termed *adaptive rate compressive sensing (ARCS)*, and the performance of this method compared to a non-adaptive approach is shown in Figure 3.

Fig. 3. Comparison between ARCS and a non-adaptive method for a dataset consisting of vehicles moving in and out of the field of view. (a) Foreground sparsity estimates for each frame, including ground truth. (b) $\ell_2$ foreground reconstruction error. (c) Number of measurements required. Note the measurement savings provided by ARCS for most frames, and its ability to track the dynamic foreground sparsity. Figure originally appears in (Warnell et al., 2012).
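The rate adaptation driven by (4.5) might be sketched as follows; the additive step size and the bounds on $M_t$ are illustrative assumptions, not the exact rules of (Warnell et al., 2012).

```python
# Sketch of the ARCS measurement-rate update built around the test (4.5).
import numpy as np

def arcs_rate_update(z_t, z_b, Psi, f_hat_t, M_t, tau_t,
                     step=50, M_min=50, M_max=4000):
    cv_error = np.linalg.norm((z_t - z_b) - Psi @ f_hat_t)  # left side of (4.5)
    if cv_error > tau_t:
        return min(M_t + step, M_max)   # reconstruction too poor: sense more
    return max(M_t - step, M_min)       # reconstruction good: try sensing less
```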

Both techniques assume that the tracking system can only collect compressive measurements and provide a method by which foreground images can be reconstructed. These foreground images can then be used just as in classical tracking applications. Thus, CS has provided a means by which to reduce the up-front data costs associated with the system while retaining the information necessary to track.

#### **4.2 Kalman filtered compressive sensing**

A more general problem regarding signal tracking using compressive observations is considered in (Vaswani, 2008). The signal being tracked, $\{\mathbf{x}_t\}_{t=0}^{\infty}$, is assumed to be both sparse and have a slowly-changing sparsity pattern. Given these assumptions, if the support set of $\mathbf{x}_t$, denoted $T_t$, is known, the relationship between $\mathbf{x}_t$ and $\mathbf{y}_t$ can be written as:

$$\mathbf{y}_t = \mathbf{\Phi}_{T_t}(\mathbf{x}_t)_{T_t} + \mathbf{w}_t \quad . \tag{4.6}$$

Above, $\mathbf{\Phi}$ is the CS measurement matrix, and $\mathbf{\Phi}_{T_t}$ retains only those columns of $\mathbf{\Phi}$ whose indices lie in $T_t$. Likewise, $(\mathbf{x}_t)_{T_t}$ contains only those components corresponding to $T_t$. Finally, $\mathbf{w}_t$ is assumed to be zero mean Gaussian noise. If $\mathbf{x}_t$ is assumed to also follow the state model $\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{v}_t$ with $\mathbf{v}_t$ zero mean Gaussian noise, then the MMSE estimate of $\mathbf{x}_t$ from $\mathbf{y}_t$ can be computed using a Kalman filter instead of a CS decoder.

The above is only valid if $T_t$ is known, which is often not the case. This is handled by using the Kalman filter output to detect changes in $T_t$ and re-estimate it if necessary. The filter error, $\tilde{\mathbf{y}}_{t,f} = \mathbf{y}_t - \mathbf{\Phi}\hat{\mathbf{x}}_t$, is used to detect changes in the signal support via a likelihood ratio test given by

$$\tilde{\mathbf{y}}_{t,f}^{\prime} \mathbf{\Sigma} \tilde{\mathbf{y}}_{t,f} \gtrless \tau \quad , \tag{4.7}$$

where *τ* is a threshold and **Σ** is the filtering error covariance. If the term on the left hand side exceeds the threshold, then changes to the support set are found by applying a procedure based on the Dantzig selector. Once *Tt* has been re-estimated, **xˆ** is re-evaluated using this new support set.
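A simplified sketch of one such filtering step is given below, assuming identity state dynamics and reducing the support re-estimation to naive thresholding (the paper instead applies the Dantzig-selector based procedure); the test statistic here uses the innovation covariance as a stand-in for the quadratic form in (4.7).

```python
# Simplified sketch of Kalman filtered CS with state model x_t = x_{t-1} + v_t.
# The support re-estimation and the test statistic are crude stand-ins for
# the procedures of (Vaswani, 2008).
import numpy as np

def kfcs_step(Phi, y_t, x_hat, P, T, Q, R, tau, support_thresh=0.1):
    Phi_T = Phi[:, T]                               # observation matrix on support T
    P_pred = P + Q                                  # predict step (identity dynamics)
    S = Phi_T @ P_pred[np.ix_(T, T)] @ Phi_T.T + R  # innovation covariance
    K = P_pred[:, T] @ Phi_T.T @ np.linalg.inv(S)   # Kalman gain
    innov = y_t - Phi @ x_hat                       # filter error y~_{t,f}
    x_hat = x_hat + K @ innov                       # update state estimate
    P = P_pred - K @ Phi_T @ P_pred[T, :]           # update error covariance
    if innov @ np.linalg.inv(S) @ innov > tau:      # support change detected
        T = np.flatnonzero(np.abs(x_hat) > support_thresh)
    return x_hat, P, T
```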


The above algorithm is useful in surveillance scenarios when objects under observation are stationary or slowly-moving. Under such assumptions, this method is able to perform signal tracking with a low data rate and low computational complexity.

#### **4.3 Joint compressive video coding and analysis**

(Cossalter et al., 2010) consider a collection of methods via which systems utilizing compressive imaging devices can perform visual tracking. Of particular note is a method referred to as *joint compressive video coding and analysis*, via which the tracker output is used to improve the overall effectiveness of the system. Instrumental to this method is work from theoretical CS literature which proposes a weighted decoding procedure that iteratively determines the locations and values of the (nonzero) sparse vector coefficients. Modifying this decoder, the joint coding and analysis method utilizes the tracker estimate to directly influence the weights. The result is a foreground estimate of higher quality compared to one obtained via standard CS decoding techniques.

The weighted CS decoding procedure calculates the foreground estimate via

$$\hat{\mathbf{f}} = \min\_{\theta} \|\mathbf{W}\theta\|\_1 \quad \text{s.t.} \quad \|\mathbf{y}^f - \Phi\theta\|\_2 \le \sigma \quad , \tag{4.8}$$

where $\mathbf{y}^f = \mathbf{y} - \mathbf{y}^b$, **W** is a diagonal matrix with weights $[w(1) \ldots w(N)]$, and *σ* captures the expected measurement and quantization noise in $\mathbf{y}^f$. Ideally, the weights are selected according to

$$w(i) = \frac{1}{|f(i)| + \epsilon} \quad , \tag{4.9}$$

where $f(i)$ is the value of the $i$th coefficient in the true foreground image. Of course, these values are not known in advance, but the closer the weights are to their actual value, the more accurate $\hat{\mathbf{f}}$ becomes. The joint coding and analysis approach utilizes the tracker output in selecting appropriate values for these weights.
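The weighted decoder (4.8) can be sketched in the same LP form as Section 3 if, for simplicity, the noise ball $\|\mathbf{y}^f - \mathbf{\Phi}\theta\|_2 \le \sigma$ is replaced by an equality constraint; the σ-ball version requires a second-order cone solver. This substitution is an assumption made so a plain LP solver suffices, not the exact program of (Cossalter et al., 2010).

```python
# Sketch of weighted l1 decoding in the spirit of (4.8)-(4.9), with the
# equality-constrained simplification noted above.
import numpy as np
from scipy.optimize import linprog

def weighted_l1_decode(Phi, y_f, w):
    M, N = Phi.shape
    c = np.concatenate([w, w])             # ||W theta||_1 with theta = u - v
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y_f, bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]

def ideal_weights(f_true, eps=1e-3):
    """The oracle weights of (4.9); computable only when f is known."""
    return 1.0 / (np.abs(f_true) + eps)
```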

The actual task of tracking is accomplished using a particle filter similar to that presented in Section 2.2.2. The state vector for an object at time *t* is denoted by $\mathbf{z}_t = [\mathbf{c}_t \; \mathbf{s}_t \; \mathbf{u}_t]$, where $\mathbf{s}_t$ represents the size of the bounding box defined by the object appearance, $\mathbf{c}_t$ the centroid of this box, and $\mathbf{u}_t$ the object velocity in the image plane. A suitable kinematic motion model is utilized to describe the expected behavior of these quantities with respect to time, and foreground reconstructions are used to generate observations.

Assuming the foreground reconstruction $\hat{\mathbf{f}}_t$ obtained via decoding the compressive observations from time *t* is accurate, a reliable tracker estimate can be computed. This estimate, $\hat{\mathbf{z}}_t$, can then be used to select values for the weights $[w(1) \ldots w(N)]$ at time *t* + 1. If the weights are close to their ideal value (4.9), the value of $\hat{\mathbf{f}}_{t+1}$ obtained from the weighted decoding procedure will be of higher quality than that obtained from a more generic CS decoder. (Cossalter et al., 2010) explore two methods via which the weights at time *t* + 1 can be selected using $\hat{\mathbf{f}}_t$ and $\hat{\mathbf{z}}_t$. The best of these consists of three steps: *1)* thresholding the entries of $\hat{\mathbf{f}}_t$, *2)* translating the thresholded silhouettes for a single time step according to the motion model and $\hat{\mathbf{z}}_t$, and *3)* dilating the translated silhouettes using a predefined dilation element. The final step accounts for uncertainty in the change of object appearance from one frame to the next. The result is a modified foreground image, which can then be interpreted as a prediction of $\mathbf{f}_{t+1}$. This prediction is used to define the weights according to (4.9), and the weighted decoding procedure is used to obtain $\hat{\mathbf{f}}_{t+1}$.
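These three steps might be sketched as follows for an *n* × *n* foreground image, assuming an integer per-frame pixel velocity extracted from $\hat{\mathbf{z}}_t$ and a hypothetical 5 × 5 square dilation element; the mapping from the predicted silhouette back to weight values is a crude stand-in for the paper's exact choice.

```python
# Sketch of the three-step weight prediction: threshold, translate, dilate.
import numpy as np
from scipy.ndimage import binary_dilation

def predict_weights(f_hat_t, velocity, thresh, eps=1e-3):
    mask = np.abs(f_hat_t) > thresh                        # 1) threshold entries
    dy, dx = velocity
    # 2) translate one time step (np.roll wraps at the borders; a real
    # implementation would shift with zero fill)
    mask = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    mask = binary_dilation(mask, np.ones((5, 5), bool))    # 3) dilate silhouettes
    # Interpret the dilated silhouette as a prediction of f_{t+1} and form
    # weights via (4.9): small inside the predicted support, large outside.
    f_pred = np.where(mask, np.abs(f_hat_t).max(), 0.0)
    return 1.0 / (f_pred + eps)
```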

The above method is repeated at each new time instant. For a fixed compressive measurement rate, it is shown to provide more accurate foreground reconstructions than decoders that do not take advantage of the tracker output. Accordingly, it is also the case that such a method is able to more successfully tolerate lower bit rates. These results reveal the benefit of using the high level tracker information in compressive sensing systems.

#### **4.4 Compressive sensing for multi-view tracking**

Another direct application of CS to a data-rich tracking problem is presented by (Reddy et al., 2008). Specifically, a method for using multiple sensors to perform multi-view tracking employing a coding scheme based on compressive sensing is developed. Assuming that the observed data contains no background component (this could be realized, e.g., by preprocessing using any of the background subtraction techniques previously discussed), the method uses known information regarding the sensor geometry to facilitate a common data encoding scheme based on CS. After data from each camera is received at a central processing station, it is fused via CS decoding and the resulting image or three dimensional grid can be used for tracking.

The first case considered is one where all objects of interest exist in a known ground plane. It is assumed that the geometric transformation between it and each sensor plane is known. That is, if there are *C* cameras, then the *homographies* $\{\mathbf{H}_j\}_{j=1}^C$ are known. The relationship between coordinates $(u, v)$ in the $j$th image and the corresponding ground plane coordinates $(x, y)$ is determined by $\mathbf{H}_j$ as

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \mathbf{H}_j \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad , \tag{4.10}
$$

where the coordinates are written in accordance with their homogeneous representation. Since $\mathbf{H}_j$ can vary widely across the set of cameras due to varying viewpoint, an encoding scheme designed to achieve a common data representation is presented. First, the ground plane is sampled, yielding a discrete set of coordinates $\{(x_i, y_i)\}_{i=1}^N$. An occupancy vector, **x**, is defined over these coordinates, where $\mathbf{x}(n) = 1$ if foreground is present at the corresponding coordinates and is 0 otherwise. For each camera's observed foreground image in the set $\{\mathbf{I}_j\}_{j=1}^C$, an occupancy vector $\tilde{\mathbf{y}}_j$ is formed as $\tilde{\mathbf{y}}_j(i) = \mathbf{I}_j(u_i, v_i)$, where $(u_i, v_i)$ are the (rounded) image plane coordinates corresponding to $(x_i, y_i)$ obtained via (4.10). Thus, $\tilde{\mathbf{y}}_j = \mathbf{x} + \mathbf{e}_j$, where $\mathbf{e}_j$ represents any error due to the coordinate rounding and other noise. Figure 4 illustrates the physical configuration of the system.

Fig. 4. Physical diagram capturing the assumed setup of the multi-view tracking scenario. Figure originally appears in (Reddy et al., 2008).

Noting that **x** is often sparse, the camera data $\{\tilde{\mathbf{y}}_j\}_{j=1}^C$ is encoded using compressive sensing. First, *C* measurement matrices $\{\mathbf{\Phi}_j\}_{j=1}^C$ of equal dimension are formed according to a construction that affords them the RIP of appropriate order for **x**. Next, the camera data is projected into the lower-dimensional space by computing $\mathbf{y}_j = \mathbf{\Phi}_j \tilde{\mathbf{y}}_j$, $j = 1, \ldots, C$. This lower-dimensional data is transmitted to a central station, where it is ordered into the following structure:

$$
\begin{bmatrix} \mathbf{y}\_1 \\ \vdots \\ \mathbf{y}\_C \end{bmatrix} = \begin{bmatrix} \Phi\_1 \\ \vdots \\ \Phi\_C \end{bmatrix} \mathbf{x} + \begin{bmatrix} \mathbf{e}\_1 \\ \vdots \\ \mathbf{e}\_C \end{bmatrix} \tag{4.11}
$$

which can be written as **y** = **Φx** + **e**. This is a noisy version of the standard CS problem presented in Section 3, and an estimate of **x** can be found using a relaxed version of (3.3), i.e.,

$$\hat{\mathbf{x}} = \min\_{\mathbf{z} \in \mathbb{R}^N} \|\mathbf{z}\|\_1 \quad \text{subject to} \quad \|\boldsymbol{\Phi}\mathbf{z} - \mathbf{y}\|\_2 \le \|\mathbf{e}\|\_2 \quad . \tag{4.12}$$

The estimated occupancy grid (formed, e.g., by thresholding **xˆ**) can then be used as input to subsequent tracker components.
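A sketch of the central-station fusion follows. For simplicity, the constrained program (4.12) is replaced by its common unconstrained LASSO-style relaxation and solved with plain iterative soft-thresholding (ISTA); this substitution is an assumption made to keep the example self-contained.

```python
# Sketch of fusing per-camera measurements as in (4.11) and decoding an
# occupancy estimate, using ISTA on a LASSO relaxation of (4.12).
import numpy as np

def fuse_and_decode(Phis, ys, lam=0.01, iters=500):
    Phi = np.vstack(Phis)              # stack the per-camera matrices as in (4.11)
    y = np.concatenate(ys)
    L = np.linalg.norm(Phi, 2) ** 2    # Lipschitz constant of the gradient
    z = np.zeros(Phi.shape[1])
    for _ in range(iters):
        g = Phi.T @ (Phi @ z - y)      # gradient of 0.5 * ||Phi z - y||^2
        z = z - g / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

# The occupancy grid follows by thresholding, e.g.:
#   occupancy = fuse_and_decode([Phi_1, Phi_2], [y_1, y_2]) > 0.5
```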

The above process is also extended to three dimensions, where **x** represents an occupancy grid over 3D space, and the geometric relationship in (4.10) is modified to account for the added dimension. The rest of the process is entirely similar to the two dimensional case. Of particular note is the advantage in computational complexity: it is only on the order of the dimension of **x** as opposed to the number of measurements received.

#### **4.5 Compressive particle filtering**

The final application of compressive sensing in tracking presented in this chapter is the compressive particle filtering algorithm developed by (Wang et al., 2009). As in Section 4.1, it is assumed that the system uses a sensor that is able to collect compressive measurements. The goal is to obtain tracks *without* having to perform CS decoding. That is, the method solves the sequential estimation problem using the compressive measurements directly, avoiding procedures such as (3.3). Specifically, the algorithm is a modification to the particle filter of Section 2.2.2.

First, the system is formulated in state space, where the state vector at time *t* is given by

$$\mathbf{s}_t = [s_t^x \; s_t^y \; \dot{s}_t^x \; \dot{s}_t^y \; \psi_t]^T \quad . \tag{4.13}$$

Here, $(s_t^x, s_t^y)$ and $(\dot{s}_t^x, \dot{s}_t^y)$ represent the object position and velocity in the image plane, and $\psi_t$ is a parameter specifying the width of an appearance kernel. The appearance kernel is taken to be a Gaussian function defined over the image plane and centered at $(s_t^x, s_t^y)$ with i.i.d. component variance proportional to $\psi_t$. That is, given $\mathbf{s}_t$, the $j$th component of the vectorized image, $\mathbf{z}_t$, is defined as

$$\mathbf{z}_t^j(\mathbf{s}_t) = C_t \exp\left\{-\psi_t \left\| \begin{bmatrix} s_t^x \\ s_t^y \end{bmatrix} - \mathbf{r}^j \right\|_2^2\right\} \quad , \tag{4.14}$$

where $\mathbf{r}^j$ specifies the two dimensional coordinate vector belonging to the $j$th component of $\mathbf{z}_t$. The state equation is given by

$$\mathbf{s}\_{t+1} = f\_t(\mathbf{s}\_t, \mathbf{v}\_t) = \mathbf{D}\mathbf{s}\_t + \mathbf{v}\_t \quad , \tag{4.15}$$

where

$$\mathbf{D} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \tag{4.16}$$

and $\mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathrm{diag}(\alpha))$ for a preselected noise variance vector $\alpha$.

The observation equation specifies the mapping from the state to the observed compressive measurements **y***t*. If **Φ** is the CS measurement matrix used to sense **z***t*, this is given by

$$\mathbf{y}\_t = \Phi \mathbf{z}\_t(\mathbf{s}\_t) + \mathbf{w}\_t \tag{4.17}$$

where **w***<sup>t</sup>* is zero-mean Gaussian measurement noise with covariance **Σ**.

With the above specified, the bootstrap particle filtering algorithm presented in Section 2.2.2 can be used to sequentially estimate $\mathbf{s}_t$ from the observations $\mathbf{y}_t$. Specifically, the importance weights belonging to candidate samples $\{\tilde{\mathbf{s}}_t^{(i)}\}_{i=1}^N$ can be found via

$$\tilde{w}_t^{(i)} = p(\mathbf{y}_t | \tilde{\mathbf{s}}_t^{(i)}) = \mathcal{N}(\mathbf{y}_t; \mathbf{\Phi}\mathbf{z}_t(\tilde{\mathbf{s}}_t^{(i)}), \mathbf{\Sigma}) \tag{4.18}$$

and rescaling to normalize across all *i*. These importance weights can be calculated at each time step without having to perform CS decoding on **y**. In some sense, the filter is acting purely on compressive measurements, and hence the name "compressive particle filter."
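The full recursion can be sketched as below, assuming a diagonal measurement covariance $\mathbf{\Sigma} = \sigma^2\mathbf{I}$ so the likelihood (4.18) is cheap to evaluate; the image size and the fixed kernel constant are illustrative assumptions.

```python
# Sketch of one compressive particle filter step: propagate via (4.15),
# weight via (4.18) directly on compressive measurements, then resample.
import numpy as np

def render(s, n, C_t=1.0):
    """Appearance image z_t(s) of (4.14): a Gaussian blob at (s[0], s[1])."""
    yy, xx = np.mgrid[0:n, 0:n]
    d2 = (xx - s[0]) ** 2 + (yy - s[1]) ** 2
    return (C_t * np.exp(-s[4] * d2)).ravel()

def pf_step(particles, y_t, Phi, n, D, alpha, sigma2, rng):
    # Propagate each particle through the state equation (4.15)
    particles = particles @ D.T + rng.normal(0.0, np.sqrt(alpha), particles.shape)
    # Importance weights (4.18), evaluated without any CS decoding
    logw = np.array([-0.5 * np.sum((y_t - Phi @ render(s, n)) ** 2) / sigma2
                     for s in particles])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Bootstrap resampling
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```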

#### **5. Summary**


This chapter presented current applications of CS in visual tracking. In the presence of large quantities of data, algorithms common to classical tracking can become cumbersome. To provide context, a review of selected classical methods was given, including background subtraction, Kalman and particle filtering, and mean shift tracking. As a means by which data reduction can be accomplished, the emerging theory of compressive sensing was presented. Compressive sensing measurements **y** = **Φx** necessitate a nonlinear decoding process, which makes accomplishing high-level tracking tasks difficult. Recent research addressing this problem was presented. Compressive background subtraction was discussed as a way to incorporate compressive sensors into a tracking system and obtain foreground-only images using a reduced amount of data. Kalman filtered CS was then discussed as a computationally and data-efficient way to track slowly moving objects. As an example of exploiting high-level tracker information within a CS system, a method that uses this information to improve the foreground estimate was presented. In the realm of multi-view tracking, CS was used as part of an encoding scheme that enabled computationally feasible occupancy map fusion in the presence of a large number of cameras. Finally, a compressive particle filtering method was discussed, by which tracks can be computed directly from compressive image measurements.

The above research represents significant progress toward performing high-level tasks such as tracking in the presence of data reduction schemes like CS. However, there is certainly room for improvement. Just as CS was developed by considering the integration of sensing and compression, future research in this field must jointly consider sensing and the end goal of the system, i.e., high-level information. Sensing strategies devised in accordance with such considerations should be able to efficiently handle the massive quantities of data present in modern surveillance systems by sensing and processing only that which will yield the most relevant information.

#### **6. References**

Anderson, B. & Moore, J. (1979). *Optimal Filtering*, Dover.

Bruckstein, A., Donoho, D. & Elad, M. (2009). From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images, *SIAM Review* 51(1): 34–81.

Candès, E. & Wakin, M. (2008). An introduction to compressive sampling, *IEEE Signal Processing Magazine* 25(2): 21–30.

Cevher, V., Sankaranarayanan, A., Duarte, M., Reddy, D., Baraniuk, R. & Chellappa, R. (2008). Compressive sensing for background subtraction, *ECCV 2008*.

Comaniciu, D. & Meer, P. (2002). Mean shift: a robust approach toward feature space analysis, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 24(5): 603–619.

Comaniciu, D., Ramesh, V. & Meer, P. (2003). Kernel-based object tracking, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 25(5): 564–577.


**2**

**A Construction Method for Automatic Human Tracking System with Mobile Agent Technology**

Hiroto Kakiuchi, Kozo Tanigawa, Takao Kawamura and Kazunori Sugahara

*Melco Power Systems Co., Ltd/Graduate School of Tottori University*

*Japan*

#### **1. Introduction**

Human tracking systems that can track a specific person are being actively researched and developed, since such systems are valuable for security and for flexible services such as the investigation of human behaviour. For example, Terashita, Kawaguchi and others propose methods for tracking an object captured by a simple active video camera (Terashita et al. 2009; Kawaguchi et al. 2008), and Yin and others propose a solution to the problem of blurring in active video cameras (Yin et al. 2008). Tanizawa and others propose a mobile agent framework that can serve as the base of a human tracking system (Tanizawa et al. 2002). These are component technologies that can be used in the construction of a human tracking system. On the other hand, Tanaka and others propose a human tracking system that uses information from video cameras and sensors (Tanaka et al. 2004), and Nakazawa and others propose a human tracking system that uses a recognition technique to identify the same person across multiple video cameras at the same time (Nakazawa et al. 2001). However, although these proposed systems function as human tracking systems, they are constructed under the assumption of fixed camera positions and an unchanging photography range.

There is also a body of research on tracking people with active cameras. Wren and others propose a class of hybrid perceptual systems that builds a comprehensive model of activity in a large space, such as a building, by merging contextual information from a dense network of ultra-lightweight sensor nodes with video from a sparse network of high-capability sensors. They explore the task of automatically recovering the relative geometry between an active camera and a network of one-bit motion detectors. Takemura and others propose view planning of multiple cameras for tracking multiple persons for surveillance purposes (Takemura et al. 2007). They develop a multi-start local search (MLS) based planning method that iteratively selects fixation points of the cameras so that the expected number of tracked persons is maximized. Sankaranarayanan and others discuss the basic challenges in detection, tracking, and classification using multiview inputs (Sankaranarayanan et al. 2008). In particular, they discuss the role of the geometry induced by imaging with a camera in estimating target characteristics. Sommerlade and others propose a consistent probabilistic approach to concertedly control multiple, diverse active cameras observing a scene (Sommerlade et al. 2010). The cameras react to objects moving about, arbitrating the conflicting interests of target resolution and trajectory accuracy, and they anticipate the appearance of new targets. Porikli and others propose an automatic

