## 2. Tracking by detection

The most general type of tracking is single-object model-free online tracking, in which the object is annotated in the first frame and tracked in the subsequent frames with no prior knowledge about the target's appearance, its motion, the background, the camera configuration, or other conditions of the scene. Visual tracking is still considered a challenging problem despite numerous efforts to address abrupt appearance changes of the target [13], complex transformations [14] and deformations [15, 16], background clutter [17], occlusion [18], and motion artifacts [19].

General object tracking is the task of tracking arbitrary objects through one-shot learning, typically with no a priori knowledge about the target's geometry, category, or appearance. In this model-free setting, the tracker must learn the target appearance and update it on the fly to keep up with the target's changes. Generative trackers attempt to construct a robust object appearance model or to learn it online using advanced machine learning techniques such as subspace learning [20], hash learning [21], dictionary learning [22], and sparse code learning [13]. Discriminative models, in contrast, focus on target/background separation using correlation filters [23–25] or dedicated classifiers [26], which has helped them dominate the visual tracking benchmarks [27–29]. Tracking-by-detection approaches have become a popular trend in recent years, owing to significant breakthroughs in the object detection domain (deep residual neural networks [30], for instance) that yield strong discriminative power with offline training. When adopted for visual tracking, many such trackers are adjusted for online training and accumulate knowledge about the target with each successful detection (e.g., [26, 31–33]).

Tracking-by-detection methods primarily treat tracking as a detection problem, avoiding the need to model object dynamics explicitly, which is especially beneficial in the case of sudden motion changes, extreme deformations, and occlusions [34, 35]. However, the tracking-by-detection setting suffers from a multitude of drawbacks:

1. Label noise: inaccurate labels confuse the classifier [15] and degrade the classification accuracy [34]. The labeler is typically built upon heuristics and intuitions, rather than using the accumulated knowledge about the target.

2. Self-learning loop: the classifier is retrained on its own output from earlier frames, thus accumulating error over time [35].

3. Uniform treatment of samples: all samples receive equal weight when evaluating the target [36] and training the classifier [37], despite the uneven contextual information they carry. In particular, negative examples that overlap very little with the target bounding box are treated the same as negative examples with significant overlaps.

4. Stationarity assumption: assuming a stationary distribution of the target appearance does not hold in most real-world scenarios with drastic target appearance changes [35]. In the context of visual tracking, non-stationarity means that the appearance of an object may change so significantly that a negative sample in the current frame looks more similar to a positive example from the previous frames.

5. Model update difficulties: adaptive trackers inherently suffer from the drifting problem. Noisy model updates [38] and a mismatch between the model update frequency and the rate of target appearance change further aggravate this drift.


A typical tracking-by-detection method consists of five major steps: SAMPLING, CLASSIFYING, LABELING, ESTIMATING, and UPDATING.

SAMPLING: To obtain the positive sample(s) and negative samples (the target and the background, respectively), dense or sparse (stochastic) sampling is performed either around the last known target position (using Gaussian distributions, particle filters, or various motion models) or around saliencies or key points in the current frame [21]. Adaptive sample weights based on appearance similarity to the target [42], occlusion state [18], and spatial distance to the previous target location [43] have been considered; especially in the context of tracking by detection, boosting [44] has been extensively investigated [45–47].
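To make the sampling step concrete, the following is a minimal sketch of Gaussian sampling around the previous bounding box. The state parameterization (center, width, height), the function name, and the spread parameters are illustrative assumptions, not the exact configuration of any particular tracker.

```python
import numpy as np

def gaussian_sampling(prev_box, n_samples=256, pos_sigma=8.0, scale_sigma=0.05, rng=None):
    """Draw candidate boxes around the previous target state (hypothetical helper).

    prev_box: (cx, cy, w, h) of the last known target position.
    Returns an (n_samples, 4) array of candidate boxes.
    """
    rng = np.random.default_rng() if rng is None else rng
    cx, cy, w, h = prev_box
    # Perturb the center with a Gaussian motion model ...
    centres = rng.normal([cx, cy], pos_sigma, size=(n_samples, 2))
    # ... and the size with a log-Gaussian scale change.
    scales = np.exp(rng.normal(0.0, scale_sigma, size=(n_samples, 2)))
    sizes = scales * np.array([w, h])
    return np.hstack([centres, sizes])
```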


CLASSIFYING: The classification module of tracking-by-detection schemes utilizes offline-trained classifiers or online supervised learning methods to separate the target from its background (e.g., [48]). To robustify this module, especially against label noise, supervised learning with robust loss functions [46, 49] as well as semi-supervised [39, 50] and multi-instance [47, 51, 52] learning approaches have been considered. Efficient sparse sampling [53], leveraging context information [17, 54], considering the information content of samples for the classifier [55], and landmark-based label propagation [43] are among other approaches proposed to address this issue. Another interesting approach is to couple the labeling and updating processes to bridge the gap between the objectives of these two steps, as labeling aims at predicting binary sample labels, whereas updating typically tries to estimate the object location [15]. The label-noise problem is amplified when the tracker has no forgetting mechanism or way to obtain external scaffolds (i.e., the self-learning loop). This has inspired the use of co-tracking [34], ensemble tracking [56, 57], or label verification schemes [58] to break the self-learning loop using auxiliary classifiers.

LABELING: The result of the classification process provides the target/background label for each sample, a process which can be enhanced by employing an ensemble of classifiers [56, 57], exchanging information between collaborative classifiers [34], and verifying labels with auxiliary classifiers [58] or landmarks [43].

ESTIMATING: The state of the target, i.e., the location and scale of the target usually described with a bounding box, is then determined by selecting the sample with the highest classification score [15], calculating the expectation of the target state over the samples [41], or refining the estimate with a bounding box regression [59].
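As a small illustration of the two most common estimation rules mentioned above, the sketch below contrasts picking the highest-scoring sample with taking a score-weighted expectation over all samples; the box parameterization and the softmax weighting are assumptions made for this example.

```python
import numpy as np

def estimate_state(boxes, scores, mode="argmax"):
    """Turn scored candidate boxes into a single target state (illustrative only).

    boxes:  (N, 4) candidate states, e.g. (cx, cy, w, h) per row.
    scores: (N,) classifier scores for the candidates.
    """
    if mode == "argmax":
        # Pick the single best-scoring candidate.
        return boxes[np.argmax(scores)]
    # Expectation over candidates, with scores turned into weights via a softmax.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ boxes
```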

UPDATING: Updating the classifier is another challenge of tracking-by-detection schemes. Updating the classifier with data it has labeled itself in previous frames, in a closed loop (known as the self-learning loop), is susceptible to drift from the original data distribution, because a tiny error or a small amount of noise can be amplified. Therefore, along with many lines of research on revalidating the data labels (such as [58]), the importance of having a "teacher" to guide the classifier during training has been discussed in the literature [39]. Cooperative classifiers in frameworks such as ensembles of homogeneous or heterogeneous classifiers [60], co-learning [34], and hybrids of generative and discriminative models [61] are some of the approaches that provide this guidance through cooperation. Furthermore, selecting features based on their discrimination ability [45], replacing the weakest classifier of an ensemble [45] or the oldest one [60], or applying a budget to the sample pool (hence keeping only some prototypical samples) [15, 43] have been proposed to improve the performance of such solutions.

On top of that, the frequency of updates is another important factor in a tracker's performance [39]. Higher update rates capture rapid target changes but are prone to occlusions, whereas slower update paces provide a long memory that lets the tracker handle temporary target variations but lack the flexibility to accommodate permanent target changes. To this end, researchers have tried to combine long- and short-term memories [62], roll back improper updates [57], or utilize different temporal snapshots of the classifier to overcome the non-stationary distribution of the target's appearance [63]. This pipeline, however, has been altered in some studies to introduce desired properties, e.g., to avoid label noise by merging the sampling and labeling steps [15].
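The sketch below illustrates one simple way such a trade-off could be realized with two classifier "memories" updated at different rates. It is only a schematic illustration under assumed interfaces (classifier_factory, update_fn, and a score method), not the actual schemes of [62] or [63].

```python
class DualMemoryTracker:
    """Minimal sketch of a long-/short-term update schedule.

    `classifier_factory` and `update_fn` are assumed, hypothetical callables:
    the factory builds a fresh classifier, and update_fn(clf, patches, labels)
    retrains it on newly labeled samples.
    """

    def __init__(self, classifier_factory, update_fn, long_interval=20):
        self.short = classifier_factory()   # updated every frame: adapts quickly
        self.long = classifier_factory()    # updated rarely: resists occlusion-induced drift
        self.update_fn = update_fn
        self.long_interval = long_interval

    def score(self, patch):
        # Blend the two memories; equal weights are an arbitrary choice here.
        return 0.5 * self.short.score(patch) + 0.5 * self.long.score(patch)

    def update(self, frame_index, patches, labels):
        self.update_fn(self.short, patches, labels)
        if frame_index % self.long_interval == 0:
            self.update_fn(self.long, patches, labels)
```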

#### 2.1. Formalization


Online visual tracking is the task of updating the state vector $\mathbf{p}_t$, involving the location, size, and shape of the bounding box, at each observation of a video frame $t = 1, \dots, T$. The update process is sometimes written with a transformation $\mathbf{y}_t$ that transforms the previous state vector $\mathbf{p}_{t-1}$ into the current state $\mathbf{p}_t = \mathbf{p}_{t-1} \circ \mathbf{y}_t$.

In the tracking-by-discrimination framework, we utilize a classifier $\theta_t$ that discriminates an image patch $\mathbf{x}$ into either target or background. The classifier is denoted as a real-valued discriminant function $h(\mathbf{x} \mid \theta_t) \in \mathbb{R}$, and the function value $s = h(\mathbf{x} \mid \theta_t)$ is called a discrimination score or, in short, score. The patch $\mathbf{x}$ (i.e., the area of the image bounded by the bounding box $\mathbf{p}_t$) is labeled as target if $s > \tau$ for a threshold $\tau$ and as background if $s < \tau$. A typical procedure of tracking-by-discrimination is written as follows.

SAMPLING: We obtain $N$ sample states $\mathbf{p}_t^j$, $j = 1, \dots, N$, by drawing random transformations $\mathbf{y}_t^j \in \mathcal{Y}_t$ using a dense or sparse sampling strategy and transforming the previous state $\mathbf{p}_{t-1}$ with each transformation $\mathbf{y}_t^j$ as $\mathbf{p}_t^j = \mathbf{p}_{t-1} \circ \mathbf{y}_t^j \in \mathcal{P}_t$. The corresponding image patches $\mathbf{x}_t^j \in \mathcal{X}_t$ are then extracted from the image.

CLASSIFYING: We calculate the score $s_t^j$ of the image patch $\mathbf{x}_t^{\mathbf{p}_t^j}$ corresponding to each sample (bounding box), using the current classifier $\theta_t$ ($h: \mathcal{X} \to \mathbb{R}$):

$$s_t^j = h\left(\mathbf{x}_t^{\mathbf{p}_{t-1} \circ \mathbf{y}_t^j} \,\middle|\, \theta_t\right) \tag{1}$$

LABELING: We determine the label $l_t^j$ of each sample $j$ using its score. If the score is above the threshold $\tau$, the sample is likely to be a target match:

$$l_t^j = \operatorname{sign}\left(s_t^j - \tau\right) \tag{2}$$

ESTIMATING: We determine the next target state $\mathbf{p}_t$, typically by selecting the best $\mathbf{p}_t^j$, i.e., the one corresponding to the maximum score $s_t^j$: $\mathbf{p}_t = \mathbf{p}_{t-1} \circ \mathbf{y}_t^{j^*}$ s.t. $j^* = \operatorname{argmax}_{j \in \{1, \dots, N\}} s_t^j$.

UPDATING: Finally, we update the classifier using its own labeled data:

$$\theta_{t+1} = u\left(\theta_t, \mathcal{X}_t, \mathcal{L}_t\right) \tag{3}$$


in which $u(\cdot)$ is the update function (e.g., a budgeted SVM update [15]) and $\mathcal{X}_t$, $\mathcal{L}_t$ are the sets of input patches and output labels used as the training set of the discriminator.
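Putting Eqs. (1)–(3) together, the following sketch shows the complete tracking-by-discrimination loop. The callables sample_fn, crop_fn, score_fn, and update_fn are hypothetical placeholders for a concrete sampler, patch extractor, classifier, and update rule; they are assumptions of this sketch rather than the implementation used in this chapter.

```python
import numpy as np

def track_by_discrimination(frames, p0, theta0, sample_fn, crop_fn, score_fn, update_fn,
                            tau=0.0, n_samples=256):
    """Sketch of the SAMPLING/CLASSIFYING/LABELING/ESTIMATING/UPDATING loop (Eqs. 1-3).

    sample_fn(prev_state, n) -> (n, 4) candidate states p_t^j
    crop_fn(frame, state)    -> image patch x_t^j
    score_fn(theta, patch)   -> discrimination score s_t^j      (Eq. 1)
    update_fn(theta, X, L)   -> updated classifier theta_{t+1}  (Eq. 3)
    """
    theta, p_prev = theta0, p0
    trajectory = [p0]
    for frame in frames:
        # SAMPLING: candidate states around the previous state.
        states = sample_fn(p_prev, n_samples)
        patches = [crop_fn(frame, p) for p in states]
        # CLASSIFYING: score each candidate patch (Eq. 1).
        scores = np.array([score_fn(theta, x) for x in patches])
        # LABELING: threshold the scores into target/background labels (Eq. 2).
        labels = np.sign(scores - tau)
        # ESTIMATING: take the best-scoring candidate as the new state.
        p_prev = states[np.argmax(scores)]
        trajectory.append(p_prev)
        # UPDATING: retrain the classifier on its own labeled samples (Eq. 3).
        theta = update_fn(theta, patches, labels)
    return trajectory
```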

#### 2.2. Baseline system implementation

To develop a baseline tracking-by-detection algorithm for this study, we use a robust part-based detector for the CLASSIFYING process. This detector employs strong low-level features based on histograms of oriented gradients (HOG) and uses a latent SVM to perform efficient matching for deformable part-based models (pictorial structures) [64]. From each frame, we draw N samples from a Gaussian distribution whose mean is the target's bounding box in the last frame (including its location and size). The detector then outputs a classification score for each sample, which is thresholded to obtain the sample's label. The sample with the highest classification score is taken as the current target location (Figure 1).

In the first frame, we generate $\alpha_1 N$ positive samples by perturbing the annotated target patch by a few pixels in location and size, select $\alpha_2 N$ negative samples from the local neighborhood of the target, and select $\alpha_3 N$ negative samples from the global background on a regular grid ($\alpha_1 + \alpha_2 + \alpha_3 = 1$). These samples are used to train the SVM detector in the first frame. In subsequent frames, the labels are obtained by the detector itself, and the classifier is batch-trained with all of the samples collected so far.
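A minimal sketch of this first-frame sample generation is given below; the split ratios, jitter magnitudes, and box parameterization are assumed values for illustration, not the tuned parameters of the baseline.

```python
import numpy as np

def first_frame_samples(target_box, frame_shape, n=1000, alphas=(0.5, 0.3, 0.2),
                        jitter=3, rng=None):
    """Illustrative first-frame sample generation (assumed alpha split and jitter).

    target_box: (cx, cy, w, h) annotated target in the first frame.
    Returns (boxes, labels) with +1 for positives and -1 for negatives.
    """
    rng = np.random.default_rng() if rng is None else rng
    a1, a2, a3 = alphas
    cx, cy, w, h = target_box
    H, W = frame_shape[:2]

    # alpha1*N positives: small perturbations of the annotated box in location and size.
    pos = np.array(target_box) + rng.normal(0, jitter, size=(int(a1 * n), 4))

    # alpha2*N negatives: drawn from the local neighborhood of the target
    # (in practice, boxes overlapping the target too much would be rejected).
    neg_local = np.array(target_box) + rng.normal(0, 3 * jitter, size=(int(a2 * n), 4))

    # alpha3*N negatives: centers on a regular grid over the whole frame.
    k = int(np.ceil(np.sqrt(int(a3 * n))))
    gx, gy = np.meshgrid(np.linspace(0, W, k), np.linspace(0, H, k))
    neg_grid = np.stack([gx.ravel(), gy.ravel(),
                         np.full(gx.size, w), np.full(gx.size, h)], axis=1)[:int(a3 * n)]

    boxes = np.vstack([pos, neg_local, neg_grid])
    labels = np.concatenate([np.ones(len(pos)), -np.ones(len(neg_local) + len(neg_grid))])
    return boxes, labels
```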

There are several parameters in the system, such as those of the sampling step (the number of samples $N$ and the effective search radius $\Sigma_{search}$). These parameters were tuned using a simulated annealing optimization on a cross-validation set. The part-based detector dictionary, the thresholds $\tau_l$ and $\tau_u$, and the rest of the abovementioned parameters were also adjusted using cross validation. With $N = 1000$ and $\tau = 0.34$, T1 achieved a speed of 47.29 fps on a Pentium IV PC @ 3.5 GHz with a Matlab/C++ implementation on a CPU.
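For reference, a generic simulated-annealing parameter search of the kind described here could be sketched as follows; the proposal and cross-validation loss functions are assumed placeholders, and the cooling schedule is arbitrary, so this is not the authors' exact tuning recipe.

```python
import math
import random

def anneal_parameters(initial, propose, cv_loss, t0=1.0, cooling=0.95, steps=200):
    """Generic simulated-annealing sketch for parameter tuning.

    initial : starting parameter dict, e.g. {"N": 1000, "tau": 0.3}
    propose : callable(params) -> perturbed copy of params (assumed helper)
    cv_loss : callable(params) -> tracking loss on a cross-validation set (assumed helper)
    """
    current, current_loss = initial, cv_loss(initial)
    best, best_loss = current, current_loss
    temperature = t0
    for _ in range(steps):
        candidate = propose(current)
        loss = cv_loss(candidate)
        # Always accept improvements; accept worse candidates with a temperature-dependent probability.
        if loss < current_loss or random.random() < math.exp((current_loss - loss) / temperature):
            current, current_loss = candidate, loss
            if loss < best_loss:
                best, best_loss = candidate, loss
        temperature *= cooling
    return best
```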

#### 2.3. Method of evaluation

The experiments are conducted on 100 challenging video sequences, OTB-100 [65], which involve many visual tracking challenges such as target appearance, pose, and geometry changes; environment lighting and camera position changes; target movement artifacts such as blur and trajectory variations; low imaging resolution and noise; and background objects which may cause occlusions, clutter, or target identity confusion. The performance of the trackers is compared using the area under the curve of success plots and precision plots, on all of the sequences or on a subset of them with a given attribute.

Figure 1. A simple tracking-by-detection pipeline. After gathering some samples from the current frame, the tracker employs its detector to label the samples as positive (target) or negative (background). The target position is estimated using these labeled samples. The labels, in turn, are used to update the classifier for the next frame.

Figure 2. Quantitative performance comparison of the baseline tracker (T1), its variant without model update (T0), and the state-of-the-art trackers using success plot.


The success plot indicates the reliability of the tracker and its overall performance, while the precision plot reflects the accuracy of the localization. A frame $t \in \{1, \dots, T\}$ counts as a success when the overlap of the tracker's target estimate $\mathbf{p}_t$ with the ground truth $\mathbf{p}_t^*$ exceeds a threshold $\tau_{ov}$. The success plot graphs the success rate of the tracker against different values of the threshold $\tau_{ov}$, and its area under the curve (AUC) is calculated as

$$\mathrm{AUC} = \frac{1}{T} \int_{0}^{1} \sum_{t=1}^{T} \mathbb{1}\left( \frac{|\mathbf{p}_{t} \cap \mathbf{p}_{t}^{*}|}{|\mathbf{p}_{t} \cup \mathbf{p}_{t}^{*}|} > \tau_{ov} \right) d\tau_{ov} \tag{4}$$

where $T$ is the length of the sequence; $|\cdot|$ denotes the area of a region; $\cap$ and $\cup$ stand for the intersection and union of regions, respectively; and $\mathbb{1}(\cdot)$ denotes the step function that returns 1 iff its argument is positive and 0 otherwise. This plot provides an overall measure of tracker performance, reflecting target loss, scale mismatches, and localization accuracy.
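As an illustration, the AUC of Eq. (4) can be approximated by evaluating the success rate on a grid of overlap thresholds and integrating numerically. The axis-aligned (x, y, w, h) box format and the threshold grid below are assumptions of this sketch.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_auc(estimates, ground_truth, thresholds=np.linspace(0, 1, 101)):
    """Area under the success plot (Eq. 4), approximated over a grid of overlap thresholds."""
    overlaps = np.array([iou(p, g) for p, g in zip(estimates, ground_truth)])
    # Success rate at each threshold: fraction of frames whose overlap exceeds it.
    success = np.array([(overlaps > t).mean() for t in thresholds])
    # Integrate the success curve over tau_ov in [0, 1] (trapezoidal rule).
    return np.trapz(success, thresholds)
```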

To establish a fair comparison with state-of-the-art tracking-by-detection algorithms, TLD [58] and STRUCK [15] are selected based on the results of [27], BSBT [66] and MIL [47] are selected based on popularity, and CSK [36] is selected as one of the latest algorithms in the category. Since our trackers contain random elements (in sampling and resampling), the results reported here are the average of five independent runs.

#### 2.4. Results


Figure 2 presents the success and precision plots of T1 along with other competitive trackers for all sequences. We also included a fixed version of the T1 tracker (a detector without model update), denoted T0, to emphasize the role of updating. The figure demonstrates that without model updates the detector cannot follow the changes in target appearance and loses the target rapidly in most scenarios (compare T0 and T1). However, it is also evident that a single tracker is not robust against all of the target's variations (in line with [60]), and the performance of T1 is still low.
