Section 1 Object Tracking

## **Chapter 1**

## Object Tracking Using Adapted Optical Flow

*Ronaldo Ferreira, Joaquim José de Castro Ferreira and António José Ribeiro Neves*

## **Abstract**

The objective of this work is to present an object tracking algorithm developed from the combination of random tree techniques and an optical flow adapted in terms of Gaussian curvature. This adaptation defines a minimal surface bounded by the contour of a two-dimensional image, which may or may not contain a minimum number of optical flow vectors associated with the movement of an object. The random tree has the purpose of verifying the existence of superfluous optical flow vectors and discarding them, defining a minimum number of vectors that characterizes the movement of the object. The results obtained were compared with those of the Lucas-Kanade algorithm with and without Gaussian filter, and with the Horn-Schunck and Farneback algorithms. The items evaluated were precision and processing time, which made it possible to validate the results despite the distinct nature of the algorithms. The results were similar to those obtained with Lucas-Kanade with or without Gaussian filter and with Horn-Schunck, and better than those of Farneback. This work allows analyzing the optical flow over small regions in a way that is optimal with respect to precision (and computational cost), enabling its application to areas such as cardiology, in the prediction of infarction.

**Keywords:** object tracking, vehicle tracking, optical flow, Gaussian curvature, random forest

## **1. Introduction**

Object tracking is defined as the problem of estimating an object's trajectory from a video image. There are several tools for tracking objects, used in various fields of research such as computer vision, digital video processing, and autonomous vehicle navigation [1]. With the emergence of high-performance computers and high-resolution cameras, and the growing use of so-called autonomous systems that, in addition to these items, require specialized tracking algorithms that are increasingly accurate and robust for automatic video analysis, the development of new object tracking techniques has become the target of numerous research efforts [2, 3].

Object tracking techniques are applicable to motion-based recognition [4], automatic surveillance systems [5], pedestrian flow monitoring in crosswalks [6], traffic control [7], and autonomous vehicular navigation [8]. Problems of this type are highly complex due to the characteristics of the object and the environment, generating many variables, which impairs performance and makes the application of tracking algorithms unfeasible in real-world situations. Some approaches seek to resolve this impasse by simplifying the problem and reducing the number of variables [9]. This process, in most cases, does not generate good results [10, 11], making it even more difficult to identify the main attributes to be selected to perform a task [12, 13].

Most object tracking problems occur in open, so-called uncontrolled environments [14]. The complexity of these problems has attracted the interest of the scientific community and generated numerous applied research efforts in various fields. Current approaches, such as those using convolutional neural networks (CNNs), deal well with the high number of variables in these types of problems, providing spatio-temporal information about the tracked objects through three-dimensional convolutions [15–17]. This creates an enormous number of learnable parameters, which in turn generates overfitting [11]. A solution to reduce this number of learnable parameters was to combine spatio-temporal data extracted using an optical flow algorithm, as in the two-stream technique [18–20]. However, this technique presents good results only for large datasets, showing itself to be inefficient for small ones [15, 21].

In recent years, machine learning has been applied to tracking problems, gaining notoriety due to the excellent results obtained in complex environments and in attribute extraction [21–23]. Deep learning stands out among these techniques for its excellent results in unsupervised learning problems [24], object identification [25], and semantic segmentation [26]. Random trees are also examples of machine learning techniques, notable for their precision, their great capacity to handle large volumes of data, and their low tendency to overfit [27, 28]. They are widely used in research areas such as medicine, in the prediction of hereditary diseases [29]; agriculture, to increase the productivity of a given crop; and astronomy, improving images captured by telescopes in the part of the electromagnetic spectrum not visible to the human eye [30]. The possibilities of application, and the new trends and research related to machine learning techniques, with particular attention to random trees, allow the development of algorithms that can be combined with existing ones, such as the optical flow algorithms belonging to the field of computer vision, taking advantage of the strengths of each [31–33].

Developing an algorithm whose objective is to track objects using the particular advantages of these techniques in a combined way justifies creating a tracking algorithm that combines the optical flow technique, adapted in this work in terms of the Gaussian curvature associated with a minimal surface, with random trees, expecting it to capture on this surface a minimum number of optical flow vectors that characterize the moving object, accurately and with low computational cost. This contributes not only to the field of computer vision but also to other branches of science; in medicine, for example, it can help in the early identification of infarctions.

## **2. Related works**

Due to the large number of studies related to object tracking, only a small selection surrounding this theme will be addressed; the focus of this project is not to make a thorough survey of the state of the art. In this section, the main works in the literature associated with object tracking will be presented. Among the various approaches used in this context, we highlight those focused on optical flow techniques and others belonging to machine learning, such as those using pattern recognition, which allow us to relate, frame, and justify the development of this proposal and its importance and contribution to the state of the art.

#### **2.1 Object tracking**

Object tracking is defined as a process that allows uniquely estimating and associating the movements of objects across consecutive image frames. The objects considered can range from a single pixel to a set of pixels belonging to a region of the image. The detection of pixels is done by a motion or object detector, which locates objects with similar characteristics that move between consecutive frames.

These characteristics of the object to be tracked are compared with the characteristics of a reference object, modeled by a classifier over a limited region of the frame called the region of interest, where the probability of detection of the object is greatest. Thus, according to [33], the tracked-object detector locates several objects in different parts of the region of interest and compares these objects with the reference object. This process is performed for each frame, and each detected object that is a candidate to be recognized as having the greatest possible similarity to the reference object can be represented through a set of fixed-size characteristics extracted from the region containing its set of pixels, which can in turn be represented by a numerical array of data.

Thus, mathematically, the region containing the set of pixels belonging to the object of interest, whose characteristics allow testing whether a region of the frame contains the object to be tracked, is given by:

$$\text{OC}\_{i}(t) = \left\{\text{OC}(t) : \|\mathbf{L}(\text{OC}(t)) - \mathbf{L}(\text{OR}(i-1))\| < \varepsilon\right\} \tag{1}$$

where $L(OC(t))$ is the position $(x, y)$ of the centroid of the candidate object $OC(t)$, $L(OR(i-1))$ is the position of the object tracked in the $(i-1)$-th frame of the video, and $\varepsilon > 0$ is a real value associated with the size of the region of the object of interest.
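As a minimal sketch of the gating rule in equation (1), assuming Euclidean distance between centroids (the function names and the threshold value are illustrative, not from the chapter):

```python
import math

def gate_candidates(candidates, last_position, eps):
    """Keep only candidate objects whose centroid lies within eps of the
    previously tracked object's position, as in equation (1)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return [c for c in candidates if dist(c, last_position) < eps]

# Example: previous position (100, 100), three candidate centroids.
kept = gate_candidates([(102, 101), (180, 90), (99, 103)], (100, 100), eps=10.0)
```

Only the two candidates near the previous position survive the gate; the far one is discarded before any appearance comparison is made.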

According to the works of [34, 35], learning methods are used to adapt to changes of movement and to other characteristics, such as the geometric aspect and appearance of the tracked object. These methods usually employ adaptive trackers and detectors of the tracked object. Other types of object trackers found in the literature are presented below.

According to [36], a classifier can be defined as a function $f$ belonging to a family of functions $F$ parameterized by a set of classifier parameters. Classifiers form a detector of objects to be tracked, which in turn is an integral part of a tracker. A classifier can also be trained, thereby generating a set of classification parameters and producing the function $f$ that efficiently indicates the classes $y_i$ of the test data $x_i$ from a training set $C_t = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The data are points in the *space of characteristics*, which can include entropy, gray level, among others.
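As an illustrative sketch of a training set $C_t$ and a classifier $f$ over a two-dimensional feature space (the nearest-centroid rule below is a stand-in chosen for brevity, not the chapter's random-tree model, and all feature values are synthetic):

```python
import numpy as np

# Synthetic training set Ct = {(x_i, y_i)}: 2-D feature vectors (e.g. entropy,
# mean gray level) with class labels (+1 tracked object, -1 background).
rng = np.random.default_rng(0)
X_obj = rng.normal(loc=5.0, scale=0.5, size=(20, 2))   # tracked-object features
X_bg  = rng.normal(loc=1.0, scale=0.5, size=(20, 2))   # background features
X = np.vstack([X_obj, X_bg])
y = np.array([1] * 20 + [-1] * 20)

# A minimal classifier f: assign the class of the nearest class centroid.
centroids = {c: X[y == c].mean(axis=0) for c in (1, -1)}

def f(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# A test point near the object cluster is labeled +1.
label = f(np.array([4.8, 5.1]))
```

Any member of $F$ (random trees included) could replace the centroid rule; the interface, a function from feature vectors to classes, stays the same.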

The classifier aims to determine the best way to discriminate the data classes in the space of characteristics. The test data form a set containing the characteristics of the candidate objects, which have not yet been classified. The position of the object to be tracked in the frame is defined as the position corresponding to the highest response of the detector over the candidate objects. Therefore, the position of the object to be tracked is determined by the position of the $i$-th candidate object that is most likely to belong to the class of the tracked object, given by the following equation:

$$\mathbf{L}(\text{OR}(t)) = \mathbf{L}\left(\arg\max\_{i} P(y\_i = \text{COR} \mid \text{OC}\_i(t), \text{PR}(t))\right) \tag{2}$$

$$P(y\_i = \text{COR} \mid \text{OC}\_i(t), \text{PR}(t)) = \begin{cases} 1/N, & \text{if } \|\mathbf{L}(\text{OC}\_i(t)) - \mathbf{L}(\text{OR}(t \mid t-1))\| < \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where the variable $COR$ in equation (3) denotes the class of the tracked objects, and the candidate objects $OC$ all have equal probability of occurrence. Depending on the types of classifiers used in the detectors of objects to be tracked, along with the initial detector, it is possible to use some learning technique to train them. One of the approaches used is offline training [36], adjusting the parameters of the classifier before running the tracker.
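Equations (2) and (3) can be sketched as follows; since all gated candidates are equiprobable (probability $1/N$), the argmax below simply returns one of them (names, the distance rule, and values are illustrative):

```python
import math

def detect_tracked_object(candidates, predicted_pos, eps):
    """Sketch of equations (2)-(3): each candidate within eps of the
    predicted position gets probability 1/N, all others get 0; the new
    position of the tracked object is the argmax over candidates."""
    gated = [c for c in candidates
             if math.hypot(c[0] - predicted_pos[0],
                           c[1] - predicted_pos[1]) < eps]
    if not gated:
        return None                       # no candidate matches the class
    n = len(gated)
    probs = {c: 1.0 / n for c in gated}   # equiprobable gated candidates
    return max(probs, key=probs.get)
```

In practice a real detector would break the tie with its classifier response; here the uniform probabilities of equation (3) leave the choice among gated candidates arbitrary.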

Offline-trained classifiers are generally employed in object detectors designed to detect all new objects of interest that enter the camera's field of view [37]. The training set $C_t$ must contain characteristics $x_i$ extracted from the objects to be tracked as well as diverse environmental characteristics. This allows new objects, with varied geometric characteristics and aspects, to be detected more efficiently. In online training, by contrast, the adjustment of the parameters of the classifier is performed during the tracking process; online-trained classifiers are generally used in detectors of objects to be tracked. Thus, in each frame, the newly extracted characteristics are used to adjust the classifiers.

#### *2.1.1 Binary classification*

In [38], trackers that use the tracking-by-detection technique deal with object tracking as a binary classification problem, whose goal is to find the best function $f$ that separates the objects to be tracked $R$ from the other objects in the environment. Object tracking seen as a binary classification problem is currently one of the subjects receiving the most attention in computer vision research.

In [39], trackers were developed that used detectors of objects to be tracked formed by committees of so-called weak binary classifiers. For [40], a binary classifier is a classifier used in problems where the class $y_i$ of a candidate object $OC_i$ belongs to the set $Y = \{-1, +1\}$. The negative class $\{-1\}$ refers to the characteristics of the environment and other objects; the positive class $\{+1\}$ refers to the class of the object to be tracked.

A classifier is said to be weak when its probability of correctly classifying a given data class is only slightly higher than that of a random classifier. The detector of the object to be tracked must separate the tracked object from the other objects and the environment; its purpose is to determine the position of the tracked object according to equations (1)–(3). According to [41, 42], the class of each $i$-th candidate object $OC_i$ is defined according to Bayesian decision theory, through minimal classification error. This means that the decision is given by observing the sign of the difference between $P(y_i = COR \mid OC_i(t), PR(t))$ and $P(y_i = CNOR \mid OC_i(t), PR(t))$, with the sum of these two probabilities being unitary.
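A committee of weak binary classifiers deciding by the sign of the summed votes can be sketched as follows (simple majority voting is assumed here as the committee rule; the cited works may weight the votes differently):

```python
def committee_decision(weak_votes):
    """Sketch of a committee of weak binary classifiers: each weak
    classifier votes +1 (tracked object) or -1 (background/other); the
    committee decides by the sign of the summed votes. Since the two
    class probabilities sum to one, this mirrors deciding by the sign
    of P(yi = COR | ...) - P(yi = CNOR | ...)."""
    s = sum(weak_votes)
    return 1 if s > 0 else -1

decision = committee_decision([1, 1, -1, 1, -1])  # three of five weak "hits"
```

Each vote alone is barely better than chance, but the aggregated sign can separate the tracked object reliably, which is the premise of committee-based detectors.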

#### *2.1.2 Monitoring systems*

For [43], the term monitoring system refers to a process of monitoring and autonomous control, without human intervention. This type of system has the function of detecting, classifying, tracking, analyzing, and interpreting the behavior of objects of interest. In [44, 45], this technique was combined with statistical techniques for controlling people's access to a specific location. Intelligent monitoring systems have also been applied to building, port, and ship security [46, 47].

*Object Tracking Using Adapted Optical Flow DOI: http://dx.doi.org/10.5772/intechopen.102863*

The functions comprised by a monitoring system are divided into so-called low- and high-level tasks. Among the high-level tasks, we highlight the analysis, interpretation and description of behavior, the recognition of gestures, and the decision between the occurrence or not of a threat. Performing high-level tasks requires that, for each frame, the system perform low-level tasks, which involve the direct manipulation of the image pixels [48–56]. As examples, we highlight the processes of noise elimination, detection of connected components, and obtaining information on the location and geometric aspect of the object of interest.

A monitoring system consists of five main components, which are presented in **Figure 1**. Some monitoring systems may not contain all components. The initial detector aims to detect the pixel regions of each frame that have a significant probability of containing an object to be tracked. This detector can be formed by a motion detector that detects all moving objects, based on models of objects previously recorded in a database or on characteristics extracted offline [40, 41]. The information obtained by the initial detector is processed by an image processor, which has the function of eliminating noise, segmenting, and detecting the connected components.

The regions containing the most relevant pixels are analyzed and then classified as objects of interest by the classifier [50–54]. Objects of interest are modeled and are then called reference objects, so that the tracker can determine their position frame by frame [55, 56].

A tracker, an integral part of a detector, is defined as a function that estimates the position of objects in each consecutive frame and defines the region of the object of interest for each $i$-th object being tracked within a region of interest. This estimation of the movement is performed through the correct association of the captured and tracked objects across consecutive video frames. Tracking is often interpreted as a data association problem. **Figure 1** shows a schematic of the main components of a monitoring system.
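The five-component pipeline described above can be sketched with illustrative stubs (all field names, thresholds, and stage logic below are assumptions made for the example, not the chapter's implementation):

```python
# Schematic of the monitoring pipeline: initial detector -> image processor
# -> classifier -> (object modeling) -> tracker. Each stage is a stub.

def initial_detector(frame):
    # Keep candidate pixel regions with significant motion probability.
    return [r for r in frame["regions"] if r["motion"] > 0.5]

def image_processor(regions):
    # "Denoise": keep connected components above a minimal area.
    return [r for r in regions if r["area"] >= 20]

def classifier(regions):
    # Label the surviving regions as objects of interest.
    return [dict(r, label="object") for r in regions]

def tracker(objects, previous_positions):
    # Associate each object of interest with its previous-frame position.
    return [(o["id"], previous_positions.get(o["id"])) for o in objects]

frame = {"regions": [{"id": 1, "motion": 0.9, "area": 50},
                     {"id": 2, "motion": 0.2, "area": 80},
                     {"id": 3, "motion": 0.8, "area": 5}]}
tracked = tracker(classifier(image_processor(initial_detector(frame))),
                  {1: (10, 12)})
```

Region 2 is dropped by the motion gate and region 3 by the area filter, so only region 1 reaches the tracker and is associated with its previous position, which is the data-association step the text describes.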

#### **2.2 State of the art in object tracking with optical flow**

Several techniques for calculating the optical flow vector have been developed in recent years [57]. These methods are grouped according to their main characteristics and the approach used for the calculation of the optical flow: the differential methods used in the studies in [56], the methods that calculate the optical flow through the frequency domain [46], the phase correlation methods [58], and the methods of association between regions [59].

The method proposed in [56] allows the calculation of the optical flow for each point around a neighborhood of pixels. In [60], a neighborhood of pixels is also considered, but in this case the calculation of the optical flow is performed geometrically. In the work presented in [61], regularization constraints are added. In [62], in turn, comparative performance analyses were carried out among the various optical flow algorithms present in the literature.

This technique is considered robust for detecting and tracking moving objects in images, both those captured by fixed and by mobile cameras. However, the technique has a high computational cost, which makes most practical applications unfeasible. Thus, to reduce this complexity, techniques of increasing resolutions were adopted in [63]. For the same purpose, subsampling over some of the pixels belonging to the object of interest was used to obtain the optical flow [52].

Other authors also use a point of interest detector to select the best pixels for tracking and calculate the optical flow on these points [52, 64]. The reduction in the number of points to be tracked is associated with a decrease in computational complexity, so in [52] the points of interest were selected using the FAST algorithm [64].

The method developed by Lucas and Kanade [56] is a differential method widely used in the literature, with several variations and modifications. It estimates the optical flow for each point $(x, y, t)$ by calculating the affine transformation $TA(x, t)$ applied to the pixels of a pixel grid centered at $(x, y)$, through the following function $f(x, t)$:

$$f(\mathbf{x}, t) = \min \left( \sum\_{\mathbf{x} \in (\text{pixel grid})} \left[Q(\mathbf{x}, t - 1) - Q(TA(\mathbf{x}), t)\right] g(\mathbf{x}) \right) \tag{4}$$

where *g x*ð Þ is a Gaussian smoothing filter centered on *x*.

New variations of these techniques have been proposed to make the calculation of the optical flow ever faster. In [65], a tracker was proposed based on the algorithm of [56]. The translation of a point, represented by a rectangular grid of 25 × 25 pixels, was calculated, and its validity evaluated by calculating the SSD<sup>1</sup> over the grid pixels in $Q(t)$ and in $Q(t-1)$. If the SSD is high, the point is dropped and stops being tracked.
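The SSD validity test described above can be sketched as follows (the threshold value is illustrative; [65] does not fix one here):

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of squared differences between two pixel grids."""
    d = patch_a.astype(float) - patch_b.astype(float)
    return float((d * d).sum())

def keep_point(prev_patch, curr_patch, threshold):
    """A tracked point is kept while the SSD between its grid in Q(t-1)
    and Q(t) stays below the threshold; otherwise it is dropped."""
    return ssd(prev_patch, curr_patch) <= threshold

rng = np.random.default_rng(1)
patch = rng.integers(0, 255, size=(25, 25))           # 25 x 25 grid, as in [65]
ok = keep_point(patch, patch + 1, threshold=2500.0)   # small change: kept
bad = keep_point(patch, np.zeros((25, 25)), 2500.0)   # large change: dropped
```

A uniform +1 brightness change over the 25 × 25 grid yields an SSD of 625, well under the threshold, while replacing the patch entirely drives the SSD far above it.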

In [51], objects were detected by subtracting the image of the environment, and the movement of the camera was removed with the optical flow algorithm proposed by [56]. The studies carried out in [66, 67] showed that the reliability of the estimated optical flow is reduced when some points of the object of interest have an optical flow that cannot be represented by the same matrix given by the affine transformation $TA(x, t)$ of the other points. Thus, to improve the robustness of the algorithm of [56], [67] proposed an independent optical flow calculation for each of the $N$ points belonging to the object of interest, selected with the SURF (Speeded Up Robust Features) point detector in the initial frame.

In [67], the Lucas-Kanade algorithm [56] was also modified by inserting the Hessian matrix in the calculation of the value of the variation of the affine transformation

<sup>1</sup> SSD: sum of squared differences between corresponding pixel values.


$\Delta TA(x, t)$. The algorithm allows for more effective tracking when partial occlusions, deformations, and changes in lighting occur, as the optical flow is not calculated considering all points of the objects of interest.

The proposal presented in [68] was the development of an algorithm to detect people in infrared images, combining pixel-value information with a motion detection method. The algorithm forms a map of relevant pixels by applying thresholding segmentation. While the camera is still, an image $M$ is built from the differentiation between frames. If the camera is in motion, $M$ is filled with the pixels obtained by analyzing the optical flow calculated by the algorithm of [56]. The map of relevant pixels is then replaced by the union between $M$ and the relevant pixel map in the first case, and by an intersection between $M$ and the relevant pixel map in the second case, to compensate for the movement of the camera.

The method for tracking swimmers presented in [46] uses the information of the movement pattern given by the optical flow and the appearance of the water, which is modeled by a MoG.<sup>2</sup> This allows calculating an optical flow vector for each pixel of the video independently of the others, through $B$, an array composed of the gradients in the $x$ and $y$ directions of the pixels in a grid.

In [69], a method was presented that incorporates physical constraints into the calculation of optical flow. The tracker uses the constraints to extract the moving pixels with a lower failure rate, although the calculation can be impaired when occlusions occur or when the environment has low light. The operator defines the physical constraints and selects the points of the $OR$ that are tracked by optical flow. Constraints can be geometric, kinematic, dynamic, related to the property of the material that makes up the $OR$, or of any other type.

In [70], the points that are tracked with the optical flow are defined by applying the Canny edge detector on the pixels of the reference pixel map. Pixels that produce a high response to the Canny detector are the selected points.

In [43], optical flow is used as a characteristic for tracking the contour of the object. The contour is shifted in small steps until the position in which the optical flow vectors are homogeneous is found.

In [64], they performed an estimate of the translation and orientation of the reference object by calculating the optical flow of the pixels belonging to its silhouette. The coordinates of the centroid position are defined by minimizing the Hausdorff distance between the mean of the optical flow vectors of the reference object and the candidate object to be chosen as the object of interest.
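The symmetric Hausdorff distance used in this matching can be sketched as follows (the point sets are illustrative; [64] applies the distance to optical flow statistics rather than raw points):

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets: the
    largest distance from a point in one set to its nearest neighbor
    in the other set, taken in both directions."""
    def d(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    def directed(P, Q):
        return max(min(d(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

# Two parallel segments one unit apart have Hausdorff distance 1.
h = hausdorff([(0, 0), (1, 0)], [(0, 1), (1, 1)])
```

Minimizing this quantity over candidate positions, as described above, selects the candidate whose flow-vector set best matches that of the reference object.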

#### *2.2.1 Optical flow as a function of Gaussian curvature*

Optical flow is defined as a dense vector field associated with the movement and apparent velocity of an object, given by the translation of pixels between consecutive frames in an image region. It can be calculated from the brightness constancy constraint on corresponding pixels in consecutive frames.

Mathematically, let $(x, y)$ be a pixel associated with a luminous intensity $I(x, y)$ on an image surface or plane, and consider a time interval and a sequence of frames associated with an apparent displacement of the pixel over that surface or plane. The rate of variation of the light intensity with respect to time, associated with the apparent movement of the pixel, is considered practically null and can be given by:

<sup>2</sup> MoG: mixture of Gaussian distributions.

$$\frac{dI(x,y)}{dt} = \frac{\partial I(x,y)}{\partial x} \frac{dx}{dt} + \frac{\partial I(x,y)}{\partial y} \frac{dy}{dt} + \frac{\partial I(x,y)}{\partial t} \tag{5}$$

$$\frac{dI(x,y)}{dt} = I\_x u + I\_y v + I\_t \tag{6}$$

$$\frac{dI(x,y)}{dt} = 0 \Longrightarrow I\_x u + I\_y v + I\_t = 0 \tag{7}$$

Equation (7) is called the optical flow constraint, where the terms $I_x$, $I_y$, $I_t$ denote the derivatives of the brightness intensity with respect to the coordinates $x$, $y$ and time $t$, and $u = u(x, y)$ and $v = v(x, y)$ are the horizontal and vertical components of the vector representing the optical flow field for the pixel $(x, y)$ in question.
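A small numerical sketch of the constraint in equation (7), using finite differences on a synthetic pair of frames (the frame contents are ours; a bright square shifts one pixel to the right, so the true flow is $(u, v) = (1, 0)$):

```python
import numpy as np

# Two 8x8 frames: a 3x3 bright square moves one pixel to the right.
frame1 = np.zeros((8, 8)); frame1[2:5, 2:5] = 1.0
frame2 = np.zeros((8, 8)); frame2[2:5, 3:6] = 1.0

# Spatial derivatives (rows correspond to y, columns to x) and the
# temporal derivative between the frames.
Iy, Ix = np.gradient(frame1)
It = frame2 - frame1

# The constraint Ix*u + Iy*v + It = 0 holds only approximately here,
# because central spatial differences are mixed with a forward temporal
# difference; the residual stays small at the moving edges.
residual = Ix * 1.0 + Iy * 0.0 + It
```

The residual is not exactly zero even for the true motion, which is one reason the least-squares treatment over a window, described next, is needed in practice.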

The number of variables in equation (7) is greater than the number of equations, which does not allow estimating the components of the vector and determining a unique solution for the optical flow constraint equation. Lucas and Kanade proposed a solution to this problem. Their method considers the flow constant in a region formed by a set of $N \times N$ pixels, so the optical flow constraint equation can be written for each pixel in this region, obtaining a system of equations in the two variables $v_x$ and $v_y$, that is:

$$\begin{aligned} \mathbf{I\_{x1}v\_x} + \mathbf{I\_{y1}v\_y} + \mathbf{I\_{t1}} &= \mathbf{0} \\\\ \mathbf{I\_{x2}v\_x} + \mathbf{I\_{y2}v\_y} + \mathbf{I\_{t2}} &= \mathbf{0} \\\\ \vdots \\\\ \mathbf{I\_{xp}v\_x} + \mathbf{I\_{yp}v\_y} + \mathbf{I\_{tp}} &= \mathbf{0} \end{aligned} \tag{8}$$

Passing the set of equations given by equation (8) to the matrix form we have:

$$
\begin{pmatrix}
\mathbf{I\_{x1}} & \mathbf{I\_{y1}} \\
\vdots & \vdots \\
\mathbf{I\_{xp}} & \mathbf{I\_{yp}}
\end{pmatrix}
\begin{pmatrix}
\mathbf{v\_x} \\
\mathbf{v\_y}
\end{pmatrix} + \begin{pmatrix}
\mathbf{I\_{t1}} \\
\vdots \\
\mathbf{I\_{tp}}
\end{pmatrix} = \begin{pmatrix}
\mathbf{0} \\
\vdots \\
\mathbf{0}
\end{pmatrix} \tag{9}
$$

Using the least squares method, the system of equations (9) in matrix form can be solved. Therefore, the optical flow $v = (v_x, v_y)$ can be estimated for a particular region or window of $N \times N$ pixels, that is:

$$
\mathbf{A}^{t}\mathbf{A} \begin{pmatrix} \mathbf{v\_x} \\ \mathbf{v\_y} \end{pmatrix} = -\mathbf{A}^{t} \begin{pmatrix} \mathbf{I\_{t1}} \\ \vdots \\ \mathbf{I\_{tp}} \end{pmatrix} \tag{10}
$$

Where:

$$
\begin{pmatrix}
\mathbf{I\_{x1}} & \mathbf{I\_{y1}} \\
\vdots & \vdots \\
\mathbf{I\_{xp}} & \mathbf{I\_{yp}}
\end{pmatrix} = \mathbf{A\_{px2}} = \mathbf{A}
$$

Therefore, one has that:

$$
\left(\mathbf{A}^{t}\mathbf{A}\right)^{-1}\left(\mathbf{A}^{t}\mathbf{A}\right) = \begin{pmatrix}
\mathbf{1} & \mathbf{0} \\
\mathbf{0} & \mathbf{1}
\end{pmatrix} = \mathbf{Id} \tag{11}
$$

Thus:

$$
\begin{pmatrix} \mathbf{v\_x} \\ \mathbf{v\_y} \end{pmatrix} = -\left(\mathbf{A}^{t}\mathbf{A}\right)^{-1}\mathbf{A}^{t} \begin{pmatrix} \mathbf{I\_{t1}} \\ \vdots \\ \mathbf{I\_{tp}} \end{pmatrix} \tag{12}
$$

This method has a reduced computational cost for determining the optical flow estimate when compared to other methods because it is simple: the region in which the variation of light intensity between pixels is minimal has size 2 × 2, contained in a region $N \times N$. In this way, the optical flow is determined in a 2 × 2 region, between these pixels, using only one matrix inversion operation (equation (12)).
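A minimal numerical sketch of the least-squares estimate in equation (12); the gradient values below are synthetic, chosen so that the system has an exact solution:

```python
import numpy as np

# Stacked spatial gradients A (p x 2) and temporal gradients It (length p)
# for the pixels of one window. The values are synthetic: It is built so
# that the true flow is (vx, vy) = (1.0, -0.5), i.e. It = -(Ix*vx + Iy*vy)
# holds exactly at every pixel.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
true_v = np.array([1.0, -0.5])
It = -A @ true_v

# Least-squares estimate, as in equation (12): v = -(A^t A)^{-1} A^t It.
v = -np.linalg.inv(A.T @ A) @ A.T @ It
```

Since the synthetic system is consistent, the least-squares solution recovers the true flow exactly; with real, noisy gradients it returns the window's best-fit flow instead.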

To calculate the optical flow over the $N \times N$ region, partial derivatives must be calculated at each pixel. However, by considering the variation of light intensity between pixels over the region as almost null, the small accumulated differences in brightness intensity between pixels compromise the accuracy of the optical flow with respect to the determination of the actual motion of the object; that is, the method gains in processing speed and loses precision in the determination of the motion. Differentiating equation (5) yields equation (13), that is:

$$
\xi\_a^2 = \alpha\_x a\_1 + \alpha\_y a\_2 + a\_3 u^2 + a\_4 v^2 + a\_5 uv + a\_6 u + a\_7 v + a\_8 \tag{13}
$$

where the terms $\alpha_x = \frac{\partial v_x}{\partial t}$ and $\alpha_y = \frac{\partial v_y}{\partial t}$ are called the components of the acceleration vector, $(v_x, v_y)$ are the components of the velocity vector, and the terms $a_1 = I_x = \frac{\partial}{\partial x} I(x,y,t)$; $a_2 = I_y = \frac{\partial}{\partial y} I(x,y,t)$; $a_3 = I_{xx} = \frac{\partial^2}{\partial x^2} I(x,y,t)$; $a_4 = I_{yy} = \frac{\partial^2}{\partial y^2} I(x,y,t)$; $a_5 = I_{xy} = \frac{\partial^2}{\partial x \partial y} I(x,y,t)$; $a_6 = I_{xt} = \frac{\partial^2}{\partial x \partial t} I(x,y,t)$; $a_7 = I_{yt} = \frac{\partial^2}{\partial y \partial t} I(x,y,t)$; $a_8 = I_{tt} = \frac{\partial^2}{\partial t^2} I(x,y,t)$ are the first and second partial derivatives of $I(x,y,t)$.
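The terms $a_1, \ldots, a_8$ can be estimated with repeated finite differences on an image sequence; the sketch below uses a synthetic sequence and NumPy's gradient operator (an implementation assumption of ours, not the chapter's):

```python
import numpy as np

# Synthetic 3-frame sequence I(t, y, x); values are random for illustration.
rng = np.random.default_rng(2)
I = rng.normal(size=(3, 16, 16))

It_, Iy_, Ix_ = np.gradient(I)       # first derivatives: a1 = Ix, a2 = Iy
Ixx = np.gradient(Ix_, axis=2)       # a3 = Ixx
Iyy = np.gradient(Iy_, axis=1)       # a4 = Iyy
Ixy = np.gradient(Ix_, axis=1)       # a5 = Ixy
Ixt = np.gradient(Ix_, axis=0)       # a6 = Ixt
Iyt = np.gradient(Iy_, axis=0)       # a7 = Iyt
Itt = np.gradient(It_, axis=0)       # a8 = Itt

# Mixed partials commute: differencing along x then y gives the same
# result (up to floating point) as y then x.
Iyx = np.gradient(Iy_, axis=2)
```

The commutation check is a useful sanity test when wiring up these derivative estimates, since an axis-ordering bug breaks it immediately.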

In view of the small variations present and accumulated along the vector field associated with the optical flow, which cause an additional error in equation (13), a regularization adjustment was made, given by equation (14):

$$\xi\_c^2 = \left(\frac{\partial v\_x}{\partial x}\right)^2 + \left(\frac{\partial v\_x}{\partial y}\right)^2 + \left(\frac{\partial v\_y}{\partial y}\right)^2 + \left(\frac{\partial v\_y}{\partial x}\right)^2 \tag{14}$$

Thus, combining equations (13) and (14), the error *ξ* can be minimized through equation (15):

$$\iint \left(\xi\_a^2 + \alpha^2 \xi\_c^2\right) dxdy \tag{15}$$

where *α* is the weight required for smoothing the variation of the associated optical flow. Thus, to obtain *vx* = *vx*(*x*, *y*) and *vy* = *vy*(*x*, *y*), using the resources of variational calculus one has:

$$\begin{cases} 2a_3 v_x + a_5 v_y = \alpha^2 \nabla^2 v_x - b_1 \\ a_5 v_x + 2a_4 v_y = \alpha^2 \nabla^2 v_y - b_2 \end{cases} \tag{16}$$

where ∇<sup>2</sup>*vx* and ∇<sup>2</sup>*vy* are the Laplacians of *vx* and *vy*, and the coefficients *b*1, *b*2 can be given as:

$$\begin{cases} b_1 = \frac{\partial a_1}{\partial t} + a_6 \\ b_2 = \frac{\partial a_2}{\partial t} + a_7 \end{cases} \tag{17}$$
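As a sketch of the variational step behind equations (16) and (17): treating (15) as a functional of *vx* and *vy*, the Euler–Lagrange conditions require, for each velocity component,

$$\frac{\partial \xi_a^2}{\partial v_x} - 2\alpha^2 \nabla^2 v_x = 0, \qquad \frac{\partial \xi_a^2}{\partial v_y} - 2\alpha^2 \nabla^2 v_y = 0,$$

since, after integration by parts, the smoothness term (14) contributes the Laplacians of *vx* and *vy*. Expanding the data term *ξa*² with the coefficients *a*1, … , *a*8 then produces the linear system (16).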

and replacing the coefficients *Ix*, *Iy*, *Ixx*, *Iyy*, *Ixy*, *Ixt*, *Iyt*, *Itt* in equation (16), one has:

$$I_{xx} v_x + I_{xy} v_y = \left(\frac{\alpha^2}{3}\right) \nabla^2 v_x - I_{xt} \tag{18}$$

$$I_{xy} v_x + I_{yy} v_y = \left(\frac{\alpha^2}{3}\right) \nabla^2 v_y - I_{yt} \tag{19}$$

whereas ∇<sup>2</sup>*vx* ≈ *v̄x* − *vx* and ∇<sup>2</sup>*vy* ≈ *v̄y* − *vy* are the Laplacians of equations (18) and (19) in their discretized digital forms, with *v̄x* and *v̄y* denoting local neighborhood averages. Together with equation (20),

$$\lambda = \left(\frac{\alpha^2}{3}\right) \tag{20}$$

it is possible to reduce the system, using equation (17), to:

$$\left[\lambda^2\left(I_{xx} + I_{yy} + \lambda^2\right) + \kappa\right] v_x = \lambda^2\left(I_{yy} + \lambda^2\right)\overline{v_x} - \lambda^2 I_{xy}\,\overline{v_y} + c_1 \tag{21}$$

$$\left[\lambda^2\left(I_{xx} + I_{yy} + \lambda^2\right) + \kappa\right] v_y = -\lambda^2 I_{xy}\,\overline{v_x} + \lambda^2\left(I_{xx} + \lambda^2\right)\overline{v_y} + c_2 \tag{22}$$

where the term *κ* = *IxxIyy* − *Ixy*² is called the *Gaussian curvature* of the surface. It also holds that:

$$\begin{cases} c_1 = I_{xy} I_{yt} - I_{xt}\left(I_{yy} + \lambda^2\right) \\ c_2 = I_{xy} I_{xt} - I_{yt}\left(I_{xx} + \lambda^2\right) \end{cases} \tag{23}$$

where *c*1 and *c*2 are real constants.
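To make the role of *κ* concrete, the sketch below (an illustrative assumption, not the chapter's code) evaluates the curvature term *κ* = *IxxIyy* − *Ixy*² numerically for the quadratic surface *I*(*x*, *y*) = *x*² + 2*y*², whose second derivatives are constant (*Ixx* = 2, *Iyy* = 4, *Ixy* = 0), so *κ* = 8 away from the grid boundary.

```python
import numpy as np

# Sample the surface I(x, y) = x^2 + 2*y^2 on a uniform grid.
x = np.linspace(-1.0, 1.0, 101)
y = np.linspace(-1.0, 1.0, 101)
X, Y = np.meshgrid(x, y)
I = X**2 + 2.0 * Y**2

# First- and second-order derivatives via repeated central differences.
Ix = np.gradient(I, x, axis=1)
Iy = np.gradient(I, y, axis=0)
Ixx = np.gradient(Ix, x, axis=1)
Iyy = np.gradient(Iy, y, axis=0)
Ixy = np.gradient(Ix, y, axis=0)

# Curvature term kappa = Ixx*Iyy - Ixy^2 (here 2*4 - 0 = 8 in the interior).
kappa = Ixx * Iyy - Ixy**2
print(round(float(kappa[50, 50]), 6))  # interior value, approximately 8.0
```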

Therefore, isolating the terms *vx*, *vy* and substituting *c*1 and *c*2 into equations (21) and (22), respectively, yields equations (24) and (25):

*Object Tracking Using Adapted Optical Flow DOI: http://dx.doi.org/10.5772/intechopen.102863*

$$v_x = \overline{v_x} - \left[\frac{\lambda^2\left(I_{yy} + \lambda^2\right)\overline{v_x} - \lambda^2 I_{xy}\,\overline{v_y} + c_1}{\lambda^2\left(I_{xx} + I_{yy} + \lambda^2\right) + \kappa}\right] \tag{24}$$

$$v_y = \overline{v_y} - \left[\frac{-\lambda^2 I_{xy}\,\overline{v_x} + \lambda^2\left(I_{xx} + \lambda^2\right)\overline{v_y} + c_2}{\lambda^2\left(I_{xx} + I_{yy} + \lambda^2\right) + \kappa}\right] \tag{25}$$

Algorithm 1 presents pseudocode for generating the proposed optical flow field via equations (24) and (25), which allows estimating the speed and position of an object through a sequence of video images.

**Algorithm 1**. Adapted optical flow (Gaussian curvature κ).


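Since the body of Algorithm 1 is not reproduced here, the following Python sketch shows one possible reading of the update equations (24) and (25); the derivative stencils, the 4-neighbor average used for the *v̄* terms, and the iteration count are assumptions, not the authors' exact choices.

```python
import numpy as np

def neighbor_mean(v):
    """4-neighbor average for the v-bar terms (periodic borders for simplicity)."""
    return (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
            np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0

def adapted_flow(frame1, frame2, alpha=15.0, n_iter=50):
    """Curvature-adapted optical flow: iterates equations (24)-(25)."""
    I = frame1.astype(np.float64)
    It = frame2.astype(np.float64) - I           # temporal derivative
    Ix = np.gradient(I, axis=1)                  # spatial derivatives
    Iy = np.gradient(I, axis=0)
    Ixx = np.gradient(Ix, axis=1)
    Iyy = np.gradient(Iy, axis=0)
    Ixy = np.gradient(Ix, axis=0)
    Ixt = np.gradient(It, axis=1)
    Iyt = np.gradient(It, axis=0)

    lam2 = alpha**2 / 3.0                        # lambda, equation (20), squared form
    kappa = Ixx * Iyy - Ixy**2                   # Gaussian curvature term
    c1 = Ixy * Iyt - Ixt * (Iyy + lam2)          # constants, equation (23)
    c2 = Ixy * Ixt - Iyt * (Ixx + lam2)
    denom = lam2 * (Ixx + Iyy + lam2) + kappa
    denom = np.where(np.abs(denom) < 1e-9, 1e-9, denom)  # guard against division by zero

    vx = np.zeros_like(I)
    vy = np.zeros_like(I)
    for _ in range(n_iter):
        vx_bar, vy_bar = neighbor_mean(vx), neighbor_mean(vy)
        vx = vx_bar - (lam2 * (Iyy + lam2) * vx_bar
                       - lam2 * Ixy * vy_bar + c1) / denom          # equation (24)
        vy = vy_bar - (-lam2 * Ixy * vx_bar
                       + lam2 * (Ixx + lam2) * vy_bar + c2) / denom  # equation (25)
    return vx, vy
```

A typical call is `vx, vy = adapted_flow(frame_t_minus_1, frame_t)` on two consecutive grayscale frames.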

#### *2.2.2 Random forests*

Developed by Breiman [63] in the mid-2000s and later revised in [71], random forests are considered one of the best supervised learning methods for data prediction and classification. Due to its simplicity, low computational cost, great potential to deal with large volumes of data, and high accuracy of results, the method has become very popular, being applied in various fields of science such as data science [72], bioinformatics, and ecology, in real-life systems, and in the recognition of 3D objects. In recent years, several studies have been conducted with the objective of making the technique more elaborate and seeking new practical applications [73–75].

Many studies carried out with the aim of narrowing the existing gap between theory and practice can be seen in [58, 76–78]. Among the main components of random forests, one can highlight the bagging method [63] and the classification and regression criterion called *CART-split* [79], which play critical roles.

Bagging (a bootstrap-aggregating contraction) is an aggregation scheme that generates samples from the original dataset through the bootstrap method. These methods are nonparametric, belong to the class of Monte Carlo methods [80], and treat the sample as a finite population. They are used when the distribution of the target population is not specified and the sample is the only information available. A predictor is constructed from each sample, and the decision is made through an average; this is a computationally effective procedure for improving unstable estimates, especially for large sets of high-dimensional data, where finding a good model in one step is impossible due to the complexity and scale of the problem. As for the CART-split criterion, it originates from the CART program [63] and is used in the construction of individual trees to choose the best cuts perpendicular to the axes. However, while bagging and the CART division scheme are key elements of the random forest, both are difficult to analyze mathematically and constitute a very promising field for both theoretical and practical research.

In general, the set of trees is organized in the form {*T*1(Θ1), *T*2(Θ2), … , *TB*(Θ*B*)}, where *TB* is each tree and Θ*B* are bootstrap samples with dimensions *q* × *mtry*, where *mtry* is the number of variables used at each node during the construction of each tree and *q* is approximately 0.67 · *n*. Each of the trees produces a response *y*1,*i* for each of the samples *W*: {*T*1(*W*) = *y*1,*i*, *T*2(*W*) = *y*2,*i*, … , *TB*(*W*) = *yB*,*i*}, and the mean (regression) or the majority vote (classification) of the tree responses is the final response of the model for each sample.
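As a toy sketch of the bagging-and-voting scheme just described, under the assumption of decision stumps standing in for full CART trees (the *mtry* feature subsampling is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            err = int(np.sum(pred != y))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

def bagged_forest(X, y, n_trees=25):
    """Bagging: fit each stump on a bootstrap resample of the original data."""
    stumps = []
    n = len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict(stumps, X):
    """Majority vote (classification) over the ensemble's responses."""
    votes = np.stack([(X[:, j] > t).astype(int) for j, t in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Toy data: class 1 iff the first feature exceeds 0.5.
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
stumps = bagged_forest(X, y)
acc = float(np.mean(predict(stumps, X) == y))
print(acc)  # close to 1.0 on this separable toy problem
```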

## **3. Methodology**

The methodology employed consisted of combining the optical flow algorithm in terms of Gaussian curvature, developed in this work, with the random forest technique. The algorithm was developed in the MATLAB programming language and executed on a 64-bit notebook with an 8th-generation Intel Core i7 processor. The input data is an AVI video, 5 minutes long, of a vehicle and two cyclists circulating in the vicinity of the Costa Nova beach, in Ílhavo, Aveiro, Portugal. The video was fragmented into a set of frames, analyzed two by two by the algorithm to generate the optical flow vector field. The resulting image was then associated with the flow and with a minimal surface region given by the Gaussian curvature. Next, on this surface, the random trees analyzed which vectors presented the characteristics that describe the object's movement in an "optimal" way (see **Figure 2**).

**Figure 2.** *Representative model of operation of a random forest.*
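The frame-by-frame pairing step of the methodology can be sketched as follows; the placeholder flow routine and the in-memory synthetic frames are illustrative assumptions (the chapter's implementation is in MATLAB and reads an AVI file):

```python
import numpy as np

def pairwise_frames(frames):
    """Yield consecutive frame pairs (t-1, t), as in the two-by-two analysis."""
    for prev, curr in zip(frames[:-1], frames[1:]):
        yield prev, curr

def flow_magnitude(prev, curr):
    """Placeholder for the adapted optical flow step: absolute frame difference."""
    return np.abs(curr.astype(np.float64) - prev.astype(np.float64))

# Synthetic 5-frame "video": a bright square moving right one pixel per frame.
frames = []
for t in range(5):
    f = np.zeros((32, 32))
    f[10:16, 5 + t:11 + t] = 1.0
    frames.append(f)

motion_energy = [flow_magnitude(p, c).sum() for p, c in pairwise_frames(frames)]
print(len(motion_energy))  # 4 pairs from 5 frames
```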


After the process of analyzing the movement of the objects was finished, the execution times and accuracy of the results obtained by the proposed algorithm were compared with those of the Lucas and Kanade (with and without Gaussian filter), Horn and Schunck, and Farneback algorithms, allowing the results obtained to be validated. After that, the implementation of the developed algorithm began.

## **4. Results**

**Figure 3** shows the vehicle and the two cyclists used to collect the images from which the results proposed in this work were obtained. On the right side, a graphical representation of the optical flow vector field generated by a sequence of two consecutive frames, over the 5 minutes of video, is shown.

On the right side of **Figure 3**, the optical flow associated with the movement of the vehicle between the instants (*t* − 1) and *t* is represented. Note that the vector representation of this field was performed in such a way that the vectors generated by the field were superimposed along the horizontal direction of the figure's central axis. Although other objects were present at the site, namely two cyclists and a car in the upper left corner, the object of interest considered was the vehicle close to the cyclists. This is shown on the right side of **Figure 3** by the layout of this horizontal arrangement of vectors, which indicates whether the current movement and the predicted movement of the considered object are to the left or to the right.

The region with the highest horizontal vector density in **Figure 3** is located on the left side, in blue. The number of vectors in this region, despite their spacing, is greater from the center to the left than on the right side. It is also possible to visually evaluate the movement behavior of the considered objects: the region containing the higher vector density corresponds to the current direction in which the object is heading and to its predicted displacement. This vector density increases towards the left side, passing through the central part, coming from the right, clearly indicating the direction of movement of the object, that is, the object moves to the left. In **Figure 4**, this process can be seen more clearly.

#### **Figure 3.**

*(a) Left side: vehicle displacement between the instants (t − 1) and t. (b) Right side: representation of the corresponding optical flow.*

**Figure 4.** *Prediction and actual displacement of the object obtained through the optical flow.*

In a similar way to **Figure 3**, on the right side of **Figure 5** the optical flow generated by the displacement of the moving vehicle between the instants *t* and (*t* + 1) is represented. The vector representation of this field was again performed so that its vectors were superimposed along the horizontal axis of the figure, creating a vector density by this superposition. The object of interest considered remains the vehicle close to the cyclists. As on the right side of **Figure 3**, the arrangement of the horizontal vectors also indicates the current movement and whether the predicted movement is to the left or to the right.

It is possible to observe a small increase in the vector density to the left, which has a great influence on determining the real and predicted positions of the object in the considered time intervals. The object continues its actual movement to the left, as does its predicted movement. It showed, however, a slight shift to the left (the direction where the cyclists are).

In **Figure 6**, a small variation of the optical flow is observed again in the movement between the instants (*t* + 1) and (*t* + 2). In this figure, the vehicle is next to the cyclists, both in the opposite direction to the vehicle. The movement of the vehicle continues without great variation in direction, posing no danger to the cyclists or to other vehicles traveling in the opposite direction on its left side.


In **Figure 7**, there was no optical flow variation in the movement between the instants (*t* + 2) and (*t* + 3). In this figure, the vehicle can be seen as it passes the two cyclists. The non-significant variation in the optical flow vector field, keeping the number of vectors higher on the left side, is associated with the maintenance of the object's direction of movement, that is, it continues to move on the left side.

In **Figure 8**, the vehicle can be seen completely overtaking the two cyclists and approaching another vehicle in the opposite direction in the upper-left part of the image.

#### **Figure 6.**

*Object remains on the right side, but with a slight shift to the right and offset estimate to the left.*

**Figure 7.** *Object moving and keeping on the left for consecutive frames.*

**Figure 8.** *Object with unchanged offset pattern.*

The variation of the optical flow vector field remains the same. This indicates that the vehicle continues its trajectory to the left of the cyclists, without posing a danger of collision to the other vehicle in the opposite direction.

## **5. Analysis and discussing the results**

This item shows how the evaluation of the performance and accuracy of the proposed algorithm was carried out in relation to the Lucas and Kanade algorithms (with and without Gaussian filter), Horn and Schunck, and Farneback.

The algorithm shows on the display, in real time, the displacement of the object on the right side and the set of vectors capable of representing the movement, instantaneous or accumulated, indicating the tendency of the direction that the object should follow. This process was carried out in a similar way with the other algorithms to make their comparison possible. The behavior of the proposed algorithm and of the others is shown graphically.

The technique developed in this work generates an optical flow that takes important geometric properties into account, allowing the identification of similar categories of moving objects with the same characteristics. These geometric properties are intrinsically associated with the curvature of the object's surface in three-dimensional space, the Gaussian curvature, applied here to a 2D image.

The modified optical flow, considering these properties, generated a dense flow describing a track on the 2D plane, which allowed tracking the movement of the considered object. In the same **Figure 8**, it is possible to observe that at each time interval in which the object was monitored, the dispositions of the vectors to the left and right sides, as shown in **Figures 3**–**7**, were responsible for drawing the track associated with the displacement, which allowed tracking the object as it moves.

**Figure 9** shows the vehicle that, when moving, generated the optical flow. In **Figures 10** and **11**, the variations of the optical flow between two time intervals, Δ*ti* and Δ*tn* (*i* < *n*), are shown. In this way, the algorithm allowed tracking the progressive movement of the object (movement adopted as progressive in this work) and, as this happens, predicting in which direction it is moving, that is, to the left, to the right, or keeping a straight line.

**Figure 9.** *Variation of the optical flow of the moving object.*


**Figure 10.** *Vehicle movement.*

**Figure 11.** *Object moving to the left side.*

**Figure 12.** *Object moving to the right side.*

## **5.1 Optical flow algorithm in terms of the curvature**

In the following items, the implementations of the Lucas and Kanade algorithms with and without a Gaussian filter, Horn and Schunck, and Farneback are shown, using as input the same sequence of video images used by the algorithm developed in this work. For each, the performance and accuracy obtained are verified.

## **5.2 Lucas and Kanade algorithm without Gaussian filter**

**Figure 13.** *Variation of the optical flow of the moving object.*

**Figure 14.** *Vehicle movement.*

**Figure 15.** *Object moving to the left side.*

**Figure 16.** *Object moving to the right side.*

## **5.3 Lucas and Kanade Algorithm with Gaussian filter**

**Figure 17.** *Variation of the optical flow of the moving object.*

**Figure 18.** *Vehicle movement.*

**Figure 19.** *Object moving to the left side.*

**Figure 20.** *Object moving to the right side.*

## **5.4 Horn and Schunck algorithm**

**Figure 21.** *Variation of the optical flow of the moving object.*

**Figure 23.** *Object moving to the right side.*

**Figure 24.** *Object moving to the left side.*

## **5.5 Farneback algorithm**

**Figure 25.** *Variation of the optical flow of the moving* object.

**Figure 26.** *Vehicle movement.*

**Figure 27.** *Object moving to the left side.*

**Figure 28.** *Object moving to the right side.*

For each of the five algorithms, one frame containing four panels is shown: two upper and two lower. In each frame, the upper-left panel shows the variation of the vector field between two frames, while the upper-right panel corresponds to the variation of the object's movement in real time. The lower panels, except for the proposed algorithm, correspond to the number of points on the right or on the left, and the movement occurs toward the side with the greater number of points. In the case of the proposed algorithm, the process takes place through the analysis of vector density: the side with the greater vector density is the side toward which the movement is occurring (see **Figures 12**–**20**).
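As a minimal sketch of this density-based decision (the magnitude threshold and the half-image split are assumptions, not the chapter's exact criterion):

```python
import numpy as np

def movement_direction(vx, vy, min_mag=1e-3):
    """Decide direction from the density of significant flow vectors per half-image."""
    mag = np.hypot(vx, vy)
    h, w = mag.shape
    significant = mag > min_mag                 # keep only meaningful vectors
    left = int(significant[:, : w // 2].sum())  # vector count, left half
    right = int(significant[:, w // 2 :].sum()) # vector count, right half
    return "left" if left > right else "right"

# Toy field: strong motion concentrated in the left half of the image.
vx = np.zeros((20, 40))
vx[:, :15] = -2.0
vy = np.zeros_like(vx)
print(movement_direction(vx, vy))  # left
```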

Comparing the results presented by the algorithms, it is observed that in the developed model it was possible to see a dense vector trail of the object, with a slight tendency of displacement to the left as it continues its movement. In the other models this was not possible, and it was necessary to resort to the point count in the lower panel. This process is also possible in the proposed model but is not necessary, which means a reduction in computational cost (see **Figures 21**–**28**).

Comparing the results, it is observed that the Farneback algorithm also presents a high vector density. The proposed model, however, presents a well-defined vector trail, which makes the point count in the lower panel unnecessary; this is not the case for the Farneback algorithm, indicating a higher computational cost, which can affect its accuracy when compared to the proposed algorithm.

The Horn and Schunck algorithm presents a low vector density when compared to the proposed algorithm, which indicates lower accuracy.

Although the two Lucas and Kanade techniques are faster, indicating a lower computational cost than the proposed algorithm, their low vector density results in lower precision in relation to the proposed method.

## **6. Final considerations**

The proposed method presented good results, proving to be accurate and of reasonable speed, which allows its application to critical, real-world problems. However, it presented limitations when compared with the Lucas and Kanade model with a Gaussian filter, which is faster and also presents good accuracy.

The proposed method reached only approximately 50% of the execution speed of the Lucas and Kanade method, which motivates further improvements. The technique presented can be applied to other fields of research, such as cardiology, as it offers great precision over small regions, which is important for applications aimed at predicting infarctions. Its contribution to the state of the art is the characterization of the optical flow in terms of Gaussian curvature, which connects fields of research such as computational vision and differential geometry.

## **Acknowledgements**

The authors of this work would like to thank the Institute of Electronics and Informatics Engineering of Aveiro, the Telecommunications Institute of Aveiro, and the University of Aveiro for the financial, technical-administrative, and structural support that allowed the accomplishment of this work.

## **Author details**

Ronaldo Ferreira<sup>1</sup> \*, Joaquim José de Castro Ferreira<sup>1</sup> and António José Ribeiro Neves<sup>2</sup>

1 University of Aveiro/Telecommunications Institute, Aveiro, Portugal

2 University of Aveiro/Institute of Electronics and Informatics Engineering of Aveiro, Portugal

\*Address all correspondence to: ronaldoferreira@ua.pt; an@ua.pt

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Gonzalez RC, Woods RE. Digital Image Processing. 2002

[2] Abbass MY, Kwon KC, Kim N, Abdelwahab SA, El-Samie FEA, Khalaf AA. A survey on online learning for visual tracking. The Visual Computer. 2020:1-22

[3] Khalid M, Penard L, Memin E. Application of optical flow for river velocimetry. International Geoscience and Remote Sensing Symposium. 2017: 6265-6246

[4] Kastrinaki V, Zervakis M. A survey of video processing techniques for traffic applications. Image and Vision Computing. 2003;**21**(4): 359-381

[5] Almodfer R, Xiong S, Fang Z, Kong X, Zheng S. Quantitative analysis of lanebased pedestrian-vehicle conflict at a non-signalized marked crosswalk. Transportation Research Part F: Traffic Psychology and Behaviour. 2016;**42**: 468-468

[6] Tian B, Yao Q, Gu Y, Wang K, Li Y. Video processing techniques for traffic flow monitoring: A survey. In: ITSC. IEEE; 2011

[7] Laurense VA, Goh JY, Gerdes JC. Path-tracking for autonomous vehicles at the limit of friction. In: ACC. IEEE; 2017. p. 56665591

[8] Yilmaz A, Javed O, Shah M. Object tracking: A survey. ACM Computing Surveys. 2006;**38**(2006):13

[9] Veenman C, Reinders M, Backer E. Resolving motion correspondence for densely moving points. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2001;**23**(1):54-72

[10] Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep Learning. Vol. 1. Massachusetts, USA: MIT Press; 2016

[11] Santos Junior JMD. Analisando a viabilidade de deep learning para reconhecimento de ações em datasets pequenos. 2018

[12] Kelleher JD. Deep Learning. MIT Press; 2019

[13] Xiong Q, Zhang J, Wang P, Liu D, Gao RX. Transferable two-stream convolutional neural network for human action recognition. Journal of Manufacturing Systems. 2020;**56**: 605-614

[14] Khan MA, Sharif M, Akram T, Raza M, Saba T, Rehman A. Handcrafted and deep convolutional neural network features fusion and selection strategy: An application to intelligent human action recognition. Applied Soft Computing. 2020;**87**(73):74986

[15] Abdelbaky A, Aly S. Human action recognition using three orthogonal with unsupervised deep convolutional neural network. Multimedia Tools and Applications. 2021;**80**(13):20019-20065

[16] Rani SS, Naidu GA, Shree VU. Kinematic joint descriptor and depth motion descriptor with convolutional neural networks for human action recognition. Materials Today: Proceedings. 2021;**37**:3164-3173

[17] Farnebäck G. Two-frame motion estimation based on polynomial expansion. In: Proceedings of the Scandinavian Conference on Image Analysis (SCIA). 2003. pp. 363-370

[18] Wang Z, Xia C, Lee J. Group behavior tracking of Daphnia magna based on motion estimation and appearance models. Ecological Informatics. 2021;**61**:7278

[19] Lin W, Hasenstab K, Cunha GM, Schwartzman A. Comparison of handcrafted features and convolutional neural networks for liver MR image adequacy assessment. Scientific Reports. 2020;**10**(1):1-11

[20] Xu Y, Zhou X, Chen S, Li F. Deep learning for multiple object tracking: A survey. IET Computer Vision. 2019; **13**(4):355-368

[21] Pal SK, Pramanik A, Maiti J, Mitra P. Deep learning in multi-object detection and tracking: State of the art. Applied Intelligence. 2021:1-30

[22] Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, et al. A survey of deep learningbased object detection. IEEE Access. 2019;**7**:51837-51868

[23] Pal SK, Bhoumik D, Chakraborty DB. Granulated deep learning and z-numbers in motion detection and object recognition. Neural Computing Applied. 2020;**32**(21): 16533-16555

[24] Chung D, Tahboub K, Delp EJ. A two stream siamese convolutional neural network for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. pp. 1983-1671

[25] Choi H, Park S. A survey of machine learning-based system performance optimization techniques. Applied Sciences. 2021;**11**(7):3235

[26] Abdulkareem NM, Abdulazeez AM. Machine learning classification based on Radom Forest algorithm: A review. International Journal of Science and Business. 2021;**5**(2):51-142

[27] Iwendi C, Jo O. COVID-19 patient health prediction using boosted random Forest algorithm. Frontiers in Public Health. 2020;**8**:9

[28] Dolejš M. Generating a spatial coverage plan for the emergency medical service on a regional scale: Empirical versus random forest modelling approach. Journal of Transport Geography. 2020:10 Available from: https://link.springer.com/ book/10.687/978-981-15-0637-6

[29] Reis I, Baron D, Shahaf S. Probabilistic random forest: A machine learning algorithm for Noisy data sets. The Astronomical Journal. 2018;**157**(1): 16. DOI: 10.38/1538-3881/aaf69

[30] Thomas B, Thronson H, Buonomo A, Barbier L. Determining research priorities for astronomy using machine learning. Research Notes of the AAS. 2022;**6**(1):11

[31] Yoo S, Kim S, Kim S, Kang BB. AI-HydRa: Advanced hybrid approach using random forest and deep learning for malware classification. Information Sciences. 2021;**546**:420-655

[32] Liu C, Gu Z, Wang J. A hybrid intrusion detection system based on scalable K-means+ random Forest and deep learning. IEEE Access. 2021;**9**: 75729-75740

[33] Paschos G. Perceptually uniform color spaces for color texture analysis: An empirical evaluation. IEEE Transactions on Image Processing. 2001; **10**:932-937

[34] Estrada FJ, Jepson AD. Benchmarking image segmentation algorithms. International Journal of Computer Vision. 2009;**56**(2):167-181

[35] Jaiswal JK, Samikannu R. Application of random forest algorithm on feature

subset selection and classification and regression. In: 2017 World Congress on Computing and Communication Technologies (WCCCT). IEEE; 2017. pp. 65-68

[36] Menezes R, Evsukoff A, González MC, editors. Complex Networks. Springer; 2013

[37] Jeong C, Yang HS, Moon K. A novel approach for detecting the horizon using a convolutional neural network and multi-scale edge detection. Multidimensional Systems and Signal Processing. 2019;**30**(3): 1187-1654

[38] Liu YJ, Tong SC, Wang W. Adaptive fuzzy output tracking control for a class of uncertain nonlinear systems. Fuzzy Sets and Systems. 2009;**160**(19): 2727-2754

[39] Beckmann M, Ebecken NF, De Lima BSP. A KNN undersampling approach for data balancing. Journal of Intelligent Learning Systems and Applications. 2015;**7**(04):72

[40] Yoriyaz H. Monte Carlo method: Principles and applications in medical physics. Revista Brasileira de Física Médica. 2009;**3**(1):141-149

[41] Wang X. Intelligent multicamera video surveillance: A review. Pattern Recognition Letters. 2013;**34**(1): 3-19

[42] Wu J, Rehg JM. CENTRIST: A visual descriptor for scene characterization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;**33**(8): 1559-1501

[43] Cremers D, Schnorr C. Statistical shape knowledge in variational motion segmentation. Israel Network Capital Journal. 2003;**21**:77-86

[44] Siegelman N, Frost R. Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language. 2015;**81**(73):74-65

[45] Kim IS, Choi HS, Yi KM, Choi JY, Kong SG. Intelligent visual surveillance —A survey. International Journal of Control, Automation, and Systems. 2010;**8**(5):926-939

[46] Chan KL. Detection of swimmer using dense optical flow motion map and intensity information. Machine Vision and Applications. 2013;**24**(1):75-69

[47] Szpak ZL, Tapamo JR. Maritime surveillance: Tracking ships inside a dynamic background using a fast levelset. Expert System with Applications. 2011;**38**(6):6669-6680

[48] Fefilatyev S, Goldgof D, Shceve M, et al. Detection and tracking of ships in open sea with rapidly moving buoymounted camera system. Ocean-Engineering. 2012;**54**(1):1-12

[49] Frost D, Tapamo J-R. Detection and tracking of moving objects in a maritime environment with level-set with shape priors. EURASIP Journal on Image and Video Processing. 2013;**1**(42):1-16

[50] Collins RT, Lipton AJ, Kanade T, et al. A System for Video Surveillance and Monitoring. Technical Report. Pittsburg: Carnegie Mellon University; 2000

[51] Viola P, Jones MJ. Robust real-time face detection. International Journal of Computer Vision. 2004;**57**(2):63-154

[52] Rodrigues-Canosa GR, Thomas S, Cerro J, et al. Real-time method to detect and track moving objects (DATMO) from unmanned aerial vehicles (UAVs) using a single camera. Remote Sensing. 2012;**4**(4):770-341


[53] Frakes D, Zwart C, Singhose W. Extracting moving data from video optical flow with Fhysically-based constraints. International Journal of Control, Automation and Systems. 2013; **11**(1):55-57

[54] Sun K. Robust detection and tracking of long-range target in a compound framework. Journal of Multimedia. 2013;**8**(2):98

[55] Kravchenko P, Oleshchenko E. Mechanisms of functional properties formation of traffic safety systems. Transportation Research Procedia. 2017; **20**:367-372

[56] Lucas BD, Kanade T. An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence. 1981

[57] Gong Y, Tang W, Zhou L, Yu L, Qiu G. A discrete scheme for computing Image's weighted Gaussian curvature. IEEE International Conference on Image Processing (ICIP). 2021;**2021**: 1919-1923. DOI: 10.1109/ ICIP42928.2021.9506611

[58] Hooker G, Mentch L. Bootstrap bias corrections for ensemble methods. arXiv preprint arXiv:1506.00553. 2015

[59] Tran T. Semantic Segmentation Using Deep Neural Networks for MAVs. 2022

[60] Horn BKP, Schunck BG. Determining optical flow. Artificial Intelligence. 1981;**17**:185-203

[61] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 EEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE; 2005. pp. 886-893

[62] Rosten E, Drummond T. Fusing points and lines for high performance tracking. In: 10th IEEE International Conference on Computer Vision. Vol. 2. Beijing, China; 2005. pp. 1508-1515

[63] Smolka B, Venetsanopoulos AN. Noise reduction and edge detection in color images. In: Color Image Processing. CRC Press. 2018. pp. 95-122

[64] Li L, Leung MK. Integrating Intensity and Texture Differences for Robust Change. 2002

[65] Shi J, Tomasi C. Good features to track. In: 9th IEEE Conference on Computer Vision and Pattern Recognition. Seattle WA, USA; 1674. pp. 593-600

[66] Cucchiara R, Prati A, Vezzani R. Advanced video surveillance with pan tilt zoom cameras. In: Proceedings of the 6th IEEE International Workshop on Visual Surveillance. Graz, Austria; 2006

[67] Li J, Wang Y, Wang Y. Visual tracking and learning using speeded up robust features. Pattern Recognition Letters. 2012;**33**(16):2094-2269

[68] Fernandez-Caballero A, Castillo JC, Martinez-Cantos J, et al. Optical flow or image subtraction in human detection from infrared camera on Mobile robot. Robotics and Autonomous Systems. 2010;**66**(12):503-511

[69] Frakes D, Zwart C, Singhose W. Extracting moving data from video optical flow with physically-based constraints. International Journal of Control, Automation and Systems. 2013; **11**(1):55-57

[70] Revathi R, Hemalatha M. Certain approach of object tracking using optical flow techniques. International Journal of Computer Applications. 2012;**53**(8):50-57 [71] Breiman L. Consistency for a Simple Model of Random Forests. 2004

[72] Biau G, Devroye L, Lugosi G. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research. 2008;**9**(9)

[73] Meinshausen N, Ridgeway G. Quantile regression forests. Journal of Machin Learning Research. 2006;**7**(6)

[74] Ishwaran H, Kogalur UB. Consistency of random survival forests. Statistics & Improbability Letters. 2010; **80**(13–14):746-744

[75] Biau G. Analysis of a random forests model. The Journal of Machine Learning Research. 2012;**13**(1):743-775

[76] Genuer R. Variance reduction in purely random forests. Journal of Nonparametric Statistics. 2012;**24**(3): 565-562

[77] Wager S. Asymptotic theory for random forests. arXiv preprint arXiv: 1405.0352. 2014

[78] Scornet E, Biau G, Vert JP. Consistency of random forests. The Annals of Statistics. 2015;**65**(4): 1716-1741

[79] Murphy KP. Machine Learning: A Probabilistic Perspective. MIT Press; 2012

[80] Yoriyaz H. Monte carlo method: Principles and applications in medical physics. Revista Brasileira de Física Médica. 2009;**3**(1):141-149

## **Chapter 2**

## Siamese-Based Attention Learning Networks for Robust Visual Object Tracking

*Md. Maklachur Rahman and Soon Ki Jung*

## **Abstract**

Tracking with the siamese network has recently gained enormous popularity in visual object tracking through its template-matching mechanism. However, using only the template-matching process is insufficient for robust target tracking because of its inability to learn better discrimination between target and background. Several attention-learning mechanisms have been introduced into the underlying siamese network to enhance the target feature representation, which helps to improve the discrimination ability of the tracking framework. The attention mechanism is beneficial for focusing on particular target features by utilizing relevant weight gains. This chapter presents an in-depth overview and analysis of attention learning-based siamese trackers. We also perform extensive experiments to compare state-of-the-art methods. Furthermore, we summarize our study by highlighting the key findings to provide insights into future visual object tracking developments.

**Keywords:** visual object tracking, siamese network, attention learning, deep learning, single object tracking

## **1. Introduction**

Visual object tracking (VOT) is one of the fundamental problems and active research areas of computer vision. It is the process of determining the location of an arbitrary object from video sequences. A target with a bounding box is given for the very first frame of the video, and the model predicts the object's location with height and width in the subsequent frames. VOT has a wide range of vision-based applications, such as intelligent surveillance [1], autonomous vehicles [2], game analysis [3], and human-computer interface [4]. However, it remains a complicated process due to numerous nontrivial challenging aspects, including background clutter, occlusion, fast motion, motion blur, deformation, and illumination variation.

Many researchers have proposed VOT approaches to handle these challenges. Deep features are now used more than handcrafted features, such as the scale-invariant feature transform (SIFT), histograms of oriented gradients (HOG), and local binary patterns (LBP), to solve the tracking problem, and they perform better against several challenges. Convolutional neural networks (CNN), recurrent neural networks (RNN), autoencoders, residual networks, and generative adversarial networks (GAN) are some popular approaches used to learn deep features for solving vision problems. Among them, CNN is used the most because of its simple feed-forward process and better performance on several computer vision applications, such as image classification, object detection, and segmentation. Although CNN has had massive success in solving vision problems, tracking performance has not improved as much because of the difficulty of obtaining adequate training data for end-to-end training of the CNN structure.

In recent years, tracking by detection and template matching have been the two major approaches for developing a reliable tracking system. VOT is treated as a classification task in tracking-by-detection approaches. The classifier learns to identify the target from the background scene and is then updated based on prior frame predictions. The deep-feature correlation filter-based trackers such as CREST [5], C-COT [6], and ECO [7], as well as the deep network-based tracker MDNet [8], follow the tracking-by-detection strategy. These trackers' performance depends on online template-updating mechanisms, which are time-consuming and force the trackers to compromise real-time speed. Besides, the classifier is prone to overfitting on recent frame results.

However, techniques relying on template matching using metric learning extract the target template and choose the most similar candidate patch in the current frame. Siamese-based trackers [9–15] follow the template-matching strategy, which uses cross-correlation to reduce computational overhead and solve the tracking problem effectively. The siamese-based tracker SiamFC [9] has gained immense popularity in the tracking community. It constructs a fully convolutional, Y-shaped, double-branch network, one branch for the target template and another for the subsequent frames of the video, which learns through parameter sharing. SiamFC utilizes offline training on many datasets and performs testing in an online manner. It does not use any template-updating mechanism to adapt the target template for upcoming frames. This particular mechanism is beneficial for fast tracking but prone to weak discrimination due to the static nature of the template branch.

Focusing on the crucial features is essential to improve tracker discrimination ability. The attention mechanism [16] helps to improve feature representation ability and can focus on particular features. Many siamese-based trackers have adopted attentional features inside the feature extraction module. SA-Siam [11] presents two siamese networks that work together to extract both global and semantic-level information with channel attention feature maps. SCSAtt [10] incorporates a stacked channel and spatial attention mechanism to improve tracking effectively. To improve tracker discriminative capacity and flexibility, RASNet [13] combines three attention modules.

This chapter focuses on how the attention mechanism has evolved within the siamese-based tracking framework to improve overall performance by employing simple network modules. We present different types of attention-based siamese object trackers to compare and evaluate their performance. Furthermore, we include a detailed experimental study and performance comparison among attentional and non-attentional siamese trackers on the most popular tracking benchmarks, including OTB100 [17, 18] and OTB50 [17, 18].

## **2. Related works**

#### **2.1 Tracking with siamese network**

Siamese-based trackers gained great attention in the tracking community after SiamFC [9] was proposed, which performs at 86 frames per second (FPS). SiamFC

#### *Siamese-Based Attention Learning Networks for Robust Visual Object Tracking DOI: http://dx.doi.org/10.5772/intechopen.101698*

utilizes a fully convolutional parallel network that takes two input images, one for the target frame and another for the subsequent frames of the video. A simple cross-correlation layer is integrated at the end of the fully convolutional parallel branches to perform template matching. Based on the matching, a similarity score map or response map is produced. The maximum score point on the 2D similarity map denotes the target location in the search frame. Notably, the siamese network was first introduced for signature verification [19].

Before SiamFC was introduced, siamese-based approaches were not very popular for solving tracking problems. SINT [20] is considered one of the earliest siamese-based trackers, but it did not operate in real time (about 2 FPS). Around the same time, another siamese-based tracker named GOTURN [21] utilized a relative motion estimation solution to address tracking as a regression problem. Many subsequent studies on siamese trackers [20, 22–25] have since been introduced to improve overall tracking performance. CFnet [23] employs a correlation-based filter in the template branch of SiamFC after performing feature extraction, in a closed-form equation. SiamMCF [26] considers the response maps of multiple layers using the cross-correlation operation and finally fuses them into a single score map to predict the target location. SiamTri [24] introduces a triplet-loss-based siamese tracker that uses discriminative features rather than a pairwise loss to link the template and search images more effectively. DSiam [25] uses online training with the extracted background information to suppress target appearance changes.

#### **2.2 Tracking with attention network**

The attention mechanism is beneficial for enhancing model performance. It works by focusing on the most salient information. This mechanism is widely used in several fields of computer vision, including image classification [16], object detection [27], segmentation [28], and person re-identification [29]. Similarly, visual tracking frameworks [10, 11, 13–15] adopt attention mechanisms to highlight the target features. This technique enables the model to handle challenges in tracking. SCSAtt [10] utilizes a stacked channel-spatial attention learning mechanism to determine and locate the target information by answering "what" and "where" the maximum similarity of the target object is. RASNet [13] employs multiple attentions together to augment the adaptability and discriminative ability of the tracker. IMG-Siam [14] uses a superpixel-based segmentation matting technique to fuse the target after computing channel-refined features, improving the target's overall appearance information. SA-Siam [11] considers a channel attention module in the semantic branch of its framework to improve discrimination ability. FICFNet [30] integrates a channel attention mechanism in both branches of the siamese architecture and improves the baseline feature refinement strategy to improve tracking performance. IRCA-Siam [31] incorporates several noises [32, 33] into its input features during offline training to improve the overall generalization ability of the network.

Moreover, long short-term memory (LSTM) models also employ attention learning to refine important features through read and write operations. MemTrack [34] and MemDTC [35] use an attentional LSTM-based memory network to update the target template during tracking. Temporal feature-based attention for visual tracking was introduced by FlowTrack [36], which exploits temporal information about the target.

## **3. Methodology**

This section discusses how siamese-based tracking frameworks integrate attention mechanisms, which help to improve overall tracking performance. Before going into the details of attention integration, the underlying siamese architecture for tracking is discussed.

#### **3.1 Baseline siamese network for visual tracking**

The siamese network is a Y-shaped, parallel, double-branch network that learns through parameter sharing. At the end of the parallel CNN branches, a similarity score between the two branches is calculated. In siamese-based tracking frameworks, SiamFC [9] is popularly considered the baseline. It computes a response map as a similarity score by calculating the cross-correlation between the target and search images. The highest-scoring point of the response map represents the corresponding target location in the search image.

**Figure 1** shows the basic siamese object tracking framework, where *z* and *x* denote the target and search images, respectively. The solid block represents the fully convolutional network, which learns through parameter sharing between two branches.

The baseline siamese-based tracker, SiamFC, can be defined mathematically as.

$$R(z,x) = \psi(z) \* \psi(x) + b \cdot \mathbf{1},\tag{1}$$

where $R(z, x)$ denotes the cross-correlation-based similarity score map, called the response map; $\psi(z)$ and $\psi(x)$ represent the fully convolutional feature maps for the target image and search image, respectively; $*$ stands for the cross-correlation operation between the two feature maps; and $b \cdot \mathbf{1}$ denotes a bias value added at every position of the response map $R(z, x)$. The baseline siamese tracker solves this closed-form equation and learns through parameter sharing. It can run at real-time speed but cannot handle tracking challenges properly due to its lack of discriminative ability. Therefore, the attention mechanism

**Figure 1.** *The basic siamese-based visual object tracking framework.*

comes into action to improve the overall tracker accuracy by handling challenging scenarios.
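To make Eq. (1) concrete, the sketch below slides precomputed template features over search features to build a response map and then takes the arg-max as the target position. This is a minimal NumPy illustration under stated assumptions, not the SiamFC implementation: the feature maps are assumed to come from the shared fully convolutional backbone, and the function names are hypothetical.

```python
import numpy as np

def cross_correlation(template_feat, search_feat):
    """Slide the template feature map over the search feature map
    and return the similarity (response) map of Eq. (1)."""
    th, tw, _ = template_feat.shape
    sh, sw, _ = search_feat.shape
    response = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[i:i + th, j:j + tw, :]
            response[i, j] = np.sum(window * template_feat)
    return response

def locate_target(response, bias=0.0):
    """Add the uniform bias term (b * 1) and take the arg-max of
    the response map as the predicted target position."""
    scores = response + bias
    return np.unravel_index(np.argmax(scores), scores.shape)
```

In a real tracker, the two inputs would be CNN feature maps and the arg-max position would be mapped back to image coordinates; deep-learning frameworks also implement this correlation as a single convolution for speed.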

#### **3.2 Siamese attention learning network for visual tracking**

Human visual perception inspires the attention learning network; instead of focusing on the whole scene, the network needs to learn the essential parts of the scene. During feature extraction, a CNN learns through its depth of channels, and each channel is responsible for learning different features of the object. Attention networks learn to prioritize the object's trivial and nontrivial parts by using each channel's feature weight gain. As explained in the studies by He et al., Rahman et al., Wang et al., and Fiaz et al. [11–13, 15], the attention mechanism greatly enhances siamese-based tracking frameworks, which can then differentiate between foreground and background in an image. It helps to improve the overall discriminative ability of the tracking framework by learning various weight gains on different areas of the target to focus on the nontrivial parts and suppress the trivial parts.

Integrating attention mechanisms into the siamese network is one of the important factors for improving tracker performance. There are three common approaches to integrating attention mechanisms into a siamese-based tracking framework: (a) attention on the template feature map, (b) attention on the search feature map, and (c) attention on both feature maps. When an attention mechanism is integrated into the siamese tracker, the attention-based tracker can be defined by altering the baseline equation as

$$R(z,x) = A(\psi(z)) \* \psi(x) + b \cdot \mathbf{1},\tag{2}$$

$$R(z,x) = \psi(z) \* A(\psi(x)) + b \cdot \mathbf{1},\tag{3}$$

and

$$R(z,x) = A(\psi(z)) \* A(\psi(x)) + b \cdot \mathbf{1},\tag{4}$$

where $A(\cdot)$ denotes the attention mechanism applied to a feature map $\psi(\cdot)$, which learns to highlight the target information by assigning positive weights to important features. Eqs. (2)–(4) represent the three common ways of integrating attention mechanisms, respectively.
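The three integration strategies differ only in where the attention module is applied before the correlation. The sketch below makes that explicit; the simple channel-gain attention and all function names are illustrative assumptions, and any cross-correlation routine can be supplied as `xcorr`.

```python
import numpy as np

def attention(feat):
    """Toy attention A(.): a sigmoid channel gain derived from global
    average pooling, applied as a residual re-weighting. This stands in
    for a learned attention module and is an assumption, not a real one."""
    gain = 1.0 / (1.0 + np.exp(-feat.mean(axis=(0, 1))))
    return feat * (1.0 + gain)

def response_eq2(z, x, xcorr):
    return xcorr(attention(z), x)             # attention on the template map

def response_eq3(z, x, xcorr):
    return xcorr(z, attention(x))             # attention on the search map

def response_eq4(z, x, xcorr):
    return xcorr(attention(z), attention(x))  # attention on both maps
```

Note that the second and third variants must re-run the attention module on every frame's search features, which is the speed cost discussed in the text.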

**Figure 2** illustrates a general overview of these three common types of attention integration into the baseline siamese tracker. The backbone of the siamese network learns through parameter sharing. The CNN feature extractor networks are fully convolutional and able to take images of any size. After computing features from both branches, a cross-correlation operation produces a response map for the similarity score between the target and search images. The difference between the baseline and an attention-based siamese tracker is that the baseline does not use any attentional features, whereas attention-based trackers use attentional features to produce the response map.

The attention on the template feature map (illustrated in **Figure 2(a)**) applies the attention mechanism only to the template/target feature, which improves the network's target representation and discrimination ability. A better target representation is essential for better tracker performance. The attention on the search feature map (shown in **Figure 2(b)**) integrates the attention mechanism into the search branch of the underlying siamese tracker. Since, in siamese-based trackers,

#### **Figure 2.**

*The common approaches of integrating attention mechanisms into the baseline siamese tracking framework.*

the target branch is usually fixed after computing the first frame of the video sequence, the search branch is responsible for all subsequent frames of the


video. Therefore, an attention mechanism added to the search branch must be computed for every video frame, which seriously hinders tracking speed. Integrating the attention mechanism into both branches (illustrated in **Figure 2(c)**) performs the similarity score computation on attentional features instead of on typical CNN features. This type of attentional siamese architecture usually shows less discrimination on challenging sequences and reduces tracking speed because of the attention network in the search branch.

Attention on the template branch is the most popular of these three integration strategies. Another consideration is how many attention modules are used: the number of attention mechanisms integrated into the baseline siamese architecture is another important factor for improving siamese tracker performance. This section discusses the two most common and popular ways of utilizing attentional features to improve tracking performance with little parameter overhead.

#### *3.2.1 Single attention mechanism for visual tracking*

Many challenges are encountered when tracking an object with the basic siamese pipeline in challenging scenarios. Candidates similar to the template appear, and the correct object must be identified among all of these candidates. A tracker with weak discrimination ability fails to identify the most important object features during tracking on challenging sequences, such as occlusion and cluttered background, which results in unexpected tracking failure. A robust discriminative mechanism is needed to increase the siamese network's performance and deal with such issues. Therefore, incorporating an attention mechanism into the underlying siamese network improves overall tracking performance, particularly when tackling challenging scenarios.

It has been widely observed that the channel attention mechanism [16] is beneficial for prioritizing object features, and it is the most popular singly employed attention mechanism for visual tracking. It is one of the most common approaches to improving siamese-based tracker performance in terms of success and precision scores. Channel attention exploits the idea that different channels learn different features. **Figure 3(a)** shows a channel attention mechanism based on max-pooled and global average-pooled features. Max pooling highlights the finer and more distinct object attributes of each channel, whereas global average pooling offers a general overview of each channel's contribution. The max-pooled and average-pooled features are therefore fused after a fully connected neural operation. The fused feature is normalized by a sigmoid operation and added to the original CNN feature using a residual skip connection.
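The dual-pooled channel attention just described can be sketched as follows. This is a minimal NumPy sketch: the two-layer shared MLP uses random weights purely for illustration (in a tracker they would be learned end-to-end), and the reduction ratio `r` is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(feat, w1, w2):
    """Dual-pooled channel attention: max- and average-pooled channel
    descriptors pass through a shared two-layer MLP, are summed, squashed
    by a sigmoid, and applied to the input via a residual connection."""
    max_desc = feat.max(axis=(0, 1))    # finer, more distinct channel cues
    avg_desc = feat.mean(axis=(0, 1))   # overall channel contribution

    def mlp(d):
        return np.maximum(d @ w1, 0.0) @ w2  # shared FC layers with ReLU

    gate = 1.0 / (1.0 + np.exp(-(mlp(max_desc) + mlp(avg_desc))))  # sigmoid
    return feat + feat * gate            # residual skip connection

# Example: a feature map with 8 channels and a reduction ratio of 2.
c, r = 8, 2
w1 = rng.standard_normal((c, c // r))
w2 = rng.standard_normal((c // r, c))
feat = rng.standard_normal((6, 6, c))
out = channel_attention(feat, w1, w2)
```

Because the gate lies in (0, 1) and is added residually, each channel's response is scaled by a factor between 1 and 2, amplifying informative channels without suppressing the original feature.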

The following subsection presents some state-of-the-art tracking frameworks to overview the single attention mechanism-based siamese visual object tracking.

• IMG-Siam [14]: IMG-Siam combines a channel attention mechanism and a matting guidance module with a siamese network. **Figure 4** represents IMG-Siam. It incorporates the channel attention mechanism into the siamese network to improve the matching model. During online tracking, IMG-Siam uses superpixel matting to separate the foreground from the background of the template image. The foreground information is input to the fully convolutional network after obtaining the features from the convolution layers. The features from the initial and matted templates are fed to the channel attention network to learn attentional features. Both attentional features are fused for the cross-correlation operation with

#### **Figure 3.**

*Channel attention and spatial attention networks.*

#### **Figure 4.** *IMG-Siam tracking framework [14].*

the search image features to produce a response map. The response map is used to locate the target in the corresponding search image. The IMG-Siam channel attention mechanism computes only the global average-pooled features rather than also considering the max-pooled features. After integrating the channel attention module, IMG-Siam's performance improved over the baseline


siamese tracker. Although the performance has improved, using only the average-pooled feature leaves the tracker susceptible to real challenges, including occlusion, deformation, fast motion, and low resolution.


**Figure 5.** *SiamFRN tracking framework [12].*

**Figure 6.** *SA-Siam tracking framework [11].*

An important design choice for SA-Siam is that these two branches are trained separately to preserve the heterogeneity of the features.

Moreover, the authors integrate a channel-wise attention mechanism in the semantic branch of the tracker. SA-Siam considers only max-pooled channel-wise features for acquiring finer details of the target. The motivation for using the channel attention mechanism in the SA-Siam framework is to learn the channel-wise weights corresponding to the channels activated around the target position. The convolution features of the last two layers are selected for the semantic branch because high-level features are better for learning semantic information. The low-level convolutional features focus on preserving the location information of the target. However, the high-level features, that is, semantic features, are robust to changes in the object's appearance, but they cannot retain good discrimination ability. Therefore, the tracker suffers poor performance when similar objects appear in a scene or the background is not distinguishable from the target object. Incorporating the attention mechanism into the SA-Siam framework helps alleviate such problems and enhances its performance in cluttered scenarios.

#### *3.2.2 Multiple attention mechanisms for visual tracking*

Multiple attention mechanisms are employed instead of single attention to further improve tracker performance in challenging scenarios. RASNet [13] and SCSAtt [10] use multiple attention mechanisms in their tracking frameworks to enhance baseline siamese tracker performance. With multiple attention mechanisms, each attention module is responsible for learning one important aspect of the target; together, they learn to identify and locate the target more accurately. This subsection describes the siamese-based trackers that incorporate multiple attention mechanisms.

• RASNet [13]: The residual attentional siamese network (RASNet) is proposed by Wang et al. [13]. It incorporates three attention mechanisms: general attention, residual attention, and channel attention. **Figure 7** represents the RASNet tracker. The RASNet design allows the network to learn efficient feature representations and better discrimination. It employs an hourglass-like convolutional neural network (CNN) for learning differently scaled feature representations and contextualization. Since RASNet considers residual-based learning, it enables a


**Figure 7.** *RASNet tracking framework [13].*

network to encode and learn a more adaptive target representation from multiple levels. It also investigates a variety of attentional techniques to adapt offline feature representation learning for tracking a specific target. All training operations in RASNet are completed during the offline phase to ensure efficient tracking performance. The tracker's general attention mechanism gradually converges toward the center, similar to a Gaussian distribution. It treats the center position of the training samples as more important than the peripheral parts, which is tremendously beneficial for training the siamese network. A residual attention module is incorporated to improve the general attention module's performance; together they are called the dual attention (DualAtt) model. The residual module helps to learn better representations and reduces bias on the training data. Furthermore, the channel attention module is integrated into a single branch of the siamese network to improve the network's discrimination ability, learning through channel-wise features.

• SCSAtt [10]: Stacked channel-spatial attention learning (SCSAtt) employs channel attention and spatial attention mechanisms together. Channel attention is used to learn "what" information, and spatial attention focuses on the location information by learning "where" information about the target. To improve tracking performance with end-to-end learning, SCSAtt combines the "what" and "where" information modules and focuses on the most nontrivial parts of the object. **Figure 3** shows the channel attention and spatial attention mechanisms. **Figure 8** illustrates the SCSAtt tracker combining channel attention and spatial attention. The overall framework tries to balance the tracker's accuracy (success and precision) and speed. SCSAtt extends the baseline siamese network by incorporating stacked channel-spatial attention in the target branch to handle challenges. The SCSAtt channel attention and spatial attention modules consider max-pooled and global average-pooled features together to learn better target representation and discrimination. These improved features help the network to locate and identify the target in challenging scenarios, such as background clutter, fast motion, motion blur, and scale variation. SCSAtt does not employ any updating mechanism in the tracking framework and considers only a pretrained model during testing, which helps to ensure fast tracking performance.

**Figure 8.** *SCSAtt tracking framework [10].*
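The stacking order ("what" first, then "where") can be sketched as below. The fixed sigmoid gate is an assumed stand-in for the learned convolution over the pooled maps, and the channel module is passed in as a parameter so that any channel-attention implementation can be stacked in front.

```python
import numpy as np

def spatial_attention(feat):
    """'Where' attention: channel-wise max- and average-pooled maps form a
    spatial gate (a fixed sigmoid here, standing in for a learned conv)."""
    max_map = feat.max(axis=2)   # (H, W): strongest response per location
    avg_map = feat.mean(axis=2)  # (H, W): overall response per location
    gate = 1.0 / (1.0 + np.exp(-(max_map + avg_map) / 2.0))
    return feat * gate[..., None]

def stacked_channel_spatial(feat, channel_attention):
    """SCSAtt-style stacking: channel ('what') attention first,
    then spatial ('where') attention on its output."""
    return spatial_attention(channel_attention(feat))
```

Since both gates multiply rather than replace the features, the stacked module preserves the feature map's shape, so it drops into the target branch without changing the cross-correlation step.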

## **4. Experimental analysis and results**

This section describes the experimental analysis and compares the results of visual trackers on the OTB benchmark. The most popular comparison on the OTB benchmark is the OTB2015 benchmark [17, 18]. It is also known as the OTB100 benchmark because it consists of 100 challenging video sequences for evaluating tracking performance. A subset of the OTB100 benchmark, named the OTB50 benchmark, is also considered for evaluating tracking performance; it contains the 50 most challenging of the hundred sequences. The OTB video sequences are categorized into 11 challenging attributes: scale variation (SV), background clutter (BC), fast motion (FM), motion blur (MB), low resolution (LR), in-plane rotation (IPR), out-of-plane rotation (OPR), deformation (DEF), occlusion (OCC), illumination variation (IV), and out-of-view (OV).

Usually, one-pass evaluation (OPE) is used to compute the success and precision plots. The percentage of overlap between the predicted and ground-truth bounding boxes is taken as the success score, and the center location error between the predicted and ground-truth bounding boxes is taken as the precision score. The overlap score is computed by the intersection over union (IOU), and the center location error is computed by the center pixel distance. Success and precision plots are drawn with the tracking community's OTB toolkit based on these two scores. The precision and success thresholds are a 20-pixel distance and a 0.5 IOU score, respectively; predictions within these thresholds are considered accurate tracking. The following subsections present a quantitative and qualitative analysis, also comparing tracking speed.
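The two per-frame OPE scores can be computed as sketched below; boxes are assumed to follow the `(x, y, width, height)` convention, and the default thresholds mirror the 20-pixel and 0.5-IOU values above.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap (success) score between two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    """Precision score: pixel distance between box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return float(np.hypot((ax + aw / 2) - (bx + bw / 2),
                          (ay + ah / 2) - (by + bh / 2)))

def is_accurate(pred, gt, iou_thr=0.5, dist_thr=20.0):
    """A frame counts as accurately tracked under the OPE thresholds."""
    return iou(pred, gt) >= iou_thr and center_error(pred, gt) <= dist_thr
```

The success and precision plots are then obtained by sweeping the IOU and distance thresholds over a sequence and plotting the fraction of frames that pass at each threshold.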

#### **4.1 Quantitative and qualitative comparison and analysis**

To provide a fair comparison, we carefully selected various trackers, including attentional and non-attentional siamese-based trackers. **Figures 9** and **10** show the compared trackers' results on the OTB100 and OTB50 benchmarks, respectively. The compared trackers in **Figures 9** and **10** are siamese-based. Among them, SA-Siam [11], SCSAtt [10], MemDTC [35], MemTrack [34], and SiamFRN [12] utilize attention mechanisms to improve the baseline SiamFC tracker [9]. SiamFC achieves 77.1% and 69.2% for overall precision, and 58.2% and 51.6% for overall success, on the OTB100 and OTB50 benchmarks, respectively. The attention-based tracker SA-Siam shows the


**Figure 9.** *Compared trackers' results on OTB100 benchmark.*

**Figure 10.** *Compared trackers' results on OTB50 benchmark.*

dominating performance among the compared trackers. It acquires 86.4% and 65.6% precision and success scores on the OTB100 benchmark, respectively. On the OTB50 benchmark, it achieves 82.3% in precision score and 61.0% in success score.

The overall performance of the attention-integrated siamese trackers is higher than that of the other siamese-based trackers. Among the other siamese trackers, GradNet's performance is better due to its expensive tracking-time operations. GradNet achieves 86.1% and 82.5% for precision, and 63.9% and 59.7% for success, on the OTB100 and OTB50 benchmarks. The performance of the other siamese-based trackers, including DSiamM, SiamTri, and CFnet, is not much improved over the original siamese pipeline. However, attention combined with the siamese baseline improves the tracker's overall performance. The attention-integrated siamese trackers SCSAtt and SiamFRN utilize the same channel attention mechanism inside their frameworks. They achieve 82.8% and 77.8% for precision, and 60.2% and 58.1% for success, respectively, on the OTB50 benchmark. The trackers with the LSTM attention network (MemDTC and MemTrack) also perform better than the baseline siamese tracker. Both follow a similar attention mechanism but consider different features for memory, which makes the performance difference. MemDTC achieves 84.5% for precision, which is 2.5% higher than the MemTrack score (82.0%). Similarly, the gap between them is 1.1% for success scores on the OTB100 benchmark. MemDTC also performs better than MemTrack on the OTB50 benchmark.

**Figures 11** and **12** show the trackers' performance comparison on the challenging attributes of the OTB100 benchmark in terms of precision and success plots. For better visualization of these two figures, the interested reader may check this link:

#### **Figure 11.**

*Compared trackers' performance on the challenging attributes of OTB100 benchmark in terms of precision plots.*

**Figure 12.**

*Compared trackers' performance on the challenging attributes of OTB100 benchmark in terms of success plots.*


https://github.com/maklachur/VOT\_Book-Chapter. The SCSAtt tracker performs better than the other trackers in the precision plots in several challenging scenarios, such as scale variation, illumination variation, deformation, motion blur, and fast motion. SCSAtt integrates channel attention and spatial attention mechanisms into the baseline SiamFC model. Furthermore, the channel attention-based SA-Siam tracker performs better than the other siamese-based trackers, including CFnet, DSiamM, and SiamTri. SA-Siam also dominates the other trackers on the OTB100 benchmark in the success plots of the challenging attributes. It outperforms the other trackers except on the motion-blur challenge, where SCSAtt performs best in the success plots.

**Figure 13** illustrates the qualitative comparison results among trackers over several challenging sequences from the OTB100 benchmark. For better visualization of this result, the interested reader may check this link: https://github.com/maklachur/VOT\_Book-Chapter. The overall tracking accuracy of the attention-based trackers is better than that of the other trackers. They track the target object more accurately, with tighter bounding boxes, and better separate it from the background. We observed that most trackers fail to handle the target in the car sequence, but the MemTrack and MemDTC trackers manage to provide better tracking. Similarly, SCSAtt, SA-Siam, and SiamFRN show accurate tracking in the other compared sequences, whereas the non-attentional trackers struggle to locate the target accurately.

## **4.2 Speed comparison and analysis**

In order to compare tracking speed, we selected trackers from our previous quantitative and qualitative comparisons. **Table 1** shows the speed comparison results in terms of FPS and the corresponding success and precision scores on the OTB100 benchmark. We observed that SiamFC (86 FPS) shows the highest tracking speed, but it achieves the lowest accuracy in terms of success and precision. Therefore, it cannot fully exploit its tracking speed. The motivation for designing trackers is not just to improve the tracking speed; they

#### **Figure 13.**

*The qualitative comparison results among trackers over several challenging sequences (carScale, liquor, motorRolling, skating2–2, and soccer) from the OTB100 benchmark.*


*\* The red highlight represents the best, green represents the second-best, and blue represents the third-best performance. \*\*The RASNet paper did not provide a precision score, which is why we do not include it in our comparison.*

#### **Table 1.**

*The speed comparison results in terms of FPS and corresponding success and precision scores on the OTB100 benchmark.*

should be able to track the target in challenging scenarios. Preserving a balance between speed and accuracy is essential when designing a tracker for real-time applications. Most of the trackers in our comparison show better performance than SiamFC. RASNet and SCSAtt achieve the second-highest and third-highest tracking speeds, respectively. They also show better success scores, demonstrating a balanced trade-off between speed and accuracy.

Most trackers presented in **Table 1** show high tracking speed because they leverage the SiamFC pipeline and compute the template image only for the very first frame of the video sequence. However, MemDTC achieves the lowest tracking speed among the compared trackers, 40 FPS. It utilizes a memory mechanism for updating the target template during tracking, which reduces its operational efficiency. SA-Siam, Img-Siam, MemTrack, and SiamFRN achieve 50 FPS, 50 FPS, 50 FPS, and 60 FPS, respectively. The motivation of these trackers is to maintain a balance between tracking speed and accuracy while utilizing the siamese tracking framework to handle challenging sequences.

## **5. Conclusion and future directions**

The attention mechanism is simple yet powerful for improving a network's learning ability. It is beneficial for better target representation and enhances the tracker's discrimination ability with little parameter overhead. The baseline siamese tracker does not perform well in challenging scenarios due to insufficient feature learning and its inability to distinguish foreground from background. The attention mechanism is integrated into the baseline tracker pipeline to overcome these underlying siamese issues and improve tracking performance. Attention prioritizes features by computing a relevance weight for each feature map. Therefore, it learns to highlight the important features of the target, which helps to handle challenges during tracking. In our study, we presented a detailed discussion of attention embedding in siamese trackers. The attention-based siamese trackers show outstanding performance and dominate the other non-attentional trackers in the compared results. For example, SA-Siam and SCSAtt achieve high tracking accuracy in the success and precision plots on most challenging attributes, demonstrating the robustness of these models.

Furthermore, we observed that employing the attention mechanism in the target branch performs better than integrating it only in the search branch or in both branches. Moreover, combining multiple attention mechanisms, rather than a single one, helps to focus on both the target class and the location information. Since location information is important for correctly predicting the object's bounding box, a spatial information-focused module helps to improve the tracker's effectiveness on challenges. The RASNet and SCSAtt trackers use multiple attention mechanisms in their pipelines to handle challenging sequences. The trackers' performance on the challenging attributes in **Figures 11** and **12** demonstrates the advantages of the attention mechanism. Therefore, to improve overall performance on challenges while preserving the balance between accuracy and speed, integrating attention mechanisms is recommended when designing future tracking frameworks.

## **Acknowledgements**

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2019R1A2C1010786).

## **Author details**

Md. Maklachur Rahman\* and Soon Ki Jung

Virtual Reality Lab, School of Computer Science and Engineering, Kyungpook National University, South Korea

\*Address all correspondence to: maklachur@gmail.com; maklachur@knu.ac.kr

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Attard L, Farrugia RA. Vision based surveillance system. In: 2011 IEEE EUROCON-International Conference on Computer as a Tool. IEEE; 2011. pp. 1-4

[2] Janai J, Güney F, Behl A, Geiger A. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519. 2017;**12**:1-308

[3] Lu WL, Ting JA, Little JJ, Murphy KP. Learning to track and identify players from broadcast sports videos. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;**35**(7): 1704-1716

[4] Pavlovic VI, Sharma R, Huang TS. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997; **19**(7):677-695

[5] Song Y, Ma C, Gong L, Zhang J, Lau RW, Yang MH. Crest: Convolutional residual learning for visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision. Italy: IEEE; 2017. pp. 2555-2564

[6] Danelljan M, Robinson A, Khan FS, Felsberg M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: European Conference on Computer Vision. Netherland: Springer; 2016. pp. 472-488

[7] Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M. Eco: Efficient convolution operators for tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: IEEE; 2017. pp. 6638-6646

[8] Nam H, Han B. Learning multidomain convolutional neural networks for visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Nevada: IEEE; 2016. pp. 4293-4302

[9] Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH. Fullyconvolutional siamese networks for object tracking. In: European Conference on Computer Vision. Netherland: Springer; 2016. pp. 850-865

[10] Rahman MM, Fiaz M, Jung SK. Efficient visual tracking with stacked channel-spatial attention learning. IEEE Access. 2020;**8**:100857-100869

[11] He A, Luo C, Tian X, Zeng W. A twofold siamese network for real-time object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Utah: IEEE; 2018. pp. 4834-4843

[12] Rahman M, Ahmed MR, Laishram L, Kim SH, Jung SK, et al. Siamese highlevel feature refine network for visual object tracking. Electronics. 2020;**9**(11): 1918

[13] Wang Q, Teng Z, Xing J, Gao J, Hu W, Maybank S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Utah: IEEE; 2018. pp. 4854-4863

[14] Qin X, Fan Z. Initial matting-guided visual tracking with Siamese network. IEEE Access. 2019;**03**:1

[15] Fiaz M, Rahman MM, Mahmood A, Farooq SS, Baek KY, Jung SK. Adaptive feature selection Siamese networks for visual tracking. In: International


Workshop on Frontiers of Computer Vision. Japan: Springer; 2020. pp. 167-179

[16] Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). Germany: Springer; 2018. pp. 3-19

[17] Wu Y, Lim J, Yang MH. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;**37**(9): 1834-1848

[18] Wu Y, Lim J, Yang MH. Online object tracking: A benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Oregon: IEEE; 2013. pp. 2411-2418

[19] Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R. Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems. US: NIPS; 1994. pp. 737-744

[20] Tao R, Gavves E, Smeulders AW. Siamese instance search for tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Nevada: IEEE; 2016. pp. 1420-1429

[21] Held D, Thrun S, Savarese S. Learning to track at 100 fps with deep regression networks. In: European Conference on Computer Vision. Netherland: Springer; 2016. pp. 749-765

[22] Chen K, Tao W. Once for all: A two-flow convolutional neural network for visual tracking. IEEE Transactions on Circuits and Systems for Video Technology. 2018;**28**(12): 3377-3386

[23] Valmadre J, Bertinetto L, Henriques J, Vedaldi A, Torr PH. End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: IEEE; 2017. pp. 2805-2813

[24] Dong X, Shen J. Triplet loss in siamese network for object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV). Germany: Springer; 2018. pp. 459-474

[25] Guo Q, Feng W, Zhou C, Huang R, Wan L, Wang S. Learning dynamic siamese network for visual object tracking. In: Proceedings of the IEEE International Conference on Computer Vision. Italy: IEEE; 2017. pp. 1763-1771

[26] Morimitsu H. Multiple context features in Siamese networks for visual object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV). Germany: Springer; 2018

[27] Khan FS, Van de Weijer J, Vanrell M. Modulating shape features by color attention for object recognition. International Journal of Computer Vision. 2012;**98**(1):49-64

[28] Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, et al. Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. pp. 3146-3154

[29] Xu J, Zhao R, Zhu F, Wang H, Ouyang W. Attention-aware compositional network for person reidentification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Utah: IEEE; 2018. pp. 2119-2128

[30] Li D, Wen G, Kuai Y, Porikli F. End-to-end feature integration for correlation filter tracking with channel attention. IEEE Signal Processing Letters. 2018;**25**(12):1815-1819

[31] Fiaz M, Mahmood A, Baek KY, Farooq SS, Jung SK. Improving object tracking by added noise and channel attention. Sensors. 2020;**20**(13):3780

[32] Rahman MM. A DWT, DCT and SVD based watermarking technique to protect the image piracy. arXiv preprint arXiv:1307.3294. 2013

[33] Rahman MM, Ahammed MS, Ahmed MR, Izhar MN. A semi-blind watermarking technique for copyright protection of image based on DCT and SVD domain. Global Journal of Research in Engineering. 2017;**16**

[34] Yang T, Chan AB. Learning dynamic memory networks for object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV). Germany: Springer; 2018. pp. 152-167

[35] Yang T, Chan AB. Visual tracking via dynamic memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019

[36] Zhu Z, Wu W, Zou W, Yan J. End-to-end flow correlation tracking with spatial-temporal attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Utah: IEEE; 2018. pp. 548-557

[37] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. US: NIPS; 2012. pp. 1097-1105

## **Chapter 3**

## Robust Template Update Strategy for Efficient Visual Object Tracking

*Awet Haileslassie Gebrehiwot, Jesus Bescos and Alvaro Garcia-Martin*

## **Abstract**

Real-time visual object tracking is an open problem in computer vision, with multiple applications in industry, such as autonomous vehicles, human-machine interaction, intelligent cinematography, automated surveillance, and autonomous social navigation. The challenge of tracking a target of interest is critical to all of these applications. Recently, tracking algorithms that use siamese neural networks trained offline on large-scale datasets of image pairs have achieved the best performance, exceeding real-time speed on multiple benchmarks. Results show that siamese approaches can enhance tracking capabilities by learning deeper features of the object's appearance. SiamMask utilized the power of siamese networks and supervised learning approaches to solve the problem of arbitrary object tracking at real-time speed. However, its practical applications are limited due to failures encountered during testing. In order to improve the robustness of the tracker and make it applicable for the intended real-world application, two improvements have been incorporated, each addressing a different aspect of the tracking task. The first is a data augmentation strategy that considers both motion blur and low resolution during training. It aims to increase the robustness of the tracker against motion-blurred and low-resolution frames during inference. The second improvement is a target template update strategy that utilizes both the initial ground truth template and a supplementary updatable template; it uses the score of the predicted target to decide when to update, avoiding template updates during severe occlusion. All of the improvements were extensively evaluated and have achieved state-of-the-art performance on the VOT2018 and VOT2019 benchmarks.
Our method (VPU-SiamM) has been submitted to the VOT-ST 2020 challenge, where it ranked 16th out of 38 submitted tracking methods according to the expected average overlap (EAO) metric. The VPU\_SiamM implementation can be found in the VOT2020 trackers repository<sup>1</sup>.

**Keywords:** real-time, tracking, template update, Siamese

## **1. Introduction**

Visual object tracking (VOT), commonly referred to as target tracking, is an open problem in computer vision; this is due to a broad range of possible applications and

<sup>1</sup> https://www.votchallenge.net/vot2020/trackers.html

potential tracking challenges. Thus, it has been divided into sub-challenges according to several factors, which include: the number of targets of interest, the number of cameras, the type of data (i.e., medical, depth, thermal, or RGB images), static or moving camera, offline or online (real-time) processing.

Visual object tracking is the process of estimating and locating a target over time in a video sequence and assigning a consistent label to the tracked object across each frame. VOT algorithms have been utilized as building blocks in more complex computer vision applications such as traffic flow monitoring [1], human-machine interaction [2], medical systems [3], intelligent cinematography [4], automated surveillance [5], autonomous social navigation [6], and activity recognition [7]. Real-time visual target tracking is the process of locating and associating the target of interest in consecutive video frames while the action is taking place in real time. It plays an essential role in time-sensitive applications such as autonomous mobile robot control, where the tracker must keep track of the target of interest while the viewpoint changes due to the movement of the target or the robot. In such a scenario, the tracking algorithm must be accurate and fast enough to detect sudden changes in the observed environment and act accordingly to prevent losing track of a quickly moving target of interest.

Since the start of the Visual Object Tracking (VOT) real-time challenge in 2017, siamese network-based tracking algorithms have achieved top performance and won the VOT real-time challenge by a considerable margin over the rest of the trackers. Nearly all of the top ten trackers, including the winners, applied siamese networks; the dominant methodology in real-time tracking therefore appears to be siamese-based. A siamese network aims to learn a similarity function. It has a Y-shaped network architecture that takes two input images and returns their similarity as an output. Siamese networks are utilized to compare the similarity between the template and candidate images to determine whether the two inputs share an identical pattern. In the past few years, a series of state-of-the-art siamese-based trackers have been proposed, and all of them utilize embedded features computed by a CNN to measure similarity and produce various types of output, such as a similarity score (probability measure), a response map (two-dimensional similarity score map), and the bounding box location of the target.
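The core idea can be illustrated with a minimal, hypothetical sketch: a single linear map stands in for the shared CNN branch, and cosine similarity produces the scalar score. None of this is the actual code of any tracker discussed here.

```python
import math

def embed(x, weights):
    # Toy stand-in for the shared convolutional branch: a linear map.
    # Both inputs pass through this SAME function -- the defining
    # property of the Y-shaped siamese architecture.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def similarity(template, candidate, weights):
    # Scalar similarity score (cosine of the two embeddings), one of
    # the output types a siamese tracker can produce.
    a, b = embed(template, weights), embed(candidate, weights)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Identical inputs embed to the same point and score 1.0; dissimilar inputs score lower, which is exactly the discriminative signal the trackers above learn at scale.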

Luca Bertinetto et al. [8] proposed the fully-convolutional siamese network (SiameseFC) to address general similarity learning between a target image and a

#### **Figure 1.**

*Fully-convolutional Siamese architecture. The output is a scalar-valued score map whose dimension depends on the size of the search image [8].*

#### *Robust Template Update Strategy for Efficient Visual Object Tracking DOI: http://dx.doi.org/10.5772/intechopen.101800*

search image, as presented in **Figure 1**. According to the VOT winner rules, the winning real-time tracker of VOT2017 [9] was SiamFC. SiamFC applies a fully-convolutional siamese network, trained offline, to locate an exemplar (template) image inside a larger search image (**Figure 1**). The network is fully convolutional w.r.t. the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of the two inputs. The deep convolutional network is trained offline on the "ILSVRC VID" dataset [10] to address a general similarity learning problem and maximize target discrimination power. During tracking, SiamFC takes two images and infers a response map using the learned similarity function. The new target position is determined by the maximum value of the response map, where the similarity is highest (**Figure 1**). As an improvement to siamese-based tracking methods, Qiang Wang et al. [11] proposed SiamMask, aiming to improve the ability of the SiamFC network to differentiate between background and foreground by augmenting the loss with a binary segmentation task. SiamMask performs a depth-wise cross-correlation operation on a channel-by-channel basis to keep the number of channels unchanged. The result of the depth-wise cross-correlation, denoted RoW (response of a candidate window), is then fed into three branches: segmentation, regression, and classification (**Figure 2**).
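The response-map search can be sketched in plain Python over 2-D feature grids; this is a toy version of the idea, not the SiamFC implementation, which cross-correlates deep feature tensors on the GPU.

```python
def response_map(search, template):
    # Dense sliding-window cross-correlation of a template over a
    # search region; the output score map has size
    # (H - h + 1) x (W - w + 1), mirroring the SiamFC score map.
    H, W = len(search), len(search[0])
    h, w = len(template), len(template[0])
    return [[sum(search[y + dy][x + dx] * template[dy][dx]
                 for dy in range(h) for dx in range(w))
             for x in range(W - w + 1)]
            for y in range(H - h + 1)]

def peak(scores):
    # The new target position is the location of the maximum response.
    _, y, x = max((v, y, x)
                  for y, row in enumerate(scores)
                  for x, v in enumerate(row))
    return y, x
```

A template matching a bright patch in the search region produces its strongest response at the patch location, which `peak` then reports as the new target position.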

Seven of the top ten real-time trackers (SiamMargin [12], SiamDWST [13], SiamMask [11], SiamRPNpp [14], SPM [15], and SiamCRF-RT) are based on siamese correlation combined with bounding box regression. In contrast, the top performers of the VOT2019 real-time challenge are from the class of classical siamese correlation trackers and siamese trackers with region proposals [16]. Although these methods showed significant improvement, little attention was paid to how to carefully update the target template as tracking progresses. In all top performers, the target template is initialized in the first frame and then kept fixed during the tracking process. However, diverse variations of the target usually occur during tracking, i.e., camera orientation, illumination change, self-rotation, self-deformation, scale, and appearance change. Thus, failing to update the target template leads to early failure of the tracker. In such scenarios, it is crucial to adapt the target template model to the current target appearance. In addition, most tracking methods fail when motion-blurred or low-resolution frames appear in the video sequence, as depicted in **Figures 3** and **4**. We

#### **Figure 2.**

*An illustration of SiamMask with three branches, respectively segmentation, regression, and classification branches; where* ∗ *<sup>d</sup> denotes depth-wise cross correlation [11].*

believe that this case arguably arises from the complete lack of similar training samples. Therefore, one must incorporate a data-augmentation strategy that considers both motion blur and low resolution during training, to significantly increase the diversity of the data available for training without actually gathering new data.

## **2. Method**

The problem of establishing a correspondence between a single target in consecutive frames is affected by factors such as initializing a track, updating it robustly, and ending the track. The tracking algorithm receives an input frame from the camera module and performs visual tracking over the frame following a siamese network-based tracking approach. Since developing a new tracking algorithm from scratch is beyond the scope of this chapter, a state-of-the-art siamese-based tracking algorithm called SiamMask [11], one of the top performers in the VOT2019 real-time challenge, is used as the backbone of our tracking algorithm.

To mitigate the limitations associated with siamese-based tracking methods, this section presents two improvements on top of the SiamMask implementation. *The VPU\_SiamM implementation can be found in the VOT2020 trackers repository*<sup>2</sup>.

## **2.1 Data-augmentation**

As mentioned in the introduction, the siamese-based tracker fails when motion-blurred or low-resolution frames appear in the video sequence, as depicted in **Figures 3** and **4**. To address these problems, a tracking algorithm should incorporate a data-augmentation strategy that considers both motion blur and low resolution during training. Since data augmentation is a strategy that significantly increases the diversity of datasets available for training without actually gathering new data, we implemented the data augmentation techniques explained in the following sub-sections.

#### **Figure 3.**

*An example of SiamMask failure due to motion blur; the green and yellow bounding boxes indicate the ground truth and the predicted target, respectively.*

<sup>2</sup> https://www.votchallenge.net/vot2020/trackers.html


#### **Figure 4.**

*An example of SiamMask failure due to low resolution; the green and yellow bounding boxes indicate the ground truth and the predicted target, respectively.*

## *2.1.1 Data-augmentation for motion-blur*

Kernel filters are a prevalent technique in image processing to blur images. These filters work by sliding an *n* × *n* matrix across an image, here with a Gaussian blur filter, resulting in a blurry image. Intuitively, blurring images for data augmentation could lead to higher resistance to motion blur during testing [17]. **Figure 5** illustrates an example of a motion-blurred frame generated by the developed data-augmentation technique.
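As a sketch of the kernel-filter idea, the following applies a horizontal line kernel (a common way to simulate motion blur) to a grayscale image stored as a 2-D list; the kernel shape and border handling are illustrative assumptions, not the chapter's actual pipeline.

```python
def motion_blur(img, k=5):
    # Convolve each row with a 1 x k averaging ("line") kernel.
    # A directional average of neighbouring pixels smears the image
    # horizontally, mimicking camera/object motion during exposure.
    H, W = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for d in range(-r, r + 1):
                xx = min(max(x + d, 0), W - 1)  # clamp at the image border
                acc += img[y][xx]
            out[y][x] = acc / k
    return out
```

A single bright pixel is spread across its k-pixel neighbourhood, while uniform regions are unchanged; applied to training frames, this yields blurred variants without collecting new data.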

#### *2.1.2 Data-augmentation for low-resolution*

We followed the approach of Zhangyang Wang et al. [18] to generate a low-resolution dataset. During training, the original (high-resolution) images are first downscaled by *scale* = 4 and then upscaled back by *scale* = 4 with nearest-neighbor interpolation to obtain low-resolution images. A small additive Gaussian noise is applied as a default data augmentation during training. An illustration of a low-resolution frame generated by the developed data-augmentation technique is depicted in **Figure 6**.
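A minimal sketch of this augmentation on a grayscale image stored as a 2-D list follows; the noise standard deviation is an illustrative assumption, since neither the chapter nor [18] fixes its value here.

```python
import random

def low_res(img, scale=4, noise_std=2.0, seed=0):
    # Downscale by `scale` (keep every scale-th pixel), upscale back
    # with nearest-neighbour interpolation (each low-res pixel fills a
    # scale x scale block), then add small Gaussian noise.
    rng = random.Random(seed)
    H, W = len(img), len(img[0])
    small = [[img[y][x] for x in range(0, W, scale)]
             for y in range(0, H, scale)]
    return [[small[y // scale][x // scale] + rng.gauss(0.0, noise_std)
             for x in range(W)] for y in range(H)]
```

With the noise disabled, the output is visibly "blocky": every 4 × 4 block takes the value of its top-left source pixel, which is exactly the low-resolution appearance the tracker is being hardened against.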

#### **Figure 5.**

*An example of a motion-blurred frame (left image) generated from the original frame (right image) using the developed data-augmentation for motion-blur technique.*

**Figure 6.**

*An illustration of how the low-resolution data augmentation is performed (from (a) to (c)).*

## **2.2 Target template updating strategy**

The target template update mechanism is an essential step, and its robustness has become a crucial factor influencing the quality of a tracking algorithm. To tackle this problem, recent siamese trackers [19–21] have implemented a simple linear update strategy using a running average with a constant learning rate. However, a simple linear update is often inadequate to cope with the changes needed and to generalize to all potentially encountered circumstances. Lichao Zhang et al. [22] propose to replace the hand-crafted update function with a method that learns to update, using a convolutional neural network called *UpdateNet*, which aims to estimate the optimal template for the next frame. However, excessive reliance on a single updated template may suffer from catastrophic drift and the inability to recover from tracking failures.

One can argue for the importance of both the original initial template and a supplementary updatable template that incorporates up-to-date target information. To this end, we have incorporated a template update strategy that utilizes both the initial (ground truth) template *T*\_*G* and an updatable template *T*\_*i*. The initial template *T*\_*G* provides highly reliable information and increases robustness against model drift, whereas the updatable template *T*\_*i* integrates new target information at the predicted target location in the current frame. However, when a target is temporarily occluded, such as when a motorbike passes through a forest and is shielded by trees (**Figure 7**), updating the template should be avoided, as occluder content may pollute the template. Therefore, our system needs to recognize whether occlusion occurs and decide whether or not to update the template. Examples of occlusion in tracking are shown in **Figures 7** and **8**. As depicted in **Figures 7** and **8**, when the target is occluded, the score becomes small, indicating that the similarity between the tracked target in the current frame and the

#### **Figure 7.**

*Overview on how the target similarity score (red) varies under different occlusion scenario during tracking process. The similarity score is indicated in red color in the top left of each frame, VOT2019 road dataset. Where blue: Ground truth, red: Tracking result.*


#### **Figure 8.**

*Overview on how the target similarity score varies under different occlusion scenario during tracking process, VOT2019 girl dataset. Where blue: Ground truth, red: Tracking result.*

template is low. Thus, the score value can be used as the criterion for the update decision. **Figure 9** illustrates an overview of the method.
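The decision rule can be summarized in a small sketch; the function name and threshold value are illustrative assumptions, not the actual VPU_SiamM code.

```python
def update_template(t_i, predicted_patch, score, s_th=0.9):
    # Score-gated template update: adopt the newly predicted target
    # patch only when the similarity score exceeds the threshold Sth.
    # During occlusion the score drops, the update is skipped, and
    # template pollution by the occluder is avoided. The ground-truth
    # template T_G is kept fixed and is never touched by this rule.
    if score >= s_th:
        return predicted_patch  # confident: integrate new appearance
    return t_i                  # likely occluded: keep the old template
```

The key design choice is that a low score triggers no update at all, so the tracker can recover once the occluder passes, rather than drifting onto it.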

## *2.2.1 Updating with a previous and a future template*

In Section 2.2, the target template update strategy considers the target appearance only from the previous frame. In this section, we introduce an alternative template update strategy that considers both the target appearance from the previous frame and the target appearance in the future frame: it incorporates future information about the target appearance when updating the updatable template T\_i described in Section 2.2. The template updating mechanism is shown in **Figure 10**. During online tracking, the template updating and tracking procedure works as follows:

1. Tracking on the next frame *i* + 1 is performed using both the previously updated target template T\_i and the ground truth target template T\_G.

#### **Figure 9.**

*Target template update strategy: Where T\_G is the ground truth template,T\_i is an updatable template, Sth is the score threshold and P\_box is the predicted target location.*

#### **Figure 10.**

*Updating with a previous and a future template, where T\_G is ground truth target template,T\_i is previous target template and T\_i + 1 is future target template. Where green: Ground truth template, yellow: Updatable template, blue: Tracking result.*


First, a tracking procedure is applied using both the previous target template T\_i and the ground truth template T\_G to perform tracking on the next frame. Then, the updatable template T\_i is updated using the target predicted on the next frame, incorporating future information about the target. Finally, a tracking procedure is again applied to the current frame using both the updated future target template T\_i + 1 and the ground truth template T\_G.
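The three steps above can be sketched as a loop; `tracker(frame, t_g, t_i)` is a hypothetical stand-in returning (bounding box, target patch, score) and is not the actual SiamMask interface.

```python
def track_with_lookahead(frames, t_g, tracker, s_th=0.9):
    # Sketch of the previous-and-future template strategy.
    t_i = t_g                  # updatable template starts as ground truth
    boxes = []
    for i in range(len(frames) - 1):
        # 1. Track the NEXT frame using the previous template T_i and T_G.
        _, patch, score = tracker(frames[i + 1], t_g, t_i)
        # 2. Update T_i from the next frame's prediction (future
        #    information), but only when the score is high enough.
        t_next = patch if score >= s_th else t_i
        # 3. Re-track the CURRENT frame with the future template T_i+1.
        box, _, _ = tracker(frames[i], t_g, t_next)
        boxes.append(box)
        t_i = t_next
    return boxes
```

Note the one-frame latency this introduces: each frame is reported only after its successor has been seen, which is the price of using future appearance information.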

## **3. Implementation**

## **3.1 Training**

The SiamMask implementation was trained using 4 Tesla V100 GPUs. In this experiment, only the refinement module of the mask branch was trained. The training process was carried out using the COCO<sup>3</sup> and YouTube-VOS<sup>4</sup> datasets. The training was

<sup>3</sup> http://cocodataset.org/

<sup>4</sup> https://youtube-vos.org/dataset/vis/

performed over ten epochs using mini-batches of 32 samples. The data augmentation techniques described in 2.1.1 and 2.1.2 were used to generate datasets with motion blur and low resolution, respectively.

## **3.2 Tracking**

During tracking, the tracking algorithm is evaluated once per frame. The output mask is selected at the location attaining the maximum score in the classification branch, and an optimized bounding box is created from it. Finally, the highest-scoring output of the box branch is used as a reference to crop the next search frame.

## **3.3 Visual-object-tracking benchmark**

As object tracking has received significant attention in the last few decades, the number of publications on tracking-related problems has made it difficult to follow the developments in the field. One of the main reasons is that there was a lack of commonly accepted annotated datasets and standardized evaluation protocols that would allow an objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in association with ICCV2013<sup>5</sup>. Researchers from industry and academia were invited to participate in the first VOT2013 challenge, which was aimed at model-free single-object visual trackers. In contrast to related attempts at tracker benchmarking, the dataset is labeled per-frame with visual properties such as occlusion, motion change, illumination change, scale, and camera motion, offering a more systematic comparison of the trackers [23]. VOT focused on short-term tracking (no re-detection) until the VOT2017 challenge, where a new "real-time challenge" was introduced. In the real-time challenge, the tracker constantly receives images at real-time speed. If the tracker does not respond by the time the new frame becomes available, the last bounding box from the previous frame is reported as the tracking result for the current frame.

## **3.4 VOT evaluation metrics**

The VOT challenges apply a reset-based methodology. Whenever a zero overlap between the predicted bounding box and the ground truth occurs, a failure is detected, and the tracker is re-initialized five frames after the failure. There are three primary metrics used to analyze tracking performance in the VOT challenge benchmark: Accuracy (A), Robustness (R), and Expected Average Overlap (EAO) [9].

#### *3.4.1 Accuracy*

Accuracy is calculated as the average overlap between the predicted and ground truth bounding boxes during successful tracking periods [23]. The tracking accuracy at time-step *t* is defined as the overlap between the tracker-predicted bounding box $A\_t^T$ and the ground truth bounding box $A\_t^G$:

<sup>5</sup> http://www.iccv2013.org/

$$\Phi\_t = \frac{A\_t^G \cap A\_t^T}{A\_t^G \cup A\_t^T} \tag{1}$$
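Eq. (1) is the standard intersection-over-union and can be computed directly for axis-aligned boxes. A minimal sketch, assuming boxes given as (x, y, w, h) tuples (the function name is illustrative, not part of the VOT toolkit):

```python
def overlap(box_a, box_b):
    """Per-frame accuracy Phi_t (Eq. 1): intersection-over-union of
    two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (zero-sized if the boxes do not overlap).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```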

#### *3.4.2 Robustness*

Robustness measures how often the tracker loses/fails the target, i.e., how often a zero overlap between the predicted and ground truth bounding boxes occurs during tracking. The protocol specifies an overlap threshold to determine tracking failure. The number of failed frames is then divided by the total number of frames, as depicted in Eq. (2):

$$P\_\tau = \frac{\left|\{\Phi\_t \le \tau\}\_{t=1}^{N}\right|}{N} \tag{2}$$

where *τ* is the overlap threshold (zero in this case) and *N* is the number of frames over which the tracker runs. A failure is identified in a frame when the overlap (computed using Eq. (1)) does not exceed the defined threshold *τ*. Thus, the robustness of the tracker is given as the normalized number of incorrectly tracked frames.
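Given the sequence of per-frame overlaps Φ\_t from Eq. (1), Eq. (2) reduces to a counting operation; a sketch with illustrative names:

```python
def robustness(overlaps, tau=0.0):
    """Eq. (2): fraction of frames whose overlap does not exceed tau.
    With the VOT default tau = 0, this counts complete tracking failures."""
    failures = sum(1 for phi in overlaps if phi <= tau)
    return failures / len(overlaps)
```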

#### *3.4.3 Expected average overlap (EAO)*

For the purpose of ranking tracking algorithms, it is better to have a single metric. Thus, in 2015 the VOT challenge introduced Expected Average Overlap (EAO), which combines both Accuracy and Robustness. EAO estimates the average overlap that a tracker is expected to achieve on a large collection of short-term sequences with the same visual properties as the given dataset.

The EAO metric is obtained by averaging $\hat{\Phi}\_{N\_l}$ over an interval of typical sequence lengths, from $N\_{lo}$ to $N\_{hi}$:

$$\Phi = \frac{1}{N\_{hi} - N\_{lo}} \sum\_{N\_l = N\_{lo}}^{N\_{hi}} \hat{\Phi}\_{N\_l} \tag{3}$$
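Assuming the expected-overlap curve Φ̂ has already been estimated for each sequence length (the estimation itself is the involved part and is not shown), the final averaging in Eq. (3) is straightforward. In this sketch the curve is a dict indexed by length, and the sum is normalized by the number of lengths in the interval (Eq. (3) writes the normalizer as $N\_{hi} - N\_{lo}$):

```python
def expected_average_overlap(phi_hat, n_lo, n_hi):
    """Eq. (3): average the expected-overlap curve phi_hat[N] over
    sequence lengths N in the typical-length interval [n_lo, n_hi]."""
    values = [phi_hat[n] for n in range(n_lo, n_hi + 1)]
    return sum(values) / len(values)
```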

## **3.5 Experiment A: Low-resolution data-augmentation**

This first experiment is dedicated to evaluating the impact of the low-resolution data-augmentation technique. The data augmentation technique described in 2.1.2 was applied to generate low-resolution datasets during the training of the refinement module of the network.
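The details of 2.1.2 are not reproduced here, but a common way to synthesize low-resolution training samples is to downsample the image and then upsample it back to the original size. The sketch below assumes that recipe; the nearest-neighbour resampling and the scale factor are assumptions, not the chapter's exact parameters.

```python
import numpy as np

def degrade_resolution(image, factor=4):
    """Simulate a low-resolution sample: subsample the image by `factor`,
    then upsample back to the original size with nearest-neighbour repetition."""
    small = image[::factor, ::factor]  # downsample
    up = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    # Crop in case the original size is not a multiple of `factor`.
    return up[: image.shape[0], : image.shape[1]]
```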

#### *3.5.1 Evaluation*

The performance of the developed method (incorporating low-resolution datasets via data augmentation during training) has been evaluated using the VOT evaluation metrics on the VOT2018 and VOT2019 datasets. The overall evaluation results are shown in **Table 1**.

The term **# Tracking Failures (Lost)** indicates how often the tracking algorithm lost the target in the given video sequence; specifically:

*Robust Template Update Strategy for Efficient Visual Object Tracking DOI: http://dx.doi.org/10.5772/intechopen.101800*

• **Tracking Lost/Failure:** occurs when the IoU between the ground truth and predicted bounding boxes is zero. Thus, the lower the value ↓, the higher the performance.

In **Table 1**, we compare our approach against the state-of-the-art SiamMask tracker on the VOT2018 and VOT2019 benchmarks, respectively. It can be clearly observed that the data augmentation technique for incorporating low-resolution datasets has contributed to robustness improvements. The tracker's failures decreased from 60 to 53 and from 104 to 93 on VOT2018 and VOT2019, respectively. Improvements are especially clear in low-resolution video sequences, i.e., *handball1* and *handball2*, as depicted in **Figure 11**.

The results obtained in **Table 1** confirm that the developed methodology significantly improved the overall performance of the tracker. This approach outperforms the original SiamMask, achieving relative gains of 2.6% and 0.4% in EAO on VOT2018 and VOT2019, respectively. Most significantly, gains of around 3% and 5% in the Robustness value were achieved on VOT2018 and VOT2019, respectively.

#### *3.5.2 Results*

As depicted in **Figure 11**, the data augmentation for incorporating low-resolution datasets during training has contributed to enhancing the tracker's robustness. Thus, the tracker becomes robust against low-resolution frames during inference, relative to the original SiamMask tracker.


#### **Table 1.**

*Comparison between SiamMask and the developed method (incorporating low-resolution data during training), under the VOT metrics (EAO, accuracy, robustness) on VOT2018 (left) and VOT2019 (right); best results are marked in bold.*

#### **Figure 11.**

*Qualitative comparison between SiamMask and the developed data-augmentation technique for incorporating low-resolution datasets during training. Where blue: Ground truth, red: Tracking result.*

## **3.6 Experiment B: Motion-blur data-augmentation**

In this experiment, the data-augmentation technique for incorporating motion-blurred datasets described in 2.1.1 was applied to generate datasets with motion blur during the training of the refinement module of the network.
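Motion blur is commonly approximated by convolving the image with a 1-D line kernel along the motion direction. The sketch below assumes a uniform horizontal kernel; the exact kernel used in 2.1.1 may differ.

```python
import numpy as np

def horizontal_motion_blur(image, length=9):
    """Approximate horizontal motion blur: average each pixel with its
    `length` horizontal neighbours (a uniform 1-D line kernel)."""
    kernel = np.ones(length) / length
    # Convolve every row with the line kernel, keeping the image size.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, image
    )
```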

#### *3.6.1 Evaluation*

The performance of the tracking algorithm incorporating the motion-blur data augmentation technique has been evaluated using the VOT evaluation metrics on the VOT2018 and VOT2019 datasets. The overall evaluation results are shown in **Table 2**.

The data augmentation technique for incorporating motion-blurred datasets has contributed to the overall enhancement of the tracker's performance. There are clear improvements in terms of Robustness in multiple video sequences relative to SiamMask. From **Table 2**, it can be concluded that the technique has improved the Robustness of the tracker, especially in video sequences with motion blur, i.e., *ball3* and *car1*. The overall performance of the tracker has been improved, and the developed method obtained a significant relative gain of 2.1% EAO on VOT2018 and 4% R on VOT2019, compared to the SiamMask results, as depicted in **Table 2**.

#### *3.6.2 Results*

**Figure 12** presents a visual comparison between SiamMask and the developed improvement incorporating motion-blurred datasets during training using


#### **Table 2.**

*Comparison between SiamMask and the developed method (incorporating motion-blurred dataset during training), under the VOT metric (EAO, accuracy, robustness) on VOT2018 (left) and VOT2019 (right).*

#### **Figure 12.**

*Qualitative comparison between SiamMask and the developed data-augmentation technique: Incorporating motion-blurred datasets during training. Where blue: Ground truth, red: Tracking result.*

data augmentation. From **Figure 12** it can be clearly observed that the data augmentation for incorporating motion-blurred datasets during training has contributed to enhancing the tracker's Robustness. Thus, the tracker has become robust against motion-blurred video frames during inference, relative to the original SiamMask tracker.

## **3.7 Experiment C: Target template updating strategy**

When it comes to updating the target template, the question is how and when to update it. The parameter *Sth* controls when to update the target template according to the template update strategy developed in 2.2. Thus, the second (updatable) target template is updated when the predicted target's score is higher than the threshold *Sth*. Therefore, determining the optimal threshold value *Sth* is the main focus of this sub-experiment.

This set of experiments studies the effect of the target template updating strategy by varying the score threshold *Sth* and evaluating the tracking performance on the VOT2018 and VOT2019 datasets. From the experimental results shown in **Tables 3** and **4**, it can be observed that the performance of the tracker increases as the parameter *Sth* increases. Thus, by using an *Sth* value as high as possible, we guarantee an efficient template update strategy that avoids template updates during severe occlusion. **Figure 13** illustrates how each VOT metric (EAO, Accuracy, and Robustness) and the FPS behave as *Sth* is varied. Therefore, the parameter *Sth* plays an important role in deciding whether or not to update the target template when cases such as occlusion or deformation occur, as illustrated in **Figures 14** and **15**. It is worth mentioning that the template update has a negative impact on the tracker's speed, since the feature map of the updated template must be recomputed for every update. Therefore, by setting *Sth* high, we can leverage both performance and speed, as depicted in **Figure 13**.
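The update rule described above reduces to a gate on the predicted score. A minimal sketch with illustrative names, where the template is whatever representation the tracker keeps (e.g., an image patch):

```python
def maybe_update_template(current_template, predicted_patch, score, s_th=0.95):
    """Update the updatable template only when the tracker is confident:
    keep the old template whenever the score falls below the threshold S_th,
    e.g. during occlusion or severe deformation."""
    if score >= s_th:
        return predicted_patch, True   # Update: True
    return current_template, False     # Update: False
```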

**Figures 14** and **15** illustrate how the template update strategy decides when to update the updatable template. For instance, in **Figure 15a** the target is not occluded; as a result the score is high, and thus an *Update: True* flag is generated, indicating that the target template should be updated. On the other hand, in **Figure 15b** and **c**, the target is


#### **Table 3.**

*Determining the optimal score threshold (Sth) for updating the target template under VOT-metrics on VOT2018.*


#### **Table 4.**

*Determining the optimal score threshold (Sth) for updating the target template under VOT-metrics on VOT2019.*

#### **Figure 13.**

*An illustration of the effect on tracking performance of the template update strategy as the score threshold Sth is varied. NB: These experiments were carried out using the checkpoint that includes data augmentation.*

occluded by the tree; thus an *Update: False* flag is generated, indicating that the target template should not be updated during such occlusions. This experiment was carried out using *Sth* ≥ 0.95.

**Table 5** presents a comparison between no-update SiamMask and the developed template update strategy: it can be observed that relative gains of 0.7% and 2.0% in Robustness were achieved by incorporating the template update strategy. Thus, the tracker encountered fewer failures than no-update SiamMask, decreasing from 60 to 58 and from 104 to 100 on the VOT2018 and VOT2019 benchmarks, respectively. The robustness of the tracker is the crucial element for applications such as automatic robotic cameras, where there is no human assistance.


#### **Figure 14.**

*Visual illustration of how the target template update strategy decides whether or not to update the template based on the similarity score under different occlusion scenarios during the tracking process, VOT2019 girl dataset. Where blue: Ground truth, red: Tracking result.*

#### **Figure 15.**

*Visual illustration of how the target template update strategy decides whether or not to update the template based on the similarity score under different occlusion scenarios during the tracking process, VOT2019 girl dataset. Where blue: Ground truth, red: Tracking result.*


#### **Table 5.**

*Comparison between no-update SiamMask and incorporating target template update under VOT2018 (left) and VOT2019 (right) benchmarks.*

#### *3.7.1 Updating with a previous and a future template*

This experiment is dedicated to examining the strengths and weaknesses of the "updating with a previous and a future frame" template update strategy described in 2.2.1. As can be seen from **Table 6**, the method "updating with a previous and a future template" achieved relative gains of around 0.7% and 2.5% in the Robustness value w.r.t. SiamMask on the VOT2018 and VOT2019 benchmarks, respectively. This indicates that the "updating with previous and future template" strategy has enhanced the tracker's Robustness, which is most crucial in automated tracking applications. However, it cannot be used for real-time applications, as the processing speed is very slow, around 12 FPS on a laptop equipped with an NVIDIA GEFORCE GTX1060. The main


#### **Table 6.**

*Comparison between no-update SiamMask and incorporating target template updating with previous and future template on VOT2018 (left) and VOT2019 (right) benchmark.*

computational burden of the tracker lies in the target template feature extraction network. Thus, the tracking algorithm becomes very slow when the target template is updated with the previous and future templates, resulting in a poor FPS.

## **3.8 Experiment E: Comparison with state-of-the-art trackers**

This section compares our tracking framework, called VPU\_SiamM, with other state-of-the-art trackers: SiamRPN and SiamMask on VOT2018, and SiamRPN++ and SiamMask on VOT2019.

To take advantage of the incorporated improvements, a tracker named VPU\_SiamM has been developed. VPU\_SiamM was trained using the data augmentation technique incorporating both motion blur and low resolution, and during online inference the target template update strategy is applied.

We have tested our VPU\_SiamM tracker on the VOT2018 dataset in comparison with state-of-the-art methods. We compare with the top trackers SiamRPN (winner of the VOT2018 real-time challenge) and SiamMask, among the top performers in the VOT2019 challenge. Our tracker obtained a significant relative gain of 1.3% in EAO compared to the top-ranked trackers. Following the evaluation protocol of VOT2018, we adopt the Expected Average Overlap (EAO), Accuracy (A), and Robustness (R) to compare different trackers. The detailed comparisons are reported in **Table 7**; it can be observed that VPU\_SiamM achieved top performance on EAO and R. In particular, our VPU\_SiamM tracker outperforms SiamRPN (the VOT2018 real-time challenge winner), achieving relative gains of 1.3% in EAO, 1.6% in Accuracy, and 4% in Robustness. Besides, our tracker yields a relative gain of 4% in Robustness w.r.t.


#### **Table 7.**

*Comparison of our tracker VPU\_SiamM with the state-of-the-art trackers SiamRPN and SiamMask in terms of expected average overlap (EAO), accuracy, and robustness (failure rate) on the VOT2018 benchmark.*


#### **Table 8.**

*Comparison of our tracker VPU\_SiamM with the state-of-the-art trackers SiamRPN++ and SiamMask in terms of expected average overlap (EAO), accuracy, and robustness (failure rate) on the VOT2019 benchmark.*

both SiamMask and SiamRPN, which addresses the common vulnerability of Siamese network-based trackers.

Following the previous VOT evaluation, we have evaluated our VPU\_SiamM tracker on the VOT2019 dataset, which contains 60 challenging testing sequences. As shown in **Table 8**, our VPU\_SiamM also achieves the best tracking results on VOT2019 in the EAO and Accuracy metrics compared to the state-of-the-art trackers SiamMask and SiamRPN++. More specifically, our approach improves the EAO by around 1%.

**Submission to VOT-ST 2020 Challenge:** Our method (VPU\_SiamM) was submitted to the VOT-ST 2020 challenge [24], and it ranked 16*th* out of 38 competing tracking methods according to the Expected Average Overlap (EAO) metric [24].

## **4. Conclusions**

In this chapter, one of the state-of-the-art tracking algorithms based on Siamese networks, called SiamMask, has been used as a backbone, and two improvements have been incorporated, each addressing a different aspect of the tracking task.

The developed data augmentation technique for incorporating low resolution and motion blur has been evaluated separately and jointly, achieving state-of-the-art results on the VOT2018 and VOT2019 benchmarks. From the evaluation results, it is clear that the data augmentation technique has played an essential role in improving the overall performance of the tracking algorithm, outperforming the SiamMask results on both benchmarks. Among the three data augmentation configurations, the one incorporating both motion blur and low resolution outperforms the rest in terms of EAO on the VOT2018 and VOT2019 benchmarks. Nevertheless, the data augmentation incorporating only motion blur achieved top performance according to the Accuracy metric on both benchmarks. However, Accuracy is less significant, as it only considers the IoU during successful tracking. According to the VOT ranking method, the EAO value is used to rank tracking methods; therefore, the data augmentation technique incorporating both motion blur and low resolution ranks highest among the others. This indicates that the data augmentation technique has contributed to the improvement of the overall tracker performance.

Comparable results on the VOT2018 and VOT2019 benchmarks confirm that the robust target template update strategy, which utilizes both the initial ground truth template and a supplementary updatable template and avoids template updates during severe occlusion, can significantly improve the tracker's performance with respect to the SiamMask results while running at 41 FPS.

A tracker named VPU\_SiamM was trained based on the presented approach, and it was ranked 16*th* out of 38 submitted tracking methods in the VOT-ST 2020 challenge [24].

## **Acknowledgements**

This work has been partially supported by the Spanish Government through its TEC2017-88169-R MobiNetVideo project.

## **Author details**

Awet Haileslassie Gebrehiwot<sup>1</sup> \*, Jesus Bescos<sup>2</sup> and Alvaro Garcia-Martin<sup>2</sup>

1 Czech Technical University in Prague, Prague, Czech Republic

2 Universidad Autonoma de Madrid, Madrid, Spain

\*Address all correspondence to: awethaileslassie21@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **References**

[1] Tian B, Yao Q, Gu Y, Wang K, Li Y. Video processing techniques for traffic flow monitoring: A survey. In: 14th International IEEE Conference on Intelligent Transportation Systems, ITSC 2011, Washington, DC, USA, October 5-7, 2011. IEEE; 2011. pp. 1103-1108

[2] Zeng M, Guo G, Tang Q. Vehicle human-machine interaction interface evaluation method based on eye movement and finger tracking technology. In: HCI International 2019 - Late Breaking Papers - 21st HCI International Conference, HCII 2019, Orlando, FL, USA, July 26-31, 2019, Proceedings (C. Stephanidis, ed.), vol. 11786 of Lecture Notes in Computer Science. Springer; 2019. pp. 101-115

[3] Brandes S, Mokhtari Z, Essig F, Hünniger K, Kurzai O, Figge MT. Automated segmentation and tracking of non-rigid objects in time-lapse microscopy videos of polymorphonuclear neutrophils. Medical Image Anal. 2015;**20**(1):34-51

[4] Nägeli T, Alonso-Mora J, Domahidi A, Rus D, Hilliges O. Real-time motion planning for aerial videography with dynamic obstacle avoidance and viewpoint optimization. IEEE Robotics Autom. Lett. 2017;**2**(3):1696-1703

[5] Esterle L, Lewis PR, McBride R, Yao X. The future of camera networks: Staying smart in a chaotic world. In: Arias-Estrada MO, Micheloni C, Aghajan HK, Camps OI, Brea VM, editors. Proceedings of the 11th International Conference on Distributed Smart Cameras, Stanford, CA, USA, September 5-7, 2017. ACM; 2017. pp. 163-168

[6] Chen YF, Everett M, Liu M, How JP. Socially aware motion planning with

deep reinforcement learning. CoRR. 2017;**abs/1703.08862**

[7] Aggarwal JK, Xia L. Human activity recognition from 3d data: A review. Pattern Recognition Letters. 2014;**48**: 70-80

[8] Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PHS. Fully-convolutional siamese networks for object tracking. In: Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II (G. Hua and H. Jégou, eds.), vol. 9914 of Lecture Notes in Computer Science. 2016. pp. 850-865

[9] Matej Kristan EA, Matas J. The visual object tracking VOT2017 challenge results. In: 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society; 2017. pp. 1949-1972

[10] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015;**115**(3):211-252

[11] Wang Q, Zhang L, Bertinetto L, Hu W, Torr PHS. Fast online object tracking and segmentation: A unifying approach. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE; 2019. pp. 1328-1338

[12] Zhou J, Wang P and Sun H. Discriminative and Robust Online Learning for Siamese Visual Tracking. 2019

[13] Zhang Z, Peng H, Wang Q. Deeper and wider siamese networks for realtime visual tracking. CoRR. 2019;**abs/ 1901.01660**

[14] Li B, Wu W, Wang Q, Zhang F, Xing J, Yan J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In: arXiv preprint arXiv: 1812.11703. 2018

[15] Wang G, Luo C, Xiong Z, Zeng W. Spm-tracker: Series-parallel matching for real-time visual object tracking. CoRR. 2019;**abs/1904.04452**

[16] Matej Kristan EA. The seventh visual object tracking VOT2019 challenge results. In: 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019. IEEE; 2019. pp. 2206-2241

[17] Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J. Big Data. 2019;**6**:60

[18] Wang Z, Chang S, Yang Y, Liu D, Huang TS. Studying very low resolution recognition using deep networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society; 2016. pp. 4792-4800

[19] Wang Q, Teng Z, Xing J, Gao J, Hu W, Maybank SJ. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society; 2018. pp. 4854-4863

[20] Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W. Distractor-aware siamese networks for visual object tracking. In: Computer Vision - ECCV 2018 - 15th European Conference, Munich,

Germany, September 8-14, 2018, Proceedings, Part IX (V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, eds.), vol. 11213 of Lecture Notes in Computer Science. Springer; 2018. pp. 103-119

[21] Li B, Yan J, Wu W, Zhu Z, Hu X. High performance visual tracking with siamese region proposal network. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society; 2018. pp. 8971-8980

[22] Zhang L, Gonzalez-Garcia A, van de Weijer J, Danelljan M, Khan FS. Learning the model update for siamese trackers. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE; 2019. pp. 4009-4018

[23] Kristan EAM. The visual object tracking vot2013 challenge results. In: 2013 IEEE International Conference on Computer Vision Workshops. 2013. pp. 98-111

[24] Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Kämäräinen J-K, et al. "The eighth visual object tracking vot2020 challenge results," in Computer Vision – ECCV 2020 Workshops (A. Bartoli and A. Fusiello, eds.), (Cham), Springer: International Publishing; 2020. pp. 547-601

## **Chapter 4**

## Cognitive Visual Tracking of Hand Gestures in Real-Time RGB Videos

*Richa Golash and Yogendra Kumar Jain*

## **Abstract**

Real-time visual hand tracking is quite different from the tracking of commonly tracked objects in RGB videos, because the hand is a biological object and hence undergoes both physical and behavioral variations during its movement. Furthermore, the hand occupies a very small area in the image frame, and due to its erratic pattern of movement, the quality of images in the video is considerably affected if recorded from a simple RGB camera. In this chapter, we propose a hybrid framework to track hand movement in RGB video sequences. The framework integrates the unique features of the Faster Region-based Convolutional Neural Network (Faster R-CNN), built on a Residual Network, and the Scale-Invariant Feature Transform (SIFT) algorithm. This combination is enriched with the discriminative learning power of deep neural networks and the fast detection capability of hand-crafted SIFT features. Thus, our method adapts online to the variations occurring in real-time hand movement and exhibits high efficiency in cognitive recognition of the hand trajectory. The empirical results shown in the chapter demonstrate that the approach can withstand the intrinsic as well as extrinsic challenges associated with visual tracking of hand gestures in RGB videos.

**Keywords:** hand tracking, faster R-CNN, hand detection, feature extraction, scale invariant, artificial neural network

## **1. Introduction**

Hand gestures play very significant roles in our day-to-day communication, and often they convey more than words. As technology and information grow rapidly in every sector of our lives, interaction with machines has become an unavoidable part of life. Thus, a deep urge for natural interaction with machines is growing all around [1, 2]. One of the biggest accomplishments in the domain of Hand Gesture Recognition (HGR) is Sign Language Recognition (SLR), where machines interpret the static hand posture of a human standing in front of a camera [3]. Recently, the implementation of an HGR-based automotive interface in BMW cars has been much appreciated; here, five gestures are used for contactless control of music volume and incoming calls while driving [4]. Project Soli is an ongoing project of Google's Advanced Technology and Projects (ATAP) group; in this project a miniature radar is developed that understands the real-time motion of the human hand at various scales [5].

Hand gestures are very versatile, as they comprise static as well as dynamic characteristics, and physical as well as behavioral characteristics; for example, the hand can move in any direction, and the fingers can bend to many angles. The hand skeleton has a complex structure with very high degrees of freedom, and thus its two-dimensional RGB data sequence has unpredictable variations. Visual recognition of dynamic hand gestures is complex because the complete process requires the determination of the hand posture along with a cognitive estimation of the trajectory of motion of that posture [3, 6–9]. Due to these intricacies, to date vision-based HGR applications are dominated by static hand gesture recognition.

## **2. Challenges in online tracking hand motion**

In the context of computer vision and pattern recognition, the human hand is described as a biological target with a complex structure. An uneven surface, broken contours, and an erratic pattern of movement are some of the natural characteristics that complicate Dynamic Hand Gesture Recognition (DHGR) [10]. Thus, in comparison to other commonly tracked moving objects, the hand is a non-rigid, subtle object that covers a very small area in the image frame. The scientific challenges that accompany online tracking of the hand region in an unconstrained environment, in RGB images captured using a simple camera, are categorized as follows [3, 4, 6–11]:

i. Intrinsic Challenges: Intrinsic challenges are related to the physical and behavioral nature of the target, i.e., the hand. They include the following features:

Hand Appearance: Owing to the number of joints in the hand skeleton, the appearance of the same hand posture has large variation, known as shape deformation. Different postures differ widely in the area they occupy in an image frame, and some postures cover only 10% of the frame, which is a very small target size in computer vision. In a real-time unconstrained environment, the two-dimensional (2-D) posture shows large variation during movement.

Manner of Movement: There is large diversity among human beings in performing a gesture of the same meaning, in terms of speed and path of movement. The moving pattern of the hand is erratic and irregular and produces blur in the image sequence. Furthermore, the two-dimensional data sequence of a moving hand is greatly affected by background conditions; thus, tracking and interpretation of dynamic hand gestures is a challenging task in the HGR domain. The unpredictable variation in target trajectory makes the detection and classification process complex in pattern recognition.

ii. Extrinsic Challenges: These challenges mainly arise due to the environment in which the hand movement is captured. Some of the major factors that deeply impact the real-time visual tracking of the dynamic hand gestures are as follows:

Background: In real-time HGR applications, backgrounds are unconstrained; we cannot use fixed background models to differentiate between the foreground and the background. Thus, the core challenge in the design of a real-time hand tracking system is the estimation of discriminative features between the background and the target hand posture.

Illumination: Illumination conditions in real-time applications are uneven and unstable. Thus, the 2-D (two-dimensional) projection of the 3-D (three-dimensional) hand movement produces a loss of information in RGB images. This loss is a major reason for errors in the visual tracking of hand movement.

Presence of other skin color objects in the surroundings: The presence of objects with similar RGB values, such as the face, neck, arm, etc., is a serious cause of track loss in RGB-based visual tracking techniques.

## **3. Components of DHGR**

There are four main components in cognitive recognition of dynamic hand gestures [3, 10–12].


In Dynamic Hand Gesture Recognition (DHGR), the acquisition of signals plays a very important role in deciding the technique used to recognize and deduce the hand pattern into meaningful information. Contact-based sensors and contactless sensors are the two main types of sensors used to acquire hand movement signals. Contact-based sensors are attached to the body parts of a user; for example, data gloves are worn on the hands, accelerometers are attached to the arm region, and egocentric sensors are put on the head to record hand movement. Wearable sensor devices are equipped with inertial, magnetic, mechanical, ultrasonic, or barometric sensors [7]. Andrea Bandini et al. [13], in their survey, presented many advantages of egocentric vision-based techniques, as they can acquire hand signals very closely. Although contact-based techniques require fewer computations, wearing these devices gives uneasiness to the subject. Due to the electrical and magnetic emission of signals, they are likely to produce hazardous effects on the human body.

Contactless sensors, or vision-based sensor technology, are becoming an encouraging technology for developing natural human-machine interfaces [1–4, 14]. These devices consist of visual sensors, with a single camera or a group of cameras situated at a distance from the user to record the hand movement. In vision-based methods, the acquired data is of image type; the user does not have to wear any devices and can move the hand naturally in an unconstrained pattern. The important assets of vision-based techniques are large flexibility for users, low hardware requirements, and no health issues. These methods have the potential to support any natural interface for remote human-machine interaction, which can ease the lives of physically challenged or elderly people with impaired mobility [2, 9, 15].

In vision-based methods, the information consists of two-dimensional, three-dimensional, or multiview images. Two-dimensional images are RGB images with only intensity information about the object, captured using simple cameras. Three-dimensional images are captured with advanced sensor cameras such as Kinect, Leap Motion, Time-of-Flight, etc.; these cameras collect RGB along with depth information about the object in the scene. The third and most popular choice in HGR is multiview images; here, two or more cameras are placed at different angles to capture the hand movement from many views [3, 6, 8].

Wang J. et al. [16] used two calibrated cameras to record hand gestures under stable lighting conditions. They initially segmented the hand region using the YCbCr color space and then applied the SIFT algorithm for feature extraction. They then tracked the hand using a Kalman filter. However, due to the hand's similarity to other objects, the authors imposed position constraints to avoid track loss.

Poon G. et al. [17] also supported multiple-camera setups that can observe the hand region from diversified angles to minimize errors due to self-occlusion. They proposed a three-camera setup to recognize bimanual gestures in HGR. Similarly, Bautista A.G. et al. [18] used three cameras in their system to avoid complex background and illumination. Marin G. et al. [19] suggested combining Kinect data with Leap Motion camera data to exploit the complementary characteristics of both cameras. Kainz O. et al. [20] combined Leap Motion sensor signals and surface electromyography signals to propose a hand tracking scheme.

Andreas Aristidou discussed that the high complexity of hand structure and movement makes the animation of a hand model a challenge. They preferred a marker-based optical motion capture system to acquire the orientation of the hand [21]. With the same opinion, Lizy Abraham et al. [22] placed infrared LEDs on the hand to improve the consistency of tracking accuracy. According to the study conducted by Mais Yasen et al. [9], surface electromyography (sEMG) as a wearable sensor and the Artificial Neural Network (ANN) as a classifier are the most preferred choices in hand gesture recognition.

An important factor in HGR is that the information obtained using a monocular camera is not sufficient to extract the moving hand region. The loss of information in RGB images is greatest under unpredictable backgrounds, self-occlusion, illumination variation, and erratic patterns of hand movement [8, 10, 14].

The second component in the design of DHGR is the description of the region of interest, or "target modeling." In this step, features that are repetitive, unique, and invariant to general variations (e.g., illumination, rotation, and translation of the hand region) are collected. These features model the target of tracking and are responsible for detecting and localizing the target in all frames of a video. This step is very significant because it helps to detect the target in an unconstrained environment [10, 12].

Li X. et al. [12] presented a very detailed study of the building blocks of visual object tracking and the associated challenges. They stated that effective modeling of the appearance of the target is the core issue for the success of a visual tracker. In practice, effective modeling is greatly affected by many factors, such as target speed, illumination conditions, state of occlusion, complexity in shape, and camera stability. Skin color is the most straightforward characteristic of the hand used in the HGR domain to identify the hand region in the scene. Huang H. et al. simply detected skin color for contour extraction and then classified the contours using VGGNet [23]. M. H. Yao et al. [24] extracted 500 particles using the CAMShift algorithm for tracking the moving hand region. In this case, the real-time performance of the HGR system decreases when a similarly colored object (face or arm region) interferes, and as the number of particles increases, the complexity of the system increases. The HGR technique proposed by Khaled H. et al. [25] emphasized the use of both shape and skin color features for hand-area detection because of background conditions, shadows, and visual overlapping of objects. They stated that noise added due to camera movement is one of the major problems in real-time hand tracking. Liu P. et al. [26] proposed a single-shot multibox detector ConvNet architecture, similar to Faster R-CNN, to detect hand gestures in a complex environment. Bao P. et al. [27] expressed that, since the size of a hand posture is very small, misleading behavior or the overfitting problem becomes prominent in regular CNNs.

In the method discussed in [10], we have shown that although local representation of the hand is a comparatively more robust approach for detecting the hand region, it often suffers from background disturbance in real-time tracking. In general, hand-crafted features result in large computations, and loss of the visual trajectory while tracking real-time hand movement is very common. Hence, it is difficult for hand-crafted features to perfectly describe all variations in the target as well as the background [10, 12]. According to Shin J. et al. [28], trackers that visually trace an object based on appearance and position must have a high tolerance to changes in appearance and position. Tran D. et al. [29] initially detected the palm region from depth data collected by the Kinect V2 skeletal tracker, followed by morphological processing. They determined the hand contour using a border-tracing algorithm on a binary image converted using a fixed threshold. After detecting the fingertip by the K-cosine algorithm, the hand posture is classified using a 3DCNN.

Matching of the hand gesture trajectory is another important phase in the cognitive recognition of DHGR. The main constraints in generating a similarity index in HGR are the speed of hand motion and the path of movement. Both factors are highly dependent on the user's mood and the surrounding conditions at the instant of movement. Similarity matching based on distance metrics generally fails to track efficiently, as hand gestures of the same meaning do not always follow the same path.

Dan Zhao et al. [30] used a hand-shape Fisher vector to find the movement of the finger and then classified it with a linear SVM. Plouffe et al. [31] proposed Dynamic Time Warping (DTW) to match the similarity between target and trained gestures. In [32], a two-level speed-normalization procedure is proposed using DTW and Euclidean-distance-based techniques. In this method, for each test gesture, the 10 best-trained gestures are selected using the DTW algorithm, and out of these 10, the most accurate gesture is selected by calculating the Euclidean distance. Pablo B. et al. [33] suggested a combination of the Hidden Markov Model (HMM) and DTW in the prediction stage.

## **4. Proposed methodology**

The proposed system is designed around a web camera, which is a simple RGB camera. The use of an RGB camera is limited in the field of hand gesture tracking because of the various difficulties discussed above (**Figure 1**).

The proposed system is divided into three modules:

#### **4.1 Module I**

This module is also known as the "hand detection module." Here, the posture of the hand used by the user in real-time hand movement events is detected. When the user moves his hand in front of the web camera attached to a machine, the system acquires a video of 5–6 seconds at a rate of 15 frames per second. This video comprises a raw data sequence of 100–150 frames; it is saved in a temporary folder, and all frames are resized to [240, 240]. In this module, detection of an online Active Hand Template (AHT) is made using a Faster Region-based Convolutional Neural Network (Faster R-CNN).

**Figure 1.** *Architecture of the proposed system.*
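The buffering and resizing step described above can be sketched as follows. This is a minimal NumPy stand-in: a real pipeline would read webcam frames (e.g., via OpenCV) and use a proper resizer, and the helper names `resize_nearest` and `buffer_sequence` are illustrative, not from the chapter.

```python
import numpy as np

def resize_nearest(frame: np.ndarray, size=(240, 240)) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C frame (stand-in for a real resizer)."""
    h, w = frame.shape[:2]
    rows = (np.arange(size[0]) * h // size[0]).clip(0, h - 1)
    cols = (np.arange(size[1]) * w // size[1]).clip(0, w - 1)
    return frame[rows][:, cols]

def buffer_sequence(frames, size=(240, 240)):
    """Resize every raw frame of the captured clip before hand detection."""
    return np.stack([resize_nearest(f, size) for f in frames])

# Example: 90 raw frames of 480 x 640 RGB (about 6 s at 15 fps)
raw = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(90)]
seq = buffer_sequence(raw)   # shape (90, 240, 240, 3)
```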

#### *4.1.1 Faster R-CNN*

We have proposed the design of an online hand detection scheme (AHT) using a Faster Region-based Convolutional Neural Network (Faster R-CNN) [34] built on the Residual Network ResNet-101 [35], a deep neural architecture. Three major issues encountered in online tracking of hand motion captured using simple cameras are as follows:


Thus, the essential requisite of any technique is to cope with the abovementioned factors. In the proposed method, these issues are solved by using Faster R-CNN, a Deep Neural Network (DNN) architecture. Deep learning algorithms (DLAs) are models that let a machine learn and execute a task as human beings do. Deep networks learn features directly from raw data by exploiting local information of the target, with no manual extraction or elimination of background. The Convolutional Neural Network (ConvNet) is a powerful tool in the computer vision field that mainly deals with images.

*Cognitive Visual Tracking of Hand Gestures in Real-Time RGB Videos. DOI: http://dx.doi.org/10.5772/intechopen.103170*

Ren S. et al. [34] modified Fast R-CNN into the Faster Region-based Convolutional Neural Network (Faster R-CNN). They added a Region Proposal Network (RPN), a separate CNN that simultaneously estimates an objectness score and regresses the boundaries of the object using the anchor-box concept.

The architecture of the proposed Faster R-CNN developed on ResNet 101 is shown in **Figure 2**. Region Proposal Network (RPN) is an independent small-sized ConvNet, designed to directly generate region proposals from an image of any size without using a fixed edge box algorithm. The process of RPN is shown in **Figure 3**; here region proposals are generated from the activation feature map of the last shared

**Figure 2.**

*The architecture of the proposed faster R-CNN.*

**Figure 3.** *Process in RPN.*

convolutional layer shared between the RPN network and Fast R-CNN. It is implemented with an *m* × *m* convolutional layer followed by two sibling layers, a box-regression layer and a box-classification layer, each of size 1 × 1. At each sliding grid position, multiple regions are proposed depending upon the number of anchor boxes (*Q*). Each predicted region is described by a score and a four-tuple (*x*, *y*, *L*, *B*), where (*x*, *y*) are the coordinates of the top-left corner of the bounding box and *L* and *B* are its length and breadth. If *M* × *N* is the size of the feature map and *Q* is the number of anchors, then the total number of anchors created will be *M* × *N* × *Q*.

Anchor boxes are bounding boxes with predefined height and width, used to capture the scale and aspect ratio of the target object; together they form a pyramid of anchors. The anchor-based method is translation invariant and detects objects of multiple scales and aspect ratios. For every tiled anchor box, the RPN predicts the probability of object versus background and the intersection-over-union (IoU) value. The advantage of using anchor boxes in a sliding-window-based detector is that the object in a region is detected, encoded, and classified in a single process [34].
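As a rough illustration of the *M* × *N* × *Q* anchor tiling described above, the following sketch generates *Q* = 9 anchors per feature-map cell. The scales, aspect ratios, and stride of 16 are illustrative assumptions for a 240 × 240 input, not values stated in the chapter.

```python
import numpy as np

def tile_anchors(feat_h, feat_w, scales, ratios, stride=16):
    """Generate Q = len(scales) * len(ratios) anchors at each of the M * N cells."""
    anchors = []
    for cy in range(feat_h):
        for cx in range(feat_w):
            # centre of this feature-map cell in input-image coordinates
            px, py = (cx + 0.5) * stride, (cy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    # (x, y, L, B): top-left corner plus length and breadth
                    anchors.append((px - w / 2, py - h / 2, w, h))
    return np.array(anchors)

# M = N = 15 (240 / 16), Q = 3 scales * 3 ratios = 9 -> 15 * 15 * 9 = 2025 anchors
A = tile_anchors(15, 15, scales=[32, 64, 128], ratios=[0.5, 1.0, 2.0])
```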

The design of the proposed Faster R-CNN technique is accomplished on the Residual Network ResNet-101. The ResNet architecture was proposed in 2015 by Kaiming He et al. [35] to ease the learning process in deeper networks. They exhibited that a ResNet eight times deeper than VGG16 still has lower training complexity on the ImageNet dataset. The proposed use of ResNet-101 in the design of the Faster R-CNN solves complex problems in object classification by using a large number of hidden layers without increasing the training error. Furthermore, the network does not suffer from vanishing or exploding gradients because of its "skip connection" approach.

#### **4.2 Module II**

This module handles the feature extraction process of the AHT that helps in the continuous localization of the moving hand region. Our method uses a hybrid framework that combines the Scale Invariant Feature Transform (SIFT) and Faster R-CNN. A hybrid framework is selected because, in real-time movement, the geometrical shape of a posture changes many times, so it is difficult to detect the moving hand region with hand-crafted features (i.e., SIFT) alone. Whenever the posture changes beyond the threshold (number of matched features ≤ 3), the AHT is re-determined using Faster R-CNN, and the previous AHT is updated with the new AHT. During this process, a bounding box is also constructed around the centroid of the hand movement to determine the current two-dimensional area covered by the hand region.
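The update logic of this module can be summarised in the following sketch. Here `detect_aht` (the Faster R-CNN detector) and `match_sift` (a SIFT matcher returning matched keypoint locations) are stand-ins for the components described in the text, not real implementations.

```python
def hybrid_track(frames, detect_aht, match_sift, min_matches=3):
    """Sketch of the Module II loop: re-detect the Active Hand Template (AHT)
    with Faster R-CNN whenever the number of matched SIFT features drops
    to min_matches or below; otherwise keep tracking with SIFT alone."""
    aht = detect_aht(frames[0])           # initial template from Module I
    track = []
    for frame in frames[1:]:
        matches = match_sift(aht, frame)  # matched keypoint (x, y) locations
        if len(matches) <= min_matches:   # posture changed beyond threshold
            aht = detect_aht(frame)       # update AHT via Faster R-CNN
            matches = match_sift(aht, frame)
        # centroid of the matched locations approximates the hand position
        xs = [m[0] for m in matches]
        ys = [m[1] for m in matches]
        track.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return track
```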

#### *4.2.1 Scale invariant feature transform (SIFT)*

For motion modeling, we have used the SIFT algorithm designed by David Lowe [36] for local feature extraction of the AHT. Compared with global features such as color, contour, and texture, local features have high distinctiveness and better detection accuracy under local image distortions, viewpoint change, and partial occlusion. Therefore, SIFT detects the object in a cluttered background without any segmentation or preprocessing algorithms [36, 37]. The combination of SIFT and Faster R-CNN is helpful in real-time fast tracking of the non-rigid, subtle hand object.

The SIFT algorithm comprises a feature detector as well as a feature descriptor. In general, features are high-contrast areas of an image, for example a point, an edge, or a small image patch. These features are extracted such that they remain detectable under noise, scale variation, and changes in illumination. Each SIFT feature is defined by the parameters *f<sub>i</sub>* = (*p<sub>i</sub>*, *σ<sub>i</sub>*, *φ<sub>i</sub>*), where *p<sub>i</sub>* = (*x<sub>i</sub>*, *y<sub>i</sub>*) is the 2D position of the SIFT keypoint, *σ<sub>i</sub>* is the scale, and *φ<sub>i</sub>* is the gradient orientation within the region. Each keypoint *i* is described by a 128-dimensional descriptor *d<sub>i</sub>* [36].

In our approach, we find the SIFT features of the AHT template obtained in Module I, since it contains only the target hand posture and is small compared with the [240, 240] image frame. This approach therefore saves the time spent matching unnecessary features and pruning them further [20, 21].

Let there be *m* key features in the AHT frame, given as *S<sub>AHT</sub>* = {*f<sub>i</sub>*}<sub>*m*</sub>, where *f<sub>i</sub>* is the feature vector at the *i*th location. Let *S<sub>cur</sub>* = {*f<sub>j</sub>*}<sub>*k*</sub> be the *k* SIFT features in the current frame, where *f<sub>j</sub>* is the SIFT feature at the *j*th location. We use the best-bin-first search method, which identifies the nearest neighbors of AHT features among the current-frame features. The process of SIFT target recognition and localization in the subsequent frames of a video is accomplished in three steps.

Initially, we find the first nearest neighbors (FNNs) of all the SIFT features in the AHT among the SIFT features in the current frame. The FNNs are defined as the pairs of keypoints in two different frames with the minimum sum of squared differences between their descriptor vectors:

$$\text{FNND}(a\_{AHT}, b\_{cur}) = \sqrt{\sum\_{i=1}^{128} (a\_i - b\_i)^2}\tag{1}$$

where *aAHT* and *bcur* are descriptor vectors of features in AHT and current frame, respectively.

In the second step, matching is improved by performing Lowe's Second Nearest Neighbor (SNN) test using Eq. (2).

$$\frac{distance(a\_{AHT}, b\_{cur})}{distance\ (a\_{AHT}, c\_{cur})} > 0.8 \tag{2}$$

The SNN test calculates the ratio between the distance from feature *a<sub>AHT</sub>* to its first nearest neighbor *b<sub>cur</sub>* and to its second nearest neighbor *c<sub>cur</sub>* in the current frame; matches whose ratio exceeds 0.8 are considered ambiguous and discarded.
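Steps one and two can be sketched in NumPy as below. The descriptor dimensionality is kept generic, and a brute-force search stands in for the best-bin-first k-d tree search that a real implementation would use.

```python
import numpy as np

def fnnd(a, b):
    """Eq. (1): Euclidean distance between two SIFT descriptor vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def match_snn(S_aht, S_cur, ratio=0.8):
    """Match AHT descriptors against current-frame descriptors, discarding
    ambiguous matches whose first-to-second nearest-neighbour distance
    ratio exceeds `ratio` (Lowe's SNN test)."""
    matches = []
    for i, a in enumerate(S_aht):
        d = np.sqrt(((S_cur - a) ** 2).sum(axis=1))  # FNND to every candidate
        order = np.argsort(d)
        first, second = int(order[0]), int(order[1])
        if d[first] / d[second] <= ratio:
            matches.append((i, first))
    return matches
```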

Further, to find geometrically consistent points, we apply the geometric verification test (Eq. (3)) to the keypoints obtained after the SNN test.

$$
\begin{bmatrix} x^\* \\ y^\* \end{bmatrix} = v\,R(\alpha) \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} T\_x \\ T\_y \end{bmatrix} \tag{3}
$$

Here *v* is the isotropic scaling factor, *α* is the rotation parameter, and (*T<sub>x</sub>*, *T<sub>y</sub>*) is the translation vector for the *i*th SIFT keypoint located at (*x*, *y*).
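Assuming (*v*, *α*, *T*) are known (in practice they would be estimated from the match set, e.g., by least squares), Eq. (3) and a consistency check over the matched pairs can be sketched as:

```python
import numpy as np

def similarity_transform(pts, v, alpha, t):
    """Eq. (3): map keypoints by isotropic scale v, rotation alpha, translation t."""
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    return (v * (R @ pts.T)).T + t

def consistent(p_aht, p_cur, v, alpha, t, tol=2.0):
    """A match set is geometrically consistent when every AHT keypoint lands
    close to its matched current-frame keypoint under one shared transform."""
    pred = similarity_transform(p_aht, v, alpha, t)
    return bool(np.all(np.linalg.norm(pred - p_cur, axis=1) < tol))
```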

#### **4.3 Module III**

This module deals with the cognitive recognition of the trajectory. Here, cognitive recognition means the vision-based intellectual development of a machine for the interpretation of hand movement. Hand movements do not have a fixed pattern; by nature, movement patterns are erratic. Due to this characteristic, static hand gesture recognition has so far been preferred over dynamic hand gesture recognition. We have determined the centroids of the hand location in the tracked frames. To derive the meaning of the hand movement, we have used a modified back-propagation Artificial Neural Network (m-BP-ANN) to match the test trajectory against the trained database. This cognitive stage is very significant for DHGR because the way we collect and transform the centroid of hand movement *C<sub>HM</sub>* in every frame of a particular data sequence helps to classify the hand gesture. In the proposed system, we have kept this stage simple but efficient, because complex algorithms increase the error rate and the time of interpretation.

We have made use of the concept of the quadrant system of the Cartesian plane to transform the image frame into a 2-D plane. The two-dimensional Cartesian system divides the plane of the frame into four equal regions called quadrants. Each quadrant is bounded by two half-axes, with the center in the middle of the frame. The translation of the image-frame axes to Cartesian axes is done using Eqs. (4) and (5):

$$x\_c = (C\_{HMx} - I\_x)/n\_x \tag{4}$$

$$y\_c = (C\_{HMy} - I\_y)/n\_y \tag{5}$$

Here *I<sub>x</sub>*, *I<sub>y</sub>* are the dimensions of the image frame [240, 240], and *n<sub>x</sub>*, *n<sub>y</sub>* = [12, 12] are normalization factors for the X and Y axes. To convert the hand trajectory into a meaningful command, we have applied the modified back-propagation Artificial Neural Network (m-BP-ANN) using the start and end locations of the hand gesture.
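Eqs. (4) and (5) can be implemented directly as below. Note that with the printed offsets (*I<sub>x</sub>*, *I<sub>y</sub>*) the origin sits at the frame corner, whereas the text describes a plane centred mid-frame (which would use *I*/2); the code follows the equations as printed, and the quadrant helper is an illustrative assumption.

```python
def to_cartesian(c_hm, I=(240, 240), n=(12, 12)):
    """Eqs. (4)-(5) as printed: x_c = (C_HMx - I_x)/n_x, y_c = (C_HMy - I_y)/n_y.
    (A plane centred mid-frame, as the text describes, would use I/2 instead.)"""
    return (c_hm[0] - I[0]) / n[0], (c_hm[1] - I[1]) / n[1]

def quadrant(x_c, y_c):
    """Quadrant of the Cartesian plane containing the translated centroid."""
    if x_c >= 0:
        return 1 if y_c >= 0 else 4
    return 2 if y_c >= 0 else 3
```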

Back-propagation (BP) is a supervised training procedure for feed-forward neural networks. It works by minimizing the cost function of the network using the delta rule, or gradient descent; the set of weights for which the cost function reaches its minimum is the solution to the given learning problem. The error function *E<sub>f</sub>* is defined as the mean square sum of the difference between the actual output value of the network (*a<sub>j</sub>*) and the desired target value (*t<sub>j</sub>*) for the *j*th neuron. *E<sub>f</sub>*, calculated over the *N<sub>L</sub>* output neurons of the *L*th layer and all *P* training patterns, is given by Eq. (6):

$$E\_f = \frac{1}{2} \sum\_{p=1}^{P} \sum\_{j=1}^{N\_L} \left( t\_j - a\_j \right)^2 \tag{6}$$

The minimization of the error function is carried out using gradient descent, or the delta rule. It determines the amount of weight update based on the gradient direction along with a step size, as given by Eq. (7):

$$\frac{\partial C(t+1)}{\partial \delta\_{ij}(t)} = \frac{\partial C(t+1)}{\partial w\_{ij}(t+1)} \times \frac{\partial w\_{ij}(t+1)}{\partial \delta\_{ij}(t)} \tag{7}$$

In traditional BP, the optimization of the multidimensional cost function is difficult because the step size is fixed, and the performance parameters are highly dependent on the learning rate *η*. Hence, to overcome the problems of fixed step size and slow learning, we use an adaptive learning rate and a momentum term to modify BP. The updated weight value at any node is given by Eq. (8):

$$
\Delta w\_{ij}(t) = \eta\, \delta\_j a\_i + m\, \Delta w\_{ij}(t-1) \tag{8}
$$


**Figure 4.**

*Architecture of the proposed ANN model.*

The momentum term (0 < *m* < 1) updates the weight using its previous change. The adaptive learning rate helps to learn the characteristics of the cost function: if the error function is decreasing, the learning rate is increased, and vice versa [38].
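Eqs. (6) and (8), together with the adaptive learning-rate rule, can be sketched as follows; the growth/shrink factors of 1.05 and 0.7 are illustrative assumptions, as the chapter does not state them.

```python
import numpy as np

def error_fn(t, a):
    """Eq. (6): mean-square-sum error over the output neurons (single pattern)."""
    return 0.5 * np.sum((t - a) ** 2)

def mbp_update(w, delta_j, a_i, dw_prev, lr, m=0.9):
    """Eq. (8): weight change with momentum, dw(t) = lr * delta_j * a_i + m * dw(t-1)."""
    dw = lr * delta_j * a_i + m * dw_prev
    return w + dw, dw

def adapt_lr(lr, err, err_prev, grow=1.05, shrink=0.7):
    """Adaptive step: raise the learning rate while the error function is
    decreasing, lower it otherwise (factors are assumptions)."""
    return lr * grow if err < err_prev else lr * shrink
```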

In the proposed prototype, we have developed eight vision-based commands to operate a machine remotely by showing hand gestures. The proposed ANN model has three layers, an input layer, a hidden layer, and an output layer, as shown in **Figure 4**. The input layer has 4 neurons, the hidden layer has 10 neurons, and the output layer consists of 8 neurons.
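A forward pass through this 4-10-8 topology can be sketched as below. The random weights, the tanh activation, and the choice of the start/end Cartesian coordinates as the four inputs are illustrative assumptions, not details given in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for the 4-10-8 network of Figure 4; the four inputs
# are assumed to be the start and end Cartesian coordinates of the gesture.
W1, b1 = rng.standard_normal((10, 4)), np.zeros(10)
W2, b2 = rng.standard_normal((8, 10)), np.zeros(8)

def forward(x):
    h = np.tanh(W1 @ x + b1)   # hidden layer, 10 neurons
    return W2 @ h + b2         # output layer, one score per command INT-1..8

x = np.array([1.0, 2.0, -3.0, 0.5])        # (x_start, y_start, x_end, y_end)
command = int(np.argmax(forward(x))) + 1   # predicted command index, 1..8
```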

## **5. Experimental analysis**

In this research work, we have taken three hand postures (shown in **Table 1**) to demonstrate the vision-based tracking efficiency of our proposed concept. This is a unique feature of this work, as most techniques demonstrate tracking of hand movements performed with a single posture [32]. For consolidated evaluation, we have taken approximately 100 data sequences captured in different environments, as shown in **Figure 5**. Our database is a collection of a publicly available dataset [32] and a self-prepared data sequence. In [32], hand movements are mainly performed with a single hand posture (Posture III in **Table 1**) and in a constrained laboratory environment.

In the self-prepared dataset, we have collected hand movements performed by six participants of three different age groups: two kids (age 10–16 years), two adults (age 20–40 years), and two seniors (age 65 years). Here, the hand movement is carried out using three different postures (as illustrated in **Table 1**), in linear as well as circular patterns. In the self-collected dataset, frames are captured at 15 frames per second through the web camera, and gesture length varies from 120 to 160 frames.

**Table 1.**

*Types of postures used in the proposed system.*

The evaluation of the proposed online adaptive hand tracking methodology is carried out on four test parameters. The methodology is also compared with the contemporary techniques that are based on RGB images or webcam images. The four test parameters are as follows:


## **5.1 Accuracy in hand detection**

**Figure 6** demonstrates the outcome of the hand recognition stage for different data sequences captured (using the three hand postures demonstrated in **Table 1**) in different backgrounds under different illumination conditions. To test the accuracy of the hand detection scheme in recognizing the hand region, we have considered nearly all possible combinations:

**Figure 5.** *Dataset for training faster R-CNN.*


only the hand is visible in the camera view, the subject's face along with the arm region is in the camera view, illumination conditions are unstable, the background has the same color as the hand region, etc. Thus, the hand detection results in **Figure 6** illustrate the following distinguishing key features of our proposed system:


## **5.2 Parametric evaluation of Module I**

The proposed hand detection module, developed on Faster R-CNN architecture, has been evaluated on the following parameters:

**Figure 6.**

*Various outcomes of module-I (simple background, complex background, the subject is also visible in camera range).*


**Table 2** illustrates the detailed performance outcomes of the proposed Faster R-CNN based on the abovementioned parameters. The observations are taken at intervals of 50, 100, 150, 200, and 220 iterations. The outcomes illustrate the following points about our proposed architecture built on ResNet-101:



*Based on above outcomes, the characteristic features of the proposed trained resnet101 are:*

*Accuracy: 98.76%*

*Loss: 0.17*

*The behavior of the Network: Well fit.*

**Table 2.**

*Outcomes in the training process of the proposed faster R-CNN model.*

On the validation dataset, the values of RMSE and loss reached their minima at the 200th iteration.

## **5.3 Efficiency of hybrid tracking system**

In this section, we have evaluated the tracking efficiency of our proposed hybrid method. The captured data sequences are of variable length, ranging from 100 to 150 frames. **Figure 7** shows the results of tracking in different data sequences; approximately 10–12 frames of each data sequence are shown here to highlight the tracking efficiency of Module II. Each frame is labeled with its frame number; a yellow box encloses the hand region, and a yellow dot inside the box marks the instantaneous position of the centroid of the hand region. **Figure 7**(a) shows the tracking of the P-I posture in a cluttered background.

**Figure 7.**

*Tracking outcomes of different data sequences are shown in (a), (b), and (c); (d) shows the cognitive recognition of the hand movement in (c).*

This data sequence is captured in a background that contains many objects of similar color to the hand. Our proposed system discriminates and localizes the hand region efficiently due to the robust deep-feature learning capability of our hybrid tracking system. It is also noticeable that the hand is properly identified even when the hand region was blurred due to a sudden erratic movement by the subject, as shown in frame 99 of the data sequence.

**Figure 7(b)** displays the tracking results of the P-III hand posture in improper illumination conditions. It can be noticed that in **Figure 7(a)** and **(b)**, the FoS are frame 3 and frame 15, respectively. This data sequence is mainly affected by the color reflection of the background wall, and thus, it is visible that the edges of the P-III posture are nearly mixed with the background in some frames.

**Figure 7(c)** demonstrates the tracking results of a data sequence [32] in which a teenage girl is moving her hand (posture P-III) in front of her face. It is noticeable that the hand region and face region nearly overlap in frame 17. The fast change in the hand position across frames indicates that the subject is moving her hand speedily. The change in distance between the two hand positions from frame 45 to frame 59, along with the change from a clear image of the hand region to a blurred one, confirms the fast movement of the hand. During the movement, the subject also changes the orientation of the hand posture, as can be seen in frames 59, 73, 80, and 87.

## **5.4 Cognitive recognition efficiency**

Cognitive efficiency means developing the semantics between the trajectory of a dynamic hand gesture and a machine command. Since hand gestures do not follow a fixed line of movement to convey the same meaning, syntax formation to match training data and test data is a challenge. Hence, the main limitation in DHGR is the development of a process that can convert the trajectory of hand movement to a machine command. Our proposed method handles this difficult challenge in a schematic manner.

In our proposed technique, we have developed eight vision-based commands, "INSTRUCTION 1–8" (abbreviated INT-1 to INT-8). For vision-based instruction, we have drafted a process to convert the trajectory of the hand movement obtained in Module II to a machine command by using the Cartesian plane system, as illustrated in **Figure 8**.

**Figure 7**(d) illustrates the process of developing the machine's cognitive ability to recognize hand movement. This process consists of three steps: (i) plotting the trajectory of the hand movement, (ii) locating the start and end points in the Cartesian plane, and (iii) converting to a machine command. **Figure 7**(d) demonstrates the results of the cognitive recognition of the data sequence shown in **Figure 7(c)** [32]; here an adult girl moves her hand from right to left, and the machine recognizes this movement as command 7.
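The quadrant-pair lookup behind step (iii) can be sketched as below. The chapter does not print the full command table, so this mapping is hypothetical: only the two examples described in the text (a right-to-left movement yielding command 7 and a third-to-fourth-quadrant movement yielding command 8) anchor it, and even the quadrant pair assigned to command 7 is an assumption.

```python
# Hypothetical start/end-quadrant lookup for the eight commands INT-1..INT-8.
COMMANDS = {
    (1, 2): "INT-7",  # right -> left movement (assumed quadrant pair)
    (3, 4): "INT-8",  # bottom-left zigzag ending nearby (as in Figure 9(b))
    # ... the remaining six pairs would complete the table
}

def trajectory_to_command(start_q, end_q):
    """Map the quadrants of the trajectory's start and end points to a command."""
    return COMMANDS.get((start_q, end_q), "UNKNOWN")
```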

**Figure 9(a)** shows the tracking results of the P-III posture performed by a teenage boy. In this data sequence, we can notice that the scale change of the hand region is very prominent, as the size of the hand region changes continuously from frame to frame. The posture area is big in frame 37 and gradually decreases until frame 147, indicating that the distance between the subject's hand and the camera is minimum in frame 37 and maximum in frame 147. **Figure 9(b)** displays the result of the cognitive recognition of the trajectory, in the three trajectory-to-command interpretation steps, for a left-initiated data sequence. The movement starts from the bottom left, moves in a


**Figure 8.**

*Conversion of trajectory of hand movement to machine command.*

**Figure 9.**

*(a) Tracking results of the P-III posture performed by a teenage boy. (b) Cognitive recognition of the hand movement.*

zigzag manner, and finally reaches close to the initial starting place. The PoS and the end location of this sequence are in the third and fourth quadrants, respectively; thus, "INSTRUCTION 8" is generated through this hand movement.

## **5.5 Comparison with contemporary techniques**

In this section, we compare our process and results with two different approaches used recently in the field of DHGR. The first approach [32] utilizes true RGB images and mainly involves hand-crafted features for hand detection and tracking. The research work conducted by Singha J. et al. [32] focused only on fist-posture tracking in a fixed background; they achieved 92.23% efficiency when no skin-colored object is present in the surroundings. One of the prominent limitations in


**Table 3.**

*Comparative analysis of two recent methods with the proposed methodology based on different parameters.*


their approach is that they apply a sequence of algorithms for precise detection of the hand region. This makes the method complex and unsuitable for real-time implementation of DHGR.

In the approach proposed by Tran DS et al. [29] for fingertip tracking, the depth coordinates of the fingertip provided by the inbuilt software of an advanced sensor-based camera are used directly. According to the researchers, RGB camera images are strongly affected by illumination variation; thus, to avoid background and illumination complexities in DHGR, they utilized RGB-D data sequences captured through the Microsoft Kinect V2 camera. This is a skeletal-tracker camera that provides the positions of 25 joints of the human skeleton, including the fingertips. The method is designed for tracking only seven hand movements comprising 30–45 frames in three fixed backgrounds; moreover, the subjects are trained to perform the correct hand movement. In this research work, each frame is allotted an individual 3D CNN for classification; consequently, the experiments can perform fingertip tracking only for short gesture lengths. The training time of the 3D CNN is 1 hr 35 min on a six-core processor with 16 GB RAM, which indicates the complexity of the architecture. The accuracy of the trained 3D CNN model is 92.6% on validation data. **Table 3** compares different technical aspects of the two approaches mentioned above with our proposed method.

## **6. Conclusion**

This research work presents solutions to many crucial and unresolved challenges in vision-based tracking of hand movements captured using a simple camera. The methodology has the potential to provide a complete solution from hand detection to tracking and, finally, cognitive recognition of trajectory to machine command for contactless human-machine interaction via dynamic hand gestures. Since the proposed design is implemented around a single RGB webcam, the system is economical and user-friendly. The accuracy achieved by the online and adaptive hand detection scheme with Faster R-CNN is 98.76%. The proposed hybrid tracking scheme adapts efficiently to scale variation, illumination variation, and background conditions. It also exhibits high accuracy when the camera is in motion during the movement. The overall accuracy achieved by our proposed system in complex conditions is 95.83%.

The comparative analysis demonstrates that our system gives users the freedom to select a posture and to start the hand movement from any point in the image frame. Also, we do not impose any strict conditions on the geometrical shape of any posture. The hybrid framework and cognitive recognition features of our proposed method give a robust solution for classifying any hand trajectory in a simple manner. This feature has not been discussed in any existing technique working with RGB images to date. The cumulative command interpretation efficiency of our system in a real-time environment is 96.2%. The various results justify the "online" hand detection and "adaptive" tracking features of the proposed technique. In the future, the method can be extended to track multiple hand movements.

## **Acknowledgements**

No funding was received.

## **Author details**

Richa Golash\* and Yogendra Kumar Jain Samrat Ashok Technological Institute, Vidisha, Madhya Pradesh, India

\*Address all correspondence to: golash.richa@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **References**

[1] Wachs JP, Kölsch M, Stern H, Edan Y. Vision-based hand-gesture applications. Communications of the ACM. 2011;**54**(2):60-71. DOI: 10.1145/1897816.1897838

[2] Golash R, Jain YK. Economical and user-friendly Design of Vision-Based Natural-User Interface via dynamic hand gestures. International Journal of Advanced Research in Engineering and Technology. 2020;**11**(6)

[3] Rautaray SS, Agrawal A. Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review. 2015;**43**(1):1-54. DOI: 10.1007/s10462-012-9356-9

[4] Mcintosh J. How it Works: BMW's Gesture Control. Available from: https:// www.driving.ca/auto-news/news/howit-works-bmw-gesture-control [Accessed: March 23, 2021]

[5] Gu C, Lien J. A two-tone radar sensor for concurrent detection of absolute distance and relative movement for gesture sensing. IEEE Sensors Letters. 2017;**1**(3):1-4. DOI: 10.1109/LSENS.2017.2696520

[6] Oudah M, Al-Naji A, Chahl J. Hand gesture recognition based on computer vision: A review of techniques. Journal of Imaging. 2020;**6**(8):73. DOI: 10.3390/jimaging6080073

[7] Li Y, Huang J, Tian F, Wang HA, Dai GZ. Gesture interaction in virtual reality. Virtual Reality & Intelligent Hardware. 2019;**1**(1):84-112

[8] Chakraborty BK, Sarma D, Bhuyan MK, MacDorman KF. Review of constraints on vision-based gesture recognition for human–computer

interaction. IET Computer Vision. 2018; **12**(1):3-15

[9] Yasen M, Jusoh S. A systematic review on hand gesture recognition techniques, challenges and applications. PeerJ Computer Science. 2019; **16**(5):e218

[10] Golash R, Jain YK. Trajectory-based cognitive recognition of dynamic hand gestures from webcam videos. International Journal of Engineering Research and Technology. 2020;**13**(6): 1432-1440

[11] Yang H, Shao L, Zheng F, Wang L, Song Z. Recent advances and trends in visual tracking: A review. Neurocomputing. 2011;**74**(18):3823-3831

[12] Li X, Hu W, Shen C, Zhang Z, Dick A, Hengel AV. A survey of appearance models in visual object tracking. ACM Transactions on Intelligent Systems and Technology (TIST). 2013;**4**(4):1-48

[13] Bandini A, Zariffa J. Analysis of the hands in egocentric vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020

[14] Golash R, Jain YK. Robust tracking of moving hand in coloured video acquired through simple camera. International Journal of Computer Applications in Technology. 2021;**65**(3): 261-269

[15] Bandara HM, Priyanayana KS, Jayasekara AG, Chandima DP, Gopura RA. An intelligent gesture classification model for domestic wheelchair navigation with gesture variance compensation. Applied Bionics and Biomechanics. 2020;**30**:2020

[16] Wang J, Payandeh S. Hand motion and posture recognition in a network of calibrated cameras. Advances in Multimedia. 2017;**2017**:25. Article ID 2162078. DOI: 10.1155/2017/2162078

[17] Poon G, Kwan KC, Pang WM. Realtime multi-view bimanual gesture recognition. In: 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP). IEEE; 2018. pp. 19-23

[18] Cruz Bautista AG, González-Barbosa JJ, Hurtado-Ramos JB, Ornelas-Rodriguez FJ, González-Barbosa EA. Hand features extractor using hand contour–a case study. Automatika. 2020; **61**(1):99-108

[19] Marin G, Dominio F, Zanuttigh P. Hand gesture recognition with jointly calibrated leap motion and depth sensor. Multimedia Tools and Applications. 2016;**75**(22):14991-15015

[20] Kainz O, Jakab F. Approach to hand tracking and gesture recognition based on depth-sensing cameras and EMG monitoring. Acta Informatica Pragensia. 2014;**3**(1):104-112

[21] Aristidou A. Hand tracking with physiological constraints. The Visual Computer. 2018;**34**(2):213-228

[22] Abraham L, Urru A, Normani N, Wilk MP, Walsh M, O'Flynn B. Hand tracking and gesture recognition using lensless smart sensors. Sensors. 2018; **18**(9):2834

[23] Huang H, Chong Y, Nie C, Pan S. Hand gesture recognition with skin detection and deep learning method. Journal of Physics: Conference Series. 2019;**1213**(2):022001

[24] Yao MH, Gu QL, Wang XB, He WX, Shen Q. A novel hand gesture tracking

algorithm fusing Camshift and particle filter. In: 2015 International Conference on Artificial Intelligence and Industrial Engineering. Atlantis: Atlantis Press; 2015

[25] Khaled H, Sayed SG, Saad ES, Ali H. Hand gesture recognition using modified $1 and background subtraction algorithms. Mathematical Problems in Engineering. 2015;**20**:2015

[26] Liu P, Li X, Cui H, Li S, Yuan Y. Hand gesture recognition based on single-shot multibox detector deep learning. Mobile Information Systems. 2019;**30**:2019

[27] Bao P, Maqueda AI, del-Blanco CR, García N. Tiny hand gesture recognition without localization via a deep convolutional network. IEEE Transactions on Consumer Electronics. 2017;**63**(3):251-257

[28] Shin J, Kim H, Kim D, Paik J. Fast and robust object tracking using tracking failure detection in kernelized correlation filter. Applied Sciences. 2020;**10**(2):713

[29] Tran DS, Ho NH, Yang HJ, Baek ET, Kim SH, Lee G. Real-time hand gesture spotting and recognition using RGB-D camera and 3D convolutional neural network. Applied Sciences. 2020;**10**(2):722

[30] Zhao D, Liu Y, Li G. Skeleton-based dynamic hand gesture recognition using 3d depth data. Electronic Imaging. 2018; **2018**(18):461-461

[31] Plouffe G, Cretu AM. Static and dynamic hand gesture recognition in depth data using dynamic time warping. IEEE Transactions on Instrumentation and Measurement. 2015;**65**(2):305-316

[32] Singha J, Roy A, Laskar RH. Dynamic hand gesture recognition using vision-based approach for human–computer interaction. Neural Computing and Applications. 2018;**29**(4):1129-1141

[33] Barros P, Maciel-Junior NT, Fernandes BJ, Bezerra BL, Fernandes SM. A dynamic gesture recognition and prediction system using the convexity approach. Computer Vision and Image Understanding. 2017; **1**(155):139-149

[34] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016;**39**(6): 1137-1149

[35] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2016. pp. 770-778

[36] Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer vision. 2004;**60**(2):91-110

[37] Lindeberg T. Scale invariant feature transform. Scholarpedia. 2012;**7**(5): 10491

[38] Rojas R. Fast learning algorithms. In: Neural Networks. Berlin: Springer; 1996. pp. 183-225

Section 2
