#### **2.1. Convolutional neural network for object tracking**

A convolutional neural network (CNN) is a multi-layered supervised feedforward neural network. A typical CNN structure includes convolutional, pooling, and fully connected layers. In particular, the automatic feature extraction of a CNN is mainly realized through the convolutional and pooling layers. This structure gives the CNN natural advantages for image processing, and it also shows competitive performance in visual tracking. To solve the problem of object drift caused by similar or cluttered backgrounds in visual tracking, Fan et al. [11] use a CNN to learn spatially and temporally invariant features between adjacent frames. Jin [12] combines a CNN with two convolutional layers and two pooling layers with radial basis function (RBF) networks for feature extraction, so that the invariant features of the object appearance can be better learned during visual tracking. Hong [13] uses an offline-trained CNN to extract a distinctive feature map of the object. Wang [14] trains a two-level CNN offline and uses it for online object tracking; this network pays particular attention to learning motion-invariant features. Unlike most CNNs used for object tracking, the network designed by Wang et al. [15] does not output a binarized classification result but instead generates a probability map representing the potential area of the object. The use of CNNs greatly improves the accuracy of visual tracking, but high computational complexity remains a limitation. To improve the real-time performance of tracking algorithms, Doulamis et al. [16] proposed a fast adaptive supervised algorithm for object tracking and classification. In addition, although the pooling operation in a CNN yields invariant features that mitigate the loss of recognition accuracy caused by changes in object appearance, it reduces image resolution and leads to spatial information loss.
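The two operations discussed above can be illustrated concretely. The following pure-Python sketch (not any of the cited networks; image and kernel values are arbitrary) applies a small convolution followed by non-overlapping max pooling, showing how pooling halves the spatial resolution of the feature map — the information loss noted above.

```python
# Minimal sketch of the two operations behind a CNN's automatic feature
# extraction: a 2-D convolution (cross-correlation, as in most CNNs)
# followed by max pooling. Illustrative values only.

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; halves resolution when size=2,
    which is the source of the spatial information loss noted above."""
    return [[max(fmap[i + u][j + v]
                 for u in range(size) for v in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 0],
         [2, 1, 0, 1, 2],
         [0, 3, 1, 0, 1]]
edge_kernel = [[1, 0], [0, -1]]     # toy 2x2 "feature detector"

fmap = conv2d(image, edge_kernel)   # 4x4 feature map
pooled = max_pool(fmap)             # 2x2: more invariant, lower resolution
print(len(fmap), len(pooled))       # 4 2
```

The pooled map is more robust to small shifts of the object, but the exact position of each strong response inside every 2x2 window is discarded.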
This information loss caused by pooling operations is critical for tracking [17]. Zhang et al. [18] combined convolutional neural networks with spatio-temporal saliency-guided sampling for object tracking in a correlation filter framework. The algorithm builds an optimization function that locates the object based on salient region detection and salient motion estimation. Unlike other tracking algorithms whose location estimation relies only on the last layer of the convolutional neural network, this algorithm combines intra-frame appearance correlation information with inter-frame motion saliency information to ensure accurate target localization. In summary, object tracking algorithms based on convolutional neural networks can track objects effectively, but the network structure is relatively complex, training is time-consuming, a large number of labeled training samples is required, and it is difficult to balance tracking accuracy against tracking speed.

#### **2.2. Deep auto-encoder for object tracking**

The basic idea of a deep auto-encoder (DAE) is to encode the input signal and then use a decoder to reconstruct the original signal; the goal is to minimize the reconstruction error between the reconstructed signal and the original signal. Compared with visual tracking methods using CNNs, a DAE compresses the original signal by coding, removes redundancy, and can reflect the essential nature of the original signal more concisely. Therefore, visual tracking using a DAE has a lower computational cost and is better suited to applications with strict real-time requirements. In 2013, Wang et al. [19] proposed a novel deep learning tracker (DLT), which was the first to use a DAE for tracking. DLT treats object tracking as a binary classification problem. First, the Tiny Images dataset is used to train a stacked denoising auto-encoder (SDAE) offline in an unsupervised manner to obtain a generic image feature representation, which is then used for online tracking. A classification neural network is then constructed and fine-tuned during tracking to distinguish the target from the background. Soon after, many improved versions of DLT were proposed. For example, Zhou [20] combined an online AdaBoost feature selection framework with an SDAE to handle complex and dramatic changes in object appearance. Cheng et al. [21] used an SDAE network to implement adaptive target tracking through incremental deep learning under a dual particle filter framework. Cheng et al. [22] implemented an object tracking algorithm based on an enhanced group tracker and an SDAE within the framework of the popular tracking-learning-detection (TLD) algorithm, in order to counter the object drift suffered by appearance-model-based tracking methods.
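The encode-corrupt-reconstruct principle behind the denoising auto-encoder can be sketched in a few lines. The toy below (not DLT or any cited tracker; dimensions, data, and learning rate are illustrative assumptions) uses a tied-weight linear encoder/decoder and numerical gradients, and shows the reconstruction error against the *clean* signal shrinking as training proceeds.

```python
import random

# Toy denoising auto-encoder: corrupt the input, encode to a lower-
# dimensional code, decode, and minimize reconstruction error against
# the clean signal. Gradients by finite differences for simplicity.

random.seed(0)
IN, HID = 4, 2
W = [[random.uniform(-0.5, 0.5) for _ in range(IN)] for _ in range(HID)]

def encode(W, x):                 # h = W x  (linear encoder)
    return [sum(W[i][j] * x[j] for j in range(IN)) for i in range(HID)]

def decode(W, h):                 # x_hat = W^T h  (tied weights)
    return [sum(W[i][j] * h[i] for i in range(HID)) for j in range(IN)]

def recon_error(W, xs):
    err = 0.0
    for x in xs:
        noisy = [v + random.gauss(0, 0.05) for v in x]   # corrupt input
        x_hat = decode(W, encode(W, noisy))
        err += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return err / len(xs)

data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [0.5, 0.5, 0, 0]]

def step(W, xs, lr=0.005, eps=1e-4):
    """One descent step; seeding fixes the corruption noise so the
    objective is deterministic while the gradient is estimated."""
    random.seed(1)
    base = recon_error(W, xs)
    grad = [[0.0] * IN for _ in range(HID)]
    for i in range(HID):
        for j in range(IN):
            W[i][j] += eps
            random.seed(1)
            grad[i][j] = (recon_error(W, xs) - base) / eps
            W[i][j] -= eps
    for i in range(HID):
        for j in range(IN):
            W[i][j] -= lr * grad[i][j]
    return base

errors = [step(W, data) for _ in range(300)]   # error shrinks over training
```

A stacked denoising auto-encoder repeats this scheme layer by layer, each layer encoding the previous layer's code; the learned encoder then serves as the feature extractor for tracking.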
Because the Haar-like features used in the multi-instance learning (MIL) tracking algorithm have difficulty reflecting changes in the object itself and in its surroundings, Cheng et al. [23] introduced an SDAE to extract effective features from the example images and achieve higher-precision tracking. To further improve the performance of the stacked denoising auto-encoder in video object tracking, several improved tracking algorithms based on it have been proposed. Dai et al. [24] proposed a local patch tracking algorithm based on a stacked denoising auto-encoder. The algorithm partitions the input image; a feature extractor combining multiple stacked denoising auto-encoders then describes the feature information of each local patch, and the local features are fused to achieve object tracking. Local feature extraction greatly reduces the computational complexity compared with a global feature representation. During tracking, the weight of each patch in an object candidate region can be adjusted adaptively according to the confidence of the corresponding network. Hua et al. [25] proposed a new visual tracking algorithm based on the multi-level feature learning capability of the stacked denoising auto-encoder under the particle filter framework.

Training the stacked auto-encoder network includes two stages: hierarchical pre-training and online tracking. In the hierarchical pre-training stage, a multi-level description of image features is obtained. In the online tracking stage, the network parameters are fine-tuned with a genetic algorithm rather than standard back-propagation. Using a genetic algorithm for parameter adjustment avoids the deficiencies of the traditional BP algorithm and further enhances the robustness of the network. These trackers can use an SDAE for unsupervised feature learning on unlabeled data, alleviating the problem of insufficient training data for deep neural networks (DNNs). However, in some challenging and complex environments these trackers still fail to track the object. Therefore, the feature expression capability of DNNs can be further enhanced for more robust tracking.
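Several of the trackers above (e.g., [21], [25]) operate under a particle filter framework. As a rough illustration of that framework only — not of any cited algorithm — the sketch below runs a bootstrap particle filter on a 1-D object position, where a hand-written Gaussian likelihood stands in for the confidence score a deep network would supply; all numbers are illustrative assumptions.

```python
import math
import random

# Bootstrap particle filter on a 1-D position: predict with a random-walk
# motion model, weight by observation likelihood, resample. In the deep
# trackers above, a network confidence replaces the Gaussian likelihood.

random.seed(42)
N = 500
particles = [random.uniform(0.0, 10.0) for _ in range(N)]

def pf_step(particles, observation, motion_std=0.5, obs_std=1.0):
    # 1. Predict: propagate each particle through the motion model.
    particles = [p + random.gauss(0.0, motion_std) for p in particles]
    # 2. Update: weight each particle by the observation likelihood.
    weights = [math.exp(-(p - observation) ** 2 / (2 * obs_std ** 2))
               for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3. Resample: draw particles in proportion to their weights.
    return random.choices(particles, weights=weights, k=len(particles))

# Object drifts from 3.0 to 5.0; observations are noisy.
for z in [3.0, 3.5, 4.0, 4.5, 5.0]:
    particles = pf_step(particles, z + random.gauss(0, 0.2))

estimate = sum(particles) / len(particles)   # posterior mean position
```

The posterior mean tracks the drifting position; in a visual tracker each particle would instead be a candidate bounding box scored by the learned appearance model.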

In this chapter, we add a k-sparse constraint to the encoding part of the SDAE to learn more invariant features of the object appearance and propose a stacked k-sparse auto-encoder based robust tracking algorithm for outdoor vehicles under the particle filter framework, in order to solve the problem of large appearance variations during tracking.
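The k-sparse constraint itself is simple to state: after the encoder computes its hidden activations, only the k largest are kept and the rest are zeroed, so each input is described by its k strongest features. A minimal stand-alone sketch of this selection step (illustrative only, not the chapter's full tracker):

```python
# k-sparse constraint on an encoder's hidden layer: keep the k largest
# activations, zero out the rest. Illustrative stand-alone function.

def k_sparse(activations, k):
    """Return activations with all but the k largest set to 0.0."""
    if k >= len(activations):
        return list(activations)
    threshold = sorted(activations, reverse=True)[k - 1]
    out, kept = [], 0
    for a in activations:
        if a >= threshold and kept < k:   # kept-counter handles ties
            out.append(a)
            kept += 1
        else:
            out.append(0.0)
    return out

hidden = [0.1, 0.9, 0.3, 0.7, 0.05]
code = k_sparse(hidden, 2)   # [0.0, 0.9, 0.0, 0.7, 0.0]
```

During training, gradients flow only through the surviving units, which pushes the network toward a small set of strongly responding, and hence more invariant, appearance features.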
