**2.2 Object feature extraction and representation**

Since objects in video surveillance are physical entities (e.g. people, vehicles) present in the scene for a certain time, they are in general detected and tracked in a large number of frames. Objects in videos possess two main types of characteristics, namely spatial and temporal. The spatial characteristics of an object include its positions in frames (2D image coordinates) and in the scene (3D world coordinates), its spatial relationships with other objects and its appearance. The temporal characteristics of an object comprise its movement and its temporal relationships with other objects. An object may therefore be represented by one sole characteristic or by several. Among these characteristics, object movement and object appearance are the two most important and are the most widely used in the literature.
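The spatial and temporal characteristics above can be carried by a simple data structure. The following sketch (all names are illustrative, not from this chapter) stores the per-frame detections of a tracked object and derives its movement trajectory from them:

```python
from dataclasses import dataclass, field

@dataclass
class Blob:
    """One detection of the object in a single frame (2D image coordinates)."""
    frame: int   # frame index where the object was detected
    x: float     # top-left corner of the minimal bounding box
    y: float
    w: float     # box width and height
    h: float

@dataclass
class TrackedObject:
    """A tracked object: its appearance is carried by per-frame blobs,
    its movement by the trajectory of blob centers over time."""
    object_id: int
    blobs: list = field(default_factory=list)

    def trajectory(self):
        """Temporal sequence of 2D positions (blob centers), one per frame."""
        return [(b.frame, b.x + b.w / 2, b.y + b.h / 2) for b in self.blobs]

# Usage: two detections of the same object in consecutive frames
obj = TrackedObject(object_id=7)
obj.blobs.append(Blob(frame=0, x=10, y=20, w=4, h=8))
obj.blobs.append(Blob(frame=1, x=12, y=20, w=4, h=8))
print(obj.trajectory())  # [(0, 12.0, 24.0), (1, 14.0, 24.0)]
```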

Concerning object representation based on object movement, a number of different approaches have been proposed in the literature for movement representation and matching (Broilo, Piotto et al. 2010). Certain approaches directly use the object positions detected across frames, represented in trajectory form (Zheng, Feng et al. 2005). As an object trajectory may be very complex, other authors segment it into several sub-trajectories (Buchin, Driemel et al. 2010) so that each sub-trajectory represents a relatively stable pattern of object movement. Other work moves to higher levels of trajectory representation, namely the symbolic and semantic levels. At the symbolic level, (Chen, Ozsu et al. 2004; Hsieh, Yu et al. 2006; Le, Boucher et al. 2007) convert the object trajectory into a character sequence. The advantage is that well-established methods from text retrieval, such as the Edit Distance, can then be applied to trajectory matching. Approaches dedicated to trajectory representation at the semantic level try to learn semantic meanings, such as "turn left" or "low speed", from object movement (Hu, Xie et al. 2007). As a result, the output is close to the human way of thinking; however, these approaches strongly depend on the application.

Object representation based on appearance has attracted a lot of research interest. Appearance-based object retrieval methods for surveillance videos are distinguished from each other by two criteria. The first criterion is the appearance feature extracted from the image/frame where the object is detected; the second is the way the object signature is created from all features extracted over the object's lifetime and the way objects are matched based on their signatures. In the next section, we describe object signature building and object matching in detail; in this section, we only present object appearance features.

There is a great variety of object features used for surveillance object representation. In fact, all features proposed for image retrieval can be applied to surveillance object representation. Appearance features can be divided into two categories: global and local. Global features include the color histogram, dominant color and covariance matrix, to name a few. Besides global features, local features such as interest points and the SIFT descriptor can be extracted from the object's region.

In (Yuk, Wong et al. 2007), the authors proposed to use MPEG-7 descriptors, such as dominant colors and edge histograms, for surveillance retrieval. In the context of a research project conducted by the IBM research center<sup>1</sup>, the researchers evaluated a large number of color features for surveillance applications: standard color histograms, weighted color histograms, variable-bin-size color histograms and color correlograms. Results show that the color correlogram has the best performance. Ma and Cohen (Ma and Cohen 2007) suggest using the covariance matrix as object feature. According to the authors, the covariance matrix is appealing because it fuses different types of features and has small dimensionality. The small dimensionality of the model makes it well suited to surveillance videos because it takes very little storage space. In our research (Le, Boucher et al. 2010), we evaluated the performance of four descriptors (dominant color, edge histogram, covariance matrix (CM) and SIFT descriptor) for surveillance object representation and matching. The results show that if objects are detected while the background and context objects are not present in the object region, these descriptors allow retrieving objects with relatively good results. In the other cases, the covariance matrix is more effective than the other descriptors. According to our experiments, it is interesting to see that while the covariance matrix represents information from all pixels in a blob, the interest points use only a few pixels, and the dominant color and edge histogram use approximate information about pixel color and edges. A pair of descriptors, (covariance matrix, dominant color), (covariance matrix, edge histogram) or (covariance matrix, SIFT descriptors), may be chosen as default descriptors for object representation.

**3. Appearance-based object retrieval in surveillance videos**

In this section, we first give some definitions and point out the existing challenges for appearance-based object retrieval in surveillance videos. Then, we describe the solutions proposed for two important tasks, object signature building and object matching, in order to overcome these challenges.

**3.1 Definitions**

Definition 1: An **object blob** is a region determined by a minimal bounding box in the frame where the object is detected.

The minimal bounding box is calculated by the object detection module in video analysis, and an object has one sole minimal bounding box in each frame. Fig. 4 gives some examples of detected objects and their corresponding blobs.

Fig. 4. Detected objects and their blobs (Bak, Corvee et al. 2010).

Definition 2: **Object representation**

In surveillance applications, one object is in general detected and tracked in a number of frames. In other words, a set of object blobs is defined for an object. Therefore, an object can be represented as:

*O* = {*B<sub>i</sub>*}, *i* = 1, …, *N* (1)
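Definition 2 represents an object *O* as the set of its blobs. The following sketch shows how such a blob-set representation can be matched; the mean-color feature and the minimum-pairwise-distance rule are illustrative choices only, not the methods evaluated in this chapter:

```python
import numpy as np

def blob_feature(patch):
    """Illustrative per-blob appearance feature: the mean RGB color of the
    blob's pixels. Any descriptor from the text (dominant color, edge
    histogram, covariance matrix, SIFT) could be substituted here."""
    return patch.reshape(-1, 3).mean(axis=0)

def object_distance(blobs_a, blobs_b):
    """Distance between two objects, each given as a set of blobs O = {B_i}.
    One simple choice: the minimum pairwise distance between blob features."""
    feats_a = [blob_feature(p) for p in blobs_a]
    feats_b = [blob_feature(p) for p in blobs_b]
    return min(np.linalg.norm(fa - fb) for fa in feats_a for fb in feats_b)

# Usage: a mostly-red object vs. itself and vs. a mostly-blue object
red  = [np.tile([200.0, 10.0, 10.0], (8, 8, 1))]
blue = [np.tile([10.0, 10.0, 200.0], (8, 8, 1))]
print(object_distance(red, red))                # 0.0
print(object_distance(red, blue))               # large (different colors)
```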

<sup>1</sup> https://researcher.ibm.com/researcher/view\_project.php?id=1393
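As a minimal sketch of the covariance-matrix descriptor reported above as the most robust choice (Ma and Cohen 2007), assuming the simple per-pixel feature vector [x, y, R, G, B] and the generalized-eigenvalue distance commonly used with this descriptor; actual implementations may use richer features such as gradients:

```python
import numpy as np

def covariance_descriptor(patch, eps=1e-6):
    """Region covariance descriptor: the 5x5 covariance matrix of the
    per-pixel feature vectors [x, y, R, G, B] over the blob's pixels.
    A small ridge (eps) keeps the matrix invertible on flat regions."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([xs.ravel(), ys.ravel(),
                             patch[..., 0].ravel(),
                             patch[..., 1].ravel(),
                             patch[..., 2].ravel()])
    return np.cov(feats, rowvar=False) + eps * np.eye(5)

def covariance_distance(c1, c2):
    """Dissimilarity of two covariance descriptors via their generalized
    eigenvalues: sqrt(sum_i ln^2 lambda_i(C1, C2)), a metric commonly
    used with region covariance matrices."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(c1, c2)))
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

# Usage: descriptor of a random color patch; identical descriptors
# are at distance (effectively) zero
patch = np.random.default_rng(0).uniform(0.0, 255.0, size=(16, 12, 3))
C = covariance_descriptor(patch)
print(C.shape)                      # (5, 5)
print(covariance_distance(C, C))    # effectively 0.0
```

The appeal noted in the text is visible here: whatever the blob size, the descriptor is a fixed 5×5 symmetric matrix, so it takes very little storage space.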
