**3.3 Object signature building**

Object signature building is the process of computing one or a set of descriptors, called the object signature, from a set of object blobs.

The calculated signature should (1) be able to represent all aspects of the object appearance, (2) be distinctive and (3) be as compact as possible. The first two characteristics ensure the robustness of the retrieval part; the third relates to the effectiveness of the indexing part, since a compact signature requires little storage.

Object signature building methods for surveillance video are divided into two approaches. The first approach is based on the following observation: surveillance objects are generally detected and tracked over a large number of frames, so an object is represented by a set of blobs. Due to errors in object detection, using all these blobs for object indexing and retrieval is irrelevant; it is also redundant, because consecutive blobs of an object are closely similar. Based on this observation, methods belonging to the first approach select the most relevant and representative blobs from the set of blobs and then compute object features on these blobs. This process is defined by Eq. 2 and is composed of two steps. The first step, called representative blob detection, chooses from the object blobs the ones that best represent the object appearance, while the second step computes the object features mentioned in Section 2.2 from these representative blobs.

$$\left\{ B_{i},\ i \in 1,\ldots,N \right\} \xrightarrow{(1)} \left\{ Br_{j},\ j \in 1,\ldots,M \right\} \xrightarrow{(2)} \left\{ F_{j},\ j \in 1,\ldots,M \right\}, \quad \text{with } N \gg M \tag{2}$$

where:

$\{B_i,\ i \in 1,\ldots,N\}$: the set of original blobs for the object O, determined by using the object detection output.

46 Recent Developments in Video Surveillance

Fig. 8. An example of object ID confusion: three ground-truth object IDs associated to one sole detected object (object ID confusion = 3).
Instead of calculating only the representative blobs, several authors compute a set of pairs of a representative blob and an associated weight, where the weight expresses the importance of that blob. With weights, the first approach is defined as follows:

$$\left\{ B_{i},\ i \in 1,\ldots,N \right\} \xrightarrow{(1)} \left\{ (Br_{j}, w_{j}),\ j \in 1,\ldots,M \right\} \xrightarrow{(2)} \left\{ (F_{j}, w_{j}),\ j \in 1,\ldots,M \right\}, \quad \text{with } N \gg M \text{ and } \sum_{j=1}^{M} w_{j} = 1 \tag{3}$$

Fig. 9 shows an example of the first object signature building approach. From a large number of blobs (905 blobs), the object signature building method selects only 4 representative blobs. Their associated weights are 0.142, 0.005, 0.016 and 0.835.

Fig. 9. An example of representative blob detection: 4 representative blobs are extracted from 905 blobs.

The methods presented in (Ma and Cohen 2007) and in (Le, Thonnat et al. 2009) are the most significant ones of the first object signature building approach. They differ in how they define the representative blobs.

The representative blob detection method proposed by Ma and Cohen (Ma and Cohen 2007) is based on agglomerative hierarchical clustering and on covariance matrices extracted from the object blobs. This method is composed of the three following steps.


Appearance-Based Retrieval for Tracked Objects in Surveillance Videos 49

The first step aims at forming clusters of similar blobs. The similarity of two blobs is defined by using the covariance matrix. The covariance matrix is built over a feature vector f computed for each pixel: f(x, y) = [x, y, R(x, y), G(x, y), B(x, y), ∇R(x, y), ∇G(x, y), ∇B(x, y)], where R, G, B are the color channels and x, y are the pixel coordinates, contributing the color and the gradient information. The covariance matrix is computed for each detected blob as follows:

$$C = \sum_{x,y} (f - \overline{f})(f - \overline{f})^T \tag{4}$$

The covariance matrices of blobs of different sizes all have the same size: the covariance matrix is a $d \times d$ matrix, where d is the dimension of the feature vector f.

The distance between two blobs is calculated as:

$$d(C_i, C_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(C_i, C_j)} \tag{5}$$

where $\lambda_k(C_i, C_j)$, $k = 1,\ldots,d$, are the generalized eigenvalues of $C_i$ and $C_j$.
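To make Eqs. 4 and 5 concrete, the descriptor and the blob distance can be sketched as follows (a minimal sketch, assuming blobs are RGB numpy arrays; the gradient-magnitude features and the small ridge term added for numerical stability are implementation choices, not values from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def blob_covariance(blob):
    """Covariance descriptor of an H x W x 3 blob (Eq. 4).

    Per-pixel feature: [x, y, R, G, B, |grad R|, |grad G|, |grad B|].
    """
    h, w, _ = blob.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Gradient magnitude of each color channel.
    grads = [np.hypot(*np.gradient(blob[:, :, c].astype(float))) for c in range(3)]
    feats = np.stack([xs, ys, blob[:, :, 0], blob[:, :, 1], blob[:, :, 2]] + grads,
                     axis=-1)
    f = feats.reshape(-1, 8).astype(float)
    fm = f - f.mean(axis=0)
    return fm.T @ fm                       # C = sum (f - fbar)(f - fbar)^T

def blob_distance(C_i, C_j, eps=1e-9):
    """Eq. 5: sqrt(sum_k ln^2 lambda_k) over generalized eigenvalues of (C_i, C_j).

    The eps ridge is a numerical safeguard (assumption, not in the paper).
    """
    ridge = eps * np.eye(C_i.shape[0])
    lam = eigh(C_i + ridge, C_j + ridge, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

The distance is symmetric because the generalized eigenvalues of $(C_j, C_i)$ are the reciprocals of those of $(C_i, C_j)$, which leaves the sum of squared logarithms unchanged.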

For the agglomerative clustering, the distance $d(A, B)$ between two clusters A and B is computed by average linkage as:

$$d(A, B) = \frac{1}{|A| \, |B|} \sum_{A_i \in A} \sum_{B_j \in B} d(A_i, B_j) \tag{6}$$

where $d(A_i, B_j)$ is the blob distance defined in Eq. 5.

The objective of the second step is to detect and remove outliers, i.e. clusters containing a small number of elements. The final step determines one representative blob for each cluster. For a cluster B, the representative blob $B_l$ is defined as:

$$B_l = \arg\min_{i = 1,\ldots,|B|} \ \sum_{j = 1,\ldots,|B|,\ j \neq i} d(B_i, B_j) \tag{7}$$

where $d(B_i, B_j)$ is the blob distance defined in Eq. 5.
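The clustering of Eq. 6, the outlier removal, and the medoid selection of Eq. 7 can be sketched with scipy's hierarchical clustering (a minimal sketch; the cluster count and minimum cluster size below are illustrative choices, not values from (Ma and Cohen 2007)):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def representative_blobs(dist_matrix, n_clusters=3, min_size=2):
    """Average-linkage clustering (Eq. 6), outlier-cluster removal, and
    medoid selection (Eq. 7) over a symmetric blob-distance matrix."""
    condensed = squareform(dist_matrix, checks=False)
    labels = fcluster(linkage(condensed, method='average'),
                      t=n_clusters, criterion='maxclust')
    reps = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        if len(members) < min_size:        # small cluster -> outlier, removed
            continue
        sub = dist_matrix[np.ix_(members, members)]
        # Medoid: member minimizing the sum of distances to the others (Eq. 7).
        reps.append(int(members[sub.sum(axis=1).argmin()]))
    return reps
```

One natural choice for the associated weights, consistent with Eq. 3, is the size of each valid cluster divided by the total number of blobs, so that the weights sum to one.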

Fig. 10 gives an example result of the Ma and Cohen method (Ma and Cohen 2007): (a) original sequence of blobs; (b) clustering results with valid and invalid clusters; (c) representative frame for the second cluster in (b); (d) representative frame for the third cluster in (b). This method can cope with object detection errors if they occur in a small number of frames. However, if a detection error occurs in a large number of frames, the cluster containing the blobs of these frames will be declared a valid cluster (the validity of clusters is decided by their sizes).

Our work presented in (Le, Thonnat et al. 2009) is an improvement of the Ma and Cohen work (Ma and Cohen 2007), based on two remarks. The first remark is that Ma and Cohen's method cannot work well with imperfect object detection, since it processes all object blobs, relevant and irrelevant alike. This drawback can be resolved by removing the irrelevant blobs before the agglomerative clustering. The second remark is that a blob of an object is relevant if it contains this object or objects belonging to the same class. For example, a blob of a detected person is relevant if it represents somehow the person class. With these analyses, we add two preliminary steps to Ma and Cohen's work.

Fig. 10. Example result of Ma and Cohen method (Ma and Cohen 2007): (a) original sequence of blobs; (b) clustering results having valid clusters and invalid clusters; (c) representative frame for the second cluster in (b); (d) representative frame for the third cluster in (b).

These steps are performed before the first step of Ma and Cohen's work:

**Step 0.** Classify the blobs of all objects into relevant blobs (with the object of interest) and irrelevant blobs (without the object of interest), using a two-class SVM classifier with a radial basis function (RBF) kernel over edge histograms (Won, Park et al. 2002).

**Step 1.** Remove the irrelevant blobs from the set of blobs of each object.
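A sketch of such a relevance filter, assuming scikit-learn is available; the 4-bin "edge histograms" and the synthetic training data below are toy stand-ins for the MPEG-7 edge histograms of (Won, Park et al. 2002):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Toy stand-ins for edge histograms: relevant blobs cluster around one
# histogram shape, irrelevant ones around another (synthetic training data).
relevant = rng.normal([0.6, 0.2, 0.1, 0.1], 0.05, size=(40, 4))
irrelevant = rng.normal([0.25, 0.25, 0.25, 0.25], 0.05, size=(40, 4))
X = np.vstack([relevant, irrelevant])
y = np.array([1] * 40 + [0] * 40)          # 1 = relevant, 0 = irrelevant

clf = SVC(kernel='rbf').fit(X, y)          # Step 0: two-class RBF-kernel SVM

def filter_relevant(histograms, classifier=clf):
    """Step 1: keep only the blobs whose histogram is classified relevant."""
    keep = classifier.predict(histograms) == 1
    return [h for h, k in zip(histograms, keep) if k]
```

In practice the classifier would be trained once on labeled blobs of the object class of interest and then applied to every tracked object's blob set before clustering.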


It is worth noting that the appearance of tracked objects may vary, but their blobs usually share some common visual characteristics (e.g. human shape characteristics for the blobs of different tracked persons). The two added steps remove irrelevant blobs before the agglomerative clustering; therefore, this object signature building method remains robust with poor quality object detection.

The second object signature building approach does not explicitly perform representative blob detection. It attempts to sum up all object appearances into one sole signature. This approach is defined as follows:


$$\left\{ B_{i},\ i \in 1,\ldots,N \right\} \longrightarrow F \tag{8}$$

The work presented in (Calderara, Cucchiara et al. 2006) belongs to the second object signature building approach. The authors propose three notions: the person's appearance (PA), the single camera appearance trace (SCAT) and the multicamera appearance trace (MCAT). The SCAT of a person P on camera $C_i$ is composed of all the past person's appearances (PA) of P up to the instant t:

$$SCAT_i^P = \left\{ PA_i^P(t) \mid t = 1, \ldots, N_i^P \right\} \tag{9}$$

where t represents the samples in time in which the person P was visible from the camera $C_i$, and $N_i^P$ is the total number of frames in which he was visible and detected.

The MCAT of a person P is composed of all the $SCAT_i^P$ for any camera $C_i$ in which, at the current moment, the person P has been detected for at least one frame. SCAT is equivalent to MCAT if the surveillance system has only one camera, and SCAT is equivalent to $\{B_i,\ i \in 1,\ldots,N\}$ in our definition.

The object signature building based on mixture of Gaussians is performed as follows:

**Step 1.** Using the first PA in the MCAT, the ten principal modes of the color histogram are extracted;

**Step 2.** The Gaussians are initialized with a mean μ equal to the color corresponding to the mode and a fixed variance σ²; the weights are equally distributed among the Gaussians;

**Step 3.** The successive PAs belonging to the MCAT are processed to extract again the ten main modes that are used to update the mixture; then, for each mode:

	- (a) its value is checked against the mean of each Gaussian and if for none of them the difference is within 2.5σ of the distribution, the mode generates a new Gaussian (using the same process reported above) replacing the existing Gaussian with the lowest weight;
	- (b) the Mahalanobis distance is computed for every Gaussian satisfying the above-reported check, and the mode is assigned to the nearest Gaussian; the mean and the variance of the selected Gaussian are updated with the following adaptive equations:

$$\begin{aligned} \mu_t &= (1 - \alpha)\mu_{t-1} + \alpha X_t \\ \sigma_t^2 &= (1 - \alpha)\sigma_{t-1}^2 + \alpha (X_t - \mu_t)^T (X_t - \mu_t) \end{aligned} \tag{10}$$

where $X_t$ is the vector with the values corresponding to the mode and α is the fixed learning factor; the weights are also updated by increasing that of the selected Gaussian and decreasing those of the other Gaussians accordingly.

At the end of this process, ten Gaussians and the corresponding weights for each MCAT are available and are used as object signature.
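The update of Eq. 10 together with the matching rules (a)-(b) can be sketched for scalar color modes as follows (a minimal sketch; the values of σ0 and α and the two-Gaussian mixture are illustrative, whereas the original work maintains ten Gaussians):

```python
import numpy as np

class MoGSignature:
    """Online mixture-of-Gaussians object signature over scalar color modes."""

    def __init__(self, modes, sigma0=10.0, alpha=0.05):
        self.mu = np.array(modes, dtype=float)   # Step 2: means = first modes
        self.var0 = sigma0 ** 2                  # fixed initial variance
        self.var = np.full(len(self.mu), self.var0)
        self.w = np.full(len(self.mu), 1.0 / len(self.mu))
        self.alpha = alpha

    def update(self, x):
        """Process one new mode value x (rules (a)-(b) and Eq. 10)."""
        close = np.abs(x - self.mu) <= 2.5 * np.sqrt(self.var)
        if not close.any():
            # (a) no match: replace the lowest-weight Gaussian with a new one
            k = int(np.argmin(self.w))
            self.mu[k], self.var[k] = x, self.var0
        else:
            # (b) assign to the nearest matching Gaussian (Mahalanobis distance)
            d = np.where(close, np.abs(x - self.mu) / np.sqrt(self.var), np.inf)
            k = int(np.argmin(d))
            a = self.alpha                       # adaptive updates, Eq. 10
            self.mu[k] = (1 - a) * self.mu[k] + a * x
            self.var[k] = (1 - a) * self.var[k] + a * (x - self.mu[k]) ** 2
        # Increase the selected weight, decrease the others, keep the sum at 1.
        self.w = (1 - self.alpha) * self.w
        self.w[k] += self.alpha
        self.w /= self.w.sum()
```

Repeatedly feeding a mode near one Gaussian pulls that Gaussian's mean toward the observed value and raises its weight at the expense of the others.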

#### **3.4 Object matching**

Object matching is the process that computes the similarity/dissimilarity between two objects based on their signatures calculated by the above-mentioned approaches. In information

retrieval in general, and in surveillance object retrieval in particular, given a query the system will (1) compute the similarity between this query and all elements in the database and (2) return the retrieved results, which are a list of elements sorted by their similarity with the query. The number of returned results is decided for each application.

Corresponding to the two approaches for object signature building, there are two approaches for object matching. Object matching for the first object signature building approach is expressed in Eq. 11. In this equation, the objects $O_q$ and $O_p$ are represented by $\{(F_i^q, w_i^q) \mid i \in 1,\ldots,M^q\}$ and $\{(F_j^p, w_j^p) \mid j \in 1,\ldots,M^p\}$ respectively. The object matching methods define a similarity/dissimilarity between two sets of blobs, which may have different sizes. It is worth noting that the similarity/dissimilarity of a pair of blobs can always be computed from visual features such as the color histogram or the covariance matrix.

$$\begin{cases} \left( \left\{ (Br_i^q, w_i^q) \mid i \in 1,\ldots,M^q \right\}, \left\{ (Br_j^p, w_j^p) \mid j \in 1,\ldots,M^p \right\} \right) \to Dis, \ Dis \in \mathbb{R}, \ \text{or} \\ \left( \left\{ (F_i^q, w_i^q) \mid i \in 1,\ldots,M^q \right\}, \left\{ (F_j^p, w_j^p) \mid j \in 1,\ldots,M^p \right\} \right) \to Dis, \ Dis \in \mathbb{R} \end{cases} \tag{11}$$

In (Ma and Cohen 2007), the authors define a similarity measure between two objects $O_q$ and $O_p$ using the Hausdorff distance (Eq. 12). The Hausdorff distance is the maximum distance of a set to the nearest point in the other set.

$$\begin{split} Dis &= \text{Hausdorff}\left( \left\{ (F_i^q, w_i^q) \mid i \in 1,\ldots,M^q \right\}, \left\{ (F_j^p, w_j^p) \mid j \in 1,\ldots,M^p \right\} \right) \\ &= \max_{i \in 1,\ldots,M^q} \ \min_{j \in 1,\ldots,M^p} d(F_i^q, F_j^p) \end{split} \tag{12}$$

where $d(F_i^q, F_j^p)$ is the distance between two blobs using the covariance matrix.

The above object matching takes into consideration multiple appearance aspects of the tracked object. However, the Hausdorff distance is not suitable when working with object tracking algorithms having a high value of object ID confusion, because this distance is extremely sensitive to outliers: if two sets of points A and B are similar and all points are perfectly superimposed except one single point in A far from any point in B, the Hausdorff distance is determined by this point.

In (Le, Thonnat et al. 2009), we propose a new object matching based on the EMD (Earth Mover's Distance) (Rubner, Tomasi et al. 1998). This method is widely applied with success in image and scripted video retrieval.

$$Dis = EMD\left( \left\{ (F_i^q, w_i^q) \mid i \in 1,\ldots,M^q \right\}, \left\{ (F_j^p, w_j^p) \mid j \in 1,\ldots,M^p \right\} \right) \tag{13}$$

Computing the EMD is based on a solution to the old transportation problem. This is a bipartite network flow problem which can be formalized as the following linear programming problem: let I be a set of suppliers, J a set of consumers, and $c_{ij}$ the cost to ship a unit of supply from $i \in I$ to $j \in J$. We want to find a set of flows $f_{ij}$ that minimizes the overall cost:


$$\sum_{i \in I} \sum_{j \in J} f_{ij} c_{ij} \tag{14}$$

subject to the following constraints:

$$\begin{aligned} f_{ij} &\geq 0, \quad i \in I, \ j \in J \\ \sum_{i \in I} f_{ij} &= y_j, \quad j \in J \\ \sum_{j \in J} f_{ij} &\leq x_i, \quad i \in I \\ \sum_{j \in J} y_j &\leq \sum_{i \in I} x_i \end{aligned} \tag{15}$$

where $x_i$ is the total supply of supplier i and $y_j$ is the total capacity of consumer j. Once the transportation problem is solved and the optimal flow $F^* = \{f_{ij}^*\}$ has been found, the EMD is defined as:

$$EMD = \frac{\sum_{i \in I} \sum_{j \in J} f_{ij}^{*} c_{ij}}{\sum_{j \in J} y_j} \tag{16}$$

When applied to surveillance object matching, the cost $c_{ij}$ becomes the distance between two blobs and the total supplies $x_i$ and $y_j$ are the blob weights. $c_{ij}$ can be any descriptor distance between two blobs, such as the color histogram distance or the covariance matrix distance.
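Since Eqs. 14-16 form a small linear program, the EMD can be computed with a generic LP solver (a minimal sketch using scipy.optimize.linprog; here x and y would be the blob weights and cost the pairwise blob distances):

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, y, cost):
    """EMD of Eqs. 14-16: x supplies (|I|,), y demands (|J|,), cost (|I| x |J|).

    Flow variables f_ij are flattened row-major: index i * n + j.
    """
    m, n = cost.shape
    # Equality constraints (Eq. 15): sum_i f_ij = y_j, one row per consumer j.
    A_eq = np.zeros((n, m * n))
    for j in range(n):
        A_eq[j, j::n] = 1.0
    # Inequality constraints (Eq. 15): sum_j f_ij <= x_i, one row per supplier i.
    A_ub = np.zeros((m, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    res = linprog(cost.ravel(), A_ub=A_ub, b_ub=x, A_eq=A_eq, b_eq=y,
                  bounds=(0, None))
    return res.fun / float(np.sum(y))      # normalization of Eq. 16
```

For signatures whose weights sum to one on both sides (Eq. 3), the denominator equals one and the EMD is simply the minimal total transport cost.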

In comparison with the matching method based on the Hausdorff distance (Ma and Cohen 2007), our matching method based on the EMD has two valuable characteristics. Firstly, it weighs the participation of each blob in the distance according to its similarity with the other blobs and its weight. Thanks to the representative blob detection method, the blob weight expresses the degree of importance of this blob in the object representation. The proposed matching method ensures a minor participation of irrelevant blobs produced by errors in object tracking, because such blobs are relatively different from the other blobs and have a small weight. Therefore, the matching method is robust when working with object tracking algorithms having a high value of *Object Id Confusion*. Secondly, the proposed object matching allows partial matching.

We analyze here an example of these object matching methods: we want to compute the similarity/dissimilarity between object Oq with 4 representative blobs and object Op with 5 representative blobs (Fig. 11). The *Object Id Confusion* values of the object tracking module for the first object and the second object are 2 and 1 respectively.

In order to carry out object matching, we first need to compute the distance of each pair of blobs. Tab. 1 shows the distance of each pair of blobs computed with the covariance matrix distance (cf. Eq. 5), while Fig. 12 presents the results of the object matching methods. Hausdorff-based object matching is determined by the distance between blob 1 of object Oq and blob 5 of object Op (dotted line), while EMD-based object matching searches for an optimal solution with the participation of each blob. This example shows how the EMD-based object matching method overcomes the poor object tracking challenge.
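The Hausdorff value of Eq. 12 can be checked directly on the distances of Tab. 1 (the 4×5 matrix below copies the table; rows are the blobs of Oq, columns the blobs of Op):

```python
import numpy as np

# Tab. 1: pairwise covariance-matrix distances between blobs of Oq and Op.
d = np.array([[3.873, 3.873, 3.873, 3.873, 3.361],
              [2.733, 2.733, 2.733, 2.733, 2.161],
              [2.142, 2.142, 2.142, 2.142, 1.879],
              [2.193, 2.193, 2.193, 2.193, 2.048]])

# Eq. 12: for each blob of Oq take its nearest blob of Op, then the worst case.
hausdorff = d.min(axis=1).max()
print(hausdorff)    # 3.361, reached by blob 1 of Oq against blob 5 of Op
```

The row minima are 3.361, 2.161, 1.879 and 2.048, so the maximum over them is indeed given by the pair (blob 1 of Oq, blob 5 of Op), as stated above.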


Fig. 11. Matching between object Oq with 4 representative blobs and object Op with 5 representative blobs.


Table 1. Distance of each pair of blobs of Oq and Op based on the covariance matrix distance (rows: blobs of Oq; columns: blobs of Op).

| | $Br_1^p$ | $Br_2^p$ | $Br_3^p$ | $Br_4^p$ | $Br_5^p$ |
|---|---|---|---|---|---|
| $Br_1^q$ | 3.873 | 3.873 | 3.873 | 3.873 | 3.361 |
| $Br_2^q$ | 2.733 | 2.733 | 2.733 | 2.733 | 2.161 |
| $Br_3^q$ | 2.142 | 2.142 | 2.142 | 2.142 | 1.879 |
| $Br_4^q$ | 2.193 | 2.193 | 2.193 | 2.193 | 2.048 |

Fig. 12. Hausdorff-based and EMD-based object matching methods. Hausdorff-based object matching is determined by the distance between blob 1 of object Oq and blob 5 of object Op (dotted line) while EMD-based object matching searches for an optimal solution.

With the output of the second object signature building approach, the object matching is relatively simple.

**6. References**

Bak, S., E. Corvee, et al. (2010). Person Re-identification Using Spatial Covariance Regions of Human Body Parts. *AVSS*.

Broilo, M., N. Piotto, et al. (2010). Object Trajectory Analysis in Video Indexing and Retrieval Applications. *Video Search and Mining, Studies in Computational Intelligence*, Springer Berlin Heidelberg. 287: 3–32.

Buchin, M., A. Driemel, et al. (2010). An Algorithmic Framework for Segmenting Trajectories based on Spatio-Temporal Criteria. *18th ACM SIGSPATIAL Int. Conf. Advances in Geographic Information Systems (ACM GIS)*.

Calderara, S., R. Cucchiara, et al. (2006). Multimedia Surveillance: Content-based Retrieval with Multicamera People Tracking. *ACM International Workshop on Video Surveillance & Sensor Networks (VSSN'06)*. Santa Barbara, California, USA: 95-100.

Chen, L., M. T. Ozsu, et al. (2004). Symbolic Representation and Retrieval of Moving Object Trajectories. *MIR'04*.

Hsieh, J. W., S. L. Yu, et al. (2006). "Motion-Based Video Retrieval by Trajectory Matching." *IEEE Trans. on Circuits and Systems for Video Technology* 16(3).

Hu, W., D. Xie, et al. (2007). "Semantic-Based Surveillance Video Retrieval." *IEEE Transactions on Image Processing* 16(4): 1168–1181.

Le, T.-L., A. Boucher, et al. (2007). *Subtrajectory-Based Video Indexing and Retrieval*. The International MultiMedia Modeling Conference (MMM'07), Singapore.

Le, T.-L., A. Boucher, et al. (2010). *Surveillance video retrieval: what we have already done?* ICCE, Nha Trang, VietNam.

Le, T.-L., M. Thonnat, et al. (2009a). *Appearance based retrieval for tracked objects in surveillance videos*. ACM International Conference on Image and Video Retrieval 2009 (CIVR 2009), Santorini, Greece.

Le, T.-L., M. Thonnat, et al. (2009). "Surveillance video indexing and retrieval using object features and semantic events." *International Journal of Pattern Recognition and Artificial Intelligence, Special issue on Visual Analysis and Understanding for Surveillance Applications* 23(7): 1439-1476.

Ma, Y. and I. Cohen (2007). Video Sequence Querying Using Clustering of Objects' Appearance Models. *International Symposium on Visual Computing (ISVC'07)*: 328–339.

Nghiem, A.-T., F. Bremond, et al. (2007). ETISEO, performance evaluation for video surveillance systems. *Proceedings of International Conference on Advanced Video and Signal Based Surveillance (AVSS'07)*. London, United Kingdom.

Rubner, Y., C. Tomasi, et al. (1998). A metric for distributions with applications to image databases. *ICCV'98*: 59–66.

Senior, A. (2009). An Introduction to Automatic Video Surveillance. *Protecting Privacy in Video Surveillance*: 1-9.

Velipasalar, S., L. M. Brown, et al. (2010). "Detection of user-defined, semantically high-level, composite events, and retrieval of event queries." *Multimedia Tools and Applications* 50(1): 249-278.

Won, C. S., D. K. Park, et al. (2002). "Efficient use of mpeg-7 edge histogram descriptor." *ETRI Journal* 24: 23–30.
