
Another kind of technique, called the codebook model, has recently been proposed. It consists in registering, over a long period of time, the possible states of each pixel in what is called a *codebook* Kim et al. (2005), consisting of a set of *codewords*. A pixel is classified into either the background or foreground class by evaluating the difference between a given pixel and the corresponding codebook. A color metric and brightness boundaries are used as classification criteria. Hence, the existing techniques can handle gradual illumination changes, but remain vulnerable to sudden changes. Several works focus mainly on how to make foreground object extraction unaffected by background changes. These methods are very sensitive to the background model: a pixel is correctly classified only when a given image is coherent with its corresponding background model. Another issue is the huge computational time of the background updating process.

In recent years, another kind of technique has emerged to deal with this issue. The Independent Component Analysis (ICA) technique, known for its robustness in the signal processing field, is getting much attention in the image processing field. The purpose of ICA is to restore statistically independent source signals, given only a mixture of these signals. ICA was first applied to fMRI data by McKeown et al. (1998) and has since been introduced for solving problems related to image processing. Hence, ICA finds applications in many emerging application areas, such as feature extraction Delfosse & Loubaton (1995), speech and image recognition Cardoso (1997), data communication Oja et al. (1991), sensor signal processing Cvejic et al. (2007) and biomedical signal processing Dun et al. (2007); Waldert (2007).

More recently, ICA has been introduced in video processing to cope with the issue of foreground estimation. Zhang and Chen Zhang & Chen (2006) introduced a spatio-temporal independent component analysis method (stICA), coupled with multiscale analysis as a postprocessing step, for automated content-based video processing in indoor environments. Their system is computationally demanding because the data matrix, from which the independent components must be estimated using ICA, is of very large dimension. Recently, Tsai and Lai Tsai & Lai (2009) proposed an ICA model for foreground extraction without background updating in indoor environments. The authors proposed an algorithm for estimating the de-mixing matrix, which gives the independent components, by directly measuring statistical independence, estimating the joint and marginal probability density functions from relative frequency distributions. However, no details of an automated system are provided. These two related works do not handle background changes over time and are limited to monochrome images. Furthermore, their algorithms were only tested in indoor environments characterized by small environmental changes.

#### **3.2 Overview of the proposed background subtraction algorithm**

The proposed scheme is a complete modelization of the background subtraction task from an image sequence in real-world environments. Assuming the acquisition process is complete, the block diagram of the proposed framework is given in Figure 1. The algorithm can be divided into two complementary steps: a training step and a detection step.

− The first step consists of the estimation of the de-mixing matrix parameter by performing the ICA algorithm on background images only. While any foreground object may appear in the background images, the ICA algorithm allows estimating a source which represents the temporal difference between pixels. Typically, only the five most recent background images seem to be sufficient in our experiments. The matrix which allows separating the foreground from its background, termed the de-mixing matrix, is estimated in the following way: the ICA algorithm is performed only once on a data matrix from which the independent components, i.e. the background and the foreground, will be estimated. The data matrix is constructed from two images: the most recent background, and another on which a foreground object is arbitrarily added. The de-mixing matrix will be used in the detection step.

− The detection step consists of the approximation and the extraction of foreground objects. Here, the data matrix is constructed from two images: one is an incoming image from the sequence and the other is the most recent available background. The approximated foreground is then obtained simply by multiplying the data matrix with the de-mixing matrix. The approximated foreground is filtered in order to effectively segment the true foreground objects. This is performed by the use of a spatio-temporal belief propagation method. The principal guidelines of our framework are summarized in Algorithm 1.

Fig. 1. The block diagram of the proposed background subtraction algorithm.

ICA can be defined as a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements or signals. ICA defines a generative model for separating observed multivariate data that are mixtures of unknown sources, without any previous knowledge. It aims to find the source signals from observation data. In that model, the observed data are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed to be non-Gaussian and mutually independent; they are called the independent components of the observed data. The problem boils down to finding a linear representation in which the components are as statistically independent as possible. Formally, the data are usually represented by a matrix *X*, the unknown separated source signals by *S̃* and the unknown matrix that allows obtaining the independent components by *Ã*. Thus, every component *s̃<sub>i,t</sub>* of *S̃* is expressed as a linear combination of the observed variables, and the problem can be reformulated as Equation 1:

$$X = \tilde{A} \times \tilde{S} \tag{1}$$

After estimating the matrix *Ã*, its inverse, called *W̃*, can be computed for obtaining the independent components. The model becomes:

$$
\tilde{S} = \tilde{A}^{-1} \times X = \tilde{W} \times X \tag{2}
$$
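To make the model concrete, here is a minimal numpy-only sketch of estimating a 2 × 2 de-mixing matrix from a two-row data matrix with a FastICA-style fixed-point iteration (whitening, a tanh nonlinearity and symmetric decorrelation). It illustrates Equations 1 and 2 under our own simplifications; the chapter itself relies on the *FastICA* algorithm, whose exact implementation may differ.

```python
import numpy as np

def fast_ica_2x2(X, n_iter=200, tol=1e-6):
    """Estimate a de-mixing matrix W so that S = W @ X has independent rows."""
    # Center each row (each mixed signal).
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten: decorrelate and normalize variance via the covariance eigendecomposition.
    d, E = np.linalg.eigh(np.cov(X))
    whitener = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    Z = whitener @ X
    # Symmetric fixed-point iteration with nonlinearity g(u) = tanh(u).
    W = np.linalg.qr(np.random.randn(2, 2))[0]  # random orthogonal start
    for _ in range(n_iter):
        WZ = W @ Z
        g, g_prime = np.tanh(WZ), 1.0 - np.tanh(WZ) ** 2
        W_new = (g @ Z.T) / Z.shape[1] - np.diag(g_prime.mean(axis=1)) @ W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, via the SVD.
        u, _, vt = np.linalg.svd(W_new)
        W_new = u @ vt
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol:
            W = W_new
            break
        W = W_new
    # De-mixing matrix acting on the original (centered, unwhitened) data.
    return W @ whitener

# S_tilde = fast_ica_2x2(X) @ X   # Equation 2: the two estimated sources
```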

#### **3.3 Proposed background subtraction using ICA model**

We explain in this section the different parts of our proposed model and how the *FastICA* algorithm is performed for solving the ICA model. The number of independent components to be estimated is two: the background and the foreground. Moreover, these two components are assumed to be independent. Indeed, the presence of motion is characterized by a high variation of the intensity of a given pixel. It is to be noted that the presence of an arbitrary foreground object over a background is an independent phenomenon. That is, the intensity of a pixel in a foreground object does not depend on the intensity of its corresponding background.

**Algorithm 1** Background subtraction for Foreground objects segmentation.

**1.** Perform the *FastICA* algorithm on a data matrix obtained from two consecutive background images for noise model estimation. The noise model, obtained from the *k* previous frames, corresponds to the mean and the standard deviation of each color component.

**2.** Execute the *FastICA* algorithm only once in order to estimate the de-mixing matrix. The data matrix, from which the *FastICA* algorithm is performed, is constructed of two images: one corresponds to the background and the other corresponds to the background on which a foreground object is added.

**3.** Construct the data matrix for foreground approximation. The data matrix is composed of the Most Recent Available Background and an incoming image from the sequence.

**4.** The approximated foreground is obtained by multiplying the data matrix with the de-mixing matrix obtained from step 2.

**5.** Filter the estimated foreground by the use of a spatio-temporal belief propagation.

To begin with, we formulate a video sequence as a set I of sequential images. An image captured at time *t* is termed *I<sub>t</sub>*, where *t* = 1, . . . , *T*, *T* is the number of frames in the sequence, and I = ∪<sub>*t*∈*T*</sub> *I<sub>t</sub>*. We extend the ICA model to exploit the color information for video object extraction. Each image *I<sub>t</sub>* is a matrix of size *K* = *h* × *w*, where *h* and *w* are respectively the height and the width of the image. An observation, noted *I<sub>p,t</sub>*, corresponds to a pixel sample at location *p* = (*i*, *j*) at time *t*. Since the color information is considered in the design of our framework, we introduce the parameter *c*, which represents a component of a color space or a normalized color space. For instance, *c* means either the Red, Green or Blue component in the RGB color space, i.e. **R**<sup>3</sup>. For reasons of simplicity, *c<sup>i</sup>* means one component in the RGB color space, where *c*<sup>1</sup> means the Red, *c*<sup>2</sup> the Green and *c*<sup>3</sup> the Blue. The data matrix, termed *X*, is a matrix of two rows and *w* × *h* × 3 columns. To fit the data matrix, each color component of the image *I<sup>c</sup>* at time *t* is reshaped into a vector of size *h* × *w*. Each row of the matrix *X* is a vector *V<sub>k</sub>* = [*I<sub>k</sub><sup>c<sub>1</sub></sup>*, *I<sub>k</sub><sup>c<sub>2</sub></sup>*, *I<sub>k</sub><sup>c<sub>3</sub></sup>*]<sub>*t*</sub> consisting of the three adjacent color components, where *k* ∈ {*bg*, *bg* + *ob*}, so that the first row represents the background image while the second row represents an image containing an arbitrary foreground. *bg* and *bg* + *ob* correspond to the background, and the background on which an object is added. The estimated de-mixing matrix *W* is a 2-by-2 square matrix and the estimated source signals *S̃* have the same size as the data matrix *X*. The matrix *X* obtained at time *t* is given as follows:

$$X = \begin{pmatrix} V_{bg} \\ V_{bg+ob} \end{pmatrix}_t = \begin{pmatrix} I_{bg}^{c_1} & I_{bg}^{c_2} & I_{bg}^{c_3} \\ I_{bg+ob}^{c_1} & I_{bg+ob}^{c_2} & I_{bg+ob}^{c_3} \end{pmatrix}_t \tag{3}$$
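As an illustration of Equation 3, the following sketch builds the 2 × (*h* · *w* · 3) data matrix from two RGB images; the function name and the layout (three stacked color planes per row) are our own assumptions, chosen to be consistent with the text.

```python
import numpy as np

def build_data_matrix(img_bg, img_bg_ob):
    """Stack the R, G, B planes of each (h, w, 3) image side by side, one image per row."""
    v_bg = np.concatenate([img_bg[:, :, c].ravel() for c in range(3)])
    v_bg_ob = np.concatenate([img_bg_ob[:, :, c].ravel() for c in range(3)])
    return np.vstack([v_bg, v_bg_ob]).astype(np.float64)  # shape (2, h*w*3), Equation 3
```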

#### **3.3.1 Foreground approximation**

The *FastICA* algorithm is performed only once for initializing the detection process, which allows estimating the de-mixing matrix. The data matrix in the detection step is constructed in a different way from that of the training step. The data matrix is formed by an image containing only the recent background in the sequence, and another image containing an arbitrary foreground object, if any. The estimated source images correspond to the separated background and foreground: one represents only the background source and the other highlights the foreground object, without the detailed contents of the reference background. Figure 2 illustrates the inputs and outputs of the algorithm. The estimated de-mixing matrix will be used to extract the foreground objects from their background during the detection step. In the detection step, the source components are extracted simply by multiplying the data matrix with the estimated de-mixing matrix. The data matrix is updated for each incoming image in the sequence in the following way: the second row of the data matrix corresponds to the recent incoming image from the sequence, while the first row corresponds to the Most Recent Available Background (MRAB).

Using this configuration, the two images which constitute the data matrix are not very different because of their temporal proximity. The noise existing between two consecutive images does not degrade the ICA performance and still allows an estimation of the hidden sources. The estimated signals are obtained by multiplying the data matrix and the de-mixing matrix.

Fig. 2. Principle of the background subtraction using Independent Component Analysis.

$$\tilde{S} = \begin{pmatrix} \tilde{V}_{bg} \\ \tilde{V}_{bg+ob} \end{pmatrix} = \begin{pmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \end{pmatrix} \times \begin{pmatrix} V_{bg} \\ V_{bg+ob} \end{pmatrix} \tag{4}$$

The first row of the matrix *S̃* corresponds to the background model, while the second row represents the estimated foreground signal in which only moving or stationary objects are highlighted. From then on, only the estimated foreground will be taken into account and will be called the "Approximate Foreground". The second row of the matrix *S̃* is reshaped from a vector of size *h* × *w* × 3 to a 2D color image of size (*h*, *w*) by the linear transformation given by Equation 5:

$$I_{i,j,t}^{c} = \tilde{S}(1, (l \ast K) + (i \ast h) + j) \tag{5}$$

where *c* ∈ {*R*, *G*, *B*}, *K* = *h* × *w*, and *l* is an integer which takes the value 1 if *c* = *R*, 2 if *c* = *G*, and 3 if *c* = *B*. Figure 3 depicts an example of foreground objects extracted by multiplying a de-mixing matrix, estimated using the *FastICA* algorithm, and a data matrix formed by the images (a) and (b). The two vectors that form the de-mixing matrix are respectively w1 = [0.0285312, 0.0214363] and w2 = [−0.519938, 0.548299]. Vector w1 allows obtaining the estimated background image while vector w2 allows obtaining the estimated foreground image. Typically, the estimated foreground highlights moving and stationary objects, which are surrounded by a noisy and uniform background.
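A minimal sketch of the detection step described by Equations 4 and 5, assuming the channel layout of the data matrix built above; the row and reshape conventions are our reading of the text, not the authors' code.

```python
import numpy as np

def approximate_foreground(W, X, h, w):
    """Multiply by the de-mixing matrix and reshape the foreground row into an image."""
    S = W @ X                      # Equation 4: row 0 ~ background, row 1 ~ foreground
    fg = S[1].reshape(3, h, w)     # one h*w block per color component, as in Equation 5
    return np.transpose(fg, (1, 2, 0))  # back to an (h, w, 3) color image
```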

The estimated signal is characterized by the presence of zones corresponding to a high intensity variation of the background, together with a lot of noise. We have no *a priori* knowledge about the noise distribution, making foreground extraction a difficult task.

#### **3.3.2 Foreground extraction**

#### *A. MRF Formulation and Energy Minimization*

In this part, we propose a robust framework to accurately extract foreground objects from the estimated foreground signal. This module aims at filtering the estimated foreground by reducing the noise obtained from the ICA model. The problem can be expressed in terms of a Markov Random Field (MRF) in which an inference algorithm is applied to find the most likely setting of the model. Several robust inference algorithms, such as Graph Cuts and Belief Propagation, have emerged and proved their efficiency, especially in the realm of stereo Yang et al. (2009) and image restoration Felzenszwalb & Huttenlocher (2006). The formulation we propose aims at clearly separating the foreground from its background by introducing spatial and temporal dependencies between the pixels. The rest of this section focuses on the spatio-temporal formulation and the algorithm used for minimizing such an energy. In what follows, the problem is formulated as a graphical model which consists of an undirected graph on which the inference is performed by approximating the MAP estimate for this MRF using loopy belief propagation. The bipartite graph is denoted by G = (P, *E*), where P is a set of nodes, i.e. pixels, and *E* is a set of undirected edges between nodes. Each pixel *x* is modeled by a state noted *ṡ<sub>x</sub>*. In computer vision, the edges allow establishing spatial dependencies between nodes. During the message passing procedure, a label is assigned to each node, which is the vector of three color components. A state *ṡ<sub>x</sub>* = (*l<sub>t</sub>*, . . . , *l<sub>t−k</sub>*) of a pixel *x* is modeled by a vector of labels such that a label *l<sub>t</sub>* corresponds to the color components of pixel *p* at time *t*.

Fig. 3. Background subtraction and foreground approximation: (a) background image from the "Pontet" dataset, (b) scene image from the same dataset containing a car and two pedestrians, (c) estimated foreground image obtained by multiplying the data matrix, formed by the images (a) and (b), and a de-mixing matrix, where w2 = [−0.519938, 0.548299], and (d) zoom on a part of the background, a pedestrian and a car that we attempt to separate.

*<sup>E</sup>*(G) = ∑

as a constant.

*x*∈*P*

on Stereo Vision for Level Crossings Safety Applications

*Dx*(*s*˙*x*(*t*)) + ∑

*x*∈P *y*∈N*s*,*<sup>x</sup>*

the evolution of the energy function, the data term can be written as:

The rest of parameters are not considered for the computation of this term.

*Dx*(*s*˙*x*) =

The temporal filtering term is given by the following:

*θ* 

*<sup>s</sup>*˙*x*(*t*),*s*˙*x*(*t*−*i*)

neighboring pixels which are classified as background, 0 otherwise.

*Vs*(*s*˙*x*(*t*),*s*˙*y*(*t*)) + ∑

0 if |*s*˙*x*(*t*) − *lx*| ≤ *ε*

– Data term : In stereo problems, data term usually corresponds to the cost of matching a pixel in the left-image to another in the right-image. Typically, this term, i.e. cost, is based on the intensity differences between the two pixels. Just like in this present case, the data term *Dx*(*s*˙*x*) is defined as the cost of assigning a label *lx* to pixel *x*. It can be expressed as the Euclidean distance between the color components of the pixel *x* and a given label *lx*. In order to control

<sup>85</sup> Intelligent Surveillance System Based

where *ε* is constant. The data term depends only on the parameter *α* of the state of the node.

– Spatial smoothness term : The choice of the smoothness term is a critical issue. Several cost terms have been proposed which heavily depend on the problem to be solved. Assuming the pair-wise interaction between two adjacent pixels *x* and *y*, the spatial smoothness term can be formulated using the Potts model Wu (1982). This is motivated by the fact that this piecewise smooth model encourages a labeling consisting of several regions where pixels in the same moving region have similar labels. The cost of the spatial smoothness term is given by:

*Vs*(*s*˙*x*(*t*),*s*˙*y*(*t*)) = 0 if <sup>Δ</sup>*x*,*<sup>y</sup>* <sup>≤</sup> *<sup>ξ</sup>*

where Δ*x*,*<sup>y</sup>* is the Euclidean distance between the two neighboring pixels *x* and *y* at the same time, *ξ* is a constant, and *T* is the temperature variable of the Potts model which can be estimated appropriately by simulating the Potts model. In our case, we choose to take *T*

– Temporal filtering term Making use of an additional term which represents the temporal filtering in the energy function is very useful for improving the performances of the passing message procedure. The optimal labels obtained for a node during the *k* previous images are used for quickly reaching the optimal label in the current node at time *t*. Each current node uses its previous best set of labels obtained for the same node during the *k* previous images.

= ∑*κ*.

where *κ* is a binary parameter which takes values 0 or 1. In the case where the most temporally neighboring pixel is classified as a background, the parameter *κ* is set to 1 for all temporally

 

*<sup>s</sup>*˙*x*(*t*) <sup>−</sup> *<sup>s</sup>*˙*x*(*t*−*i*)

 

(12)

*x*(*t*−*i*)∈N*t*,*x*

<sup>|</sup>*s*˙*x*(*t*) <sup>−</sup> *lx*<sup>|</sup> otherwise (10)

<sup>Δ</sup>*x*,*y*.*<sup>T</sup>* otherwise (11)

*Vt*(*s*˙*x*(*t*),*s*˙*x*(*t*−*<sup>i</sup>*)) (9)

Referring to the standard four-connected rectangular lattice configuration, the joint probability of the MRF is given by the product of one-node and two-node potentials having spatial and temporal dependencies, as follows:

$$P(\mathcal{G}) = \prod_{x \in \mathcal{P}} \Phi(\dot{s}_{x(t)}) \prod_{\substack{x \in \mathcal{P} \\ y \in \mathcal{N}_{s,x}}} \Psi(\dot{s}_{x(t)}, \dot{s}_{y(t)}) \prod_{x(t-i) \in \mathcal{N}_{t,x}} \Theta(\dot{s}_{x(t)}, \dot{s}_{x(t-i)}) \tag{6}$$

where Φ, Ψ and Θ are functions which describe the dependency between nodes, detailed in section *B*. *ṡ<sub>x</sub>* and *ṡ<sub>y</sub>* are the states of nodes *x* and *y* respectively, given that *y* is one of the four spatial neighbors of node *x*, and *ṡ<sub>x(t)</sub>* is the state of the most recent node *x* at time *t*. For a given pixel *x*, the spatial four-connected nodes form a set of spatial neighbors noted N<sub>s,x</sub>, and the consecutive temporal neighbors are denoted by N<sub>t,x</sub>. Typically, the optimization is performed by computing the *a posteriori* belief of a variable, which is NP-hard. This has generally been viewed as being too slow to be practical for early vision. The idea is to approximate the optimal solution by inference using belief propagation, which is one of the most efficient methods for finding the optimal solution. This allows minimizing the energy function using either the Maximum A Posteriori (MAP) or the Minimum Mean Squared Error (MMSE) estimator.
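For reference, the generic min-sum message update used by loopy belief propagation on such a pairwise MRF has the following standard form (given here for illustration, using the data and smoothness costs of section *B*; it is not quoted from the chapter):

$$m^{new}_{x \to y}(l) = \min_{l'} \Bigl( D_x(l') + V(l', l) + \sum_{k \in \mathcal{N}(x) \setminus \{y\}} m_{k \to x}(l') \Bigr)$$

After convergence, each node selects the label minimizing its belief, i.e. the sum of its data term and all incoming messages.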

#### *B. Energy Minimization using Spatio-Temporal Belief Propagation*

Intuitively, the objectives can be reformulated, in terms of energy minimization, as the search for the optimal labeling *f*<sup>∗</sup> that assigns each pixel *x* ∈ P a label *l* ∈ L by minimizing an energy function. In our case, the energy to minimize is represented as a linear combination of three terms: a data term, a spatial smoothness term, and a temporal filtering term. The data term measures how well state *ṡ* fits pixel *x*, given its observed data; the spatial smoothness term measures the extent to which the state *ṡ* is not spatially piecewise smooth; and the temporal filtering term evaluates the temporal dependencies of the consecutive states of a pixel *x* over time by using its known previous optimal labels. Checking both the piecewise spatial smoothness and the temporal filtering allows obtaining a robust motion-based classification.

Intuitively, the optimal labeling can be found by maximizing a probability. The MAP estimate is equivalent to minimizing Equation 6 by taking the negative log, so writing *φ* = −log Φ, *ψ* = −log Ψ, and *θ* = −log Θ, the objective can be reformulated as minimizing the negative log posterior, a function of the form:

$$E(\mathcal{G}) = \sum_{x \in \mathcal{P}} \phi(\dot{s}_{x(t)}) + \sum_{\substack{x \in \mathcal{P} \\ y \in \mathcal{N}_{s,x}}} \psi(\dot{s}_{x(t)}, \dot{s}_{y(t)}) + \sum_{x(t-i) \in \mathcal{N}_{t,x}} \theta(\dot{s}_{x(t)}, \dot{s}_{x(t-i)}) \tag{7}$$

The formulation we consider in our framework consists in optimizing the data term, denoted by $E_{data}(\hat{f})$, the spatial smoothness term, denoted by $E_{s\_smooth}(\hat{f})$, and the temporal filtering term, denoted by $E_{t\_filtering}(\hat{f})$. By treating the problem in the context of filtering, the outputs from previous frames can be incorporated by adding an extra term to the energy function. The Global Energy Minimization function can be formulated as follows:

$$E(\mathcal{G}) = E_{data}(\hat{f}) + E_{s\_smooth}(\hat{f}) + E_{t\_filtering}(\hat{f}) \tag{8}$$

Hence, Equation 7 can be expressed as:


$$E(\mathcal{G}) = \sum_{x \in \mathcal{P}} D_x(\dot{s}_{x(t)}) + \sum_{\substack{x \in \mathcal{P} \\ y \in \mathcal{N}_{s,x}}} V_s(\dot{s}_{x(t)}, \dot{s}_{y(t)}) + \sum_{x(t-i) \in \mathcal{N}_{t,x}} V_t(\dot{s}_{x(t)}, \dot{s}_{x(t-i)}) \tag{9}$$

– Data term: In stereo problems, the data term usually corresponds to the cost of matching a pixel in the left image to another in the right image. Typically, this term, i.e. the cost, is based on the intensity differences between the two pixels. In the present case, the data term *D<sub>x</sub>*(*ṡ<sub>x</sub>*) is defined as the cost of assigning a label *l<sub>x</sub>* to pixel *x*. It can be expressed as the Euclidean distance between the color components of the pixel *x* and a given label *l<sub>x</sub>*. In order to control the evolution of the energy function, the data term can be written as:

$$D_x(\dot{s}_x) = \begin{cases} 0 & \text{if } |\dot{s}_{x(t)} - l_x| \le \varepsilon \\ |\dot{s}_{x(t)} - l_x| & \text{otherwise} \end{cases} \tag{10}$$

where *ε* is a constant. The data term depends only on the parameter *α* of the state of the node. The rest of the parameters are not considered for the computation of this term.

– Spatial smoothness term: The choice of the smoothness term is a critical issue. Several cost terms have been proposed, which heavily depend on the problem to be solved. Assuming pair-wise interaction between two adjacent pixels *x* and *y*, the spatial smoothness term can be formulated using the Potts model Wu (1982). This is motivated by the fact that this piecewise smooth model encourages a labeling consisting of several regions where pixels in the same moving region have similar labels. The cost of the spatial smoothness term is given by:

$$V_s(\dot{s}_{x(t)}, \dot{s}_{y(t)}) = \begin{cases} 0 & \text{if } \Delta_{x,y} \le \xi \\ \Delta_{x,y} \cdot T & \text{otherwise} \end{cases} \tag{11}$$

where Δ<sub>x,y</sub> is the Euclidean distance between the two neighboring pixels *x* and *y* at the same time, *ξ* is a constant, and *T* is the temperature variable of the Potts model, which can be estimated appropriately by simulating the Potts model. In our case, we choose to take *T* as a constant.

– Temporal filtering term: Making use of an additional term which represents the temporal filtering in the energy function is very useful for improving the performance of the message passing procedure. The optimal labels obtained for a node during the *k* previous images are used for quickly reaching the optimal label in the current node at time *t*. Each current node uses the best set of labels obtained for the same node during the *k* previous images. The temporal filtering term is given by the following:

$$V_t(\dot{s}_{x(t)}, \dot{s}_{x(t-i)}) = \sum \kappa \left\| \dot{s}_{x(t)} - \dot{s}_{x(t-i)} \right\| \tag{12}$$

where *κ* is a binary parameter which takes the value 0 or 1: in the case where the most temporally neighboring pixel is classified as background, the parameter *κ* is set to 1 for all temporally neighboring pixels which are classified as background, and to 0 otherwise.
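A toy sketch of how the three terms of Equation 9 could be evaluated for one pixel, following Equations 10, 11 and 12; the parameter values (*ε*, *ξ*, *T*) are left to the caller, and the state representation is a simplified stand-in for the label vectors of the text.

```python
import numpy as np

def data_term(s_x, l_x, eps):
    """Equation 10: Euclidean distance to the label, clipped to zero within eps."""
    d = np.linalg.norm(np.asarray(s_x) - np.asarray(l_x))
    return 0.0 if d <= eps else d

def spatial_term(s_x, s_y, xi, T):
    """Equation 11: Potts-style cost between two spatially neighboring states."""
    delta = np.linalg.norm(np.asarray(s_x) - np.asarray(s_y))
    return 0.0 if delta <= xi else delta * T

def temporal_term(s_now, past_states, kappas):
    """Equation 12: kappa is 1 only for temporal neighbors classified as background."""
    return sum(k * np.linalg.norm(np.asarray(s_now) - np.asarray(s))
               for s, k in zip(past_states, kappas))
```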

#### *C. Foreground/Background Classification*

This final step consists in automatically extracting all foreground objects from their background in the aforementioned filtered foreground image, noted *I<sub>f</sub>*. The classification step is preceded by postprocessing the output color image, on which the background is uniform while all moving and stationary objects are well highlighted. The postprocessing aims at binarizing the image and classifying all pixels into two classes: foreground and background. To this end, the color components *I<sub>f</sub><sup>c<sub>i</sub></sup>*, where *c<sub>i</sub>* is the *i*<sup>th</sup> color component, are extracted from the color image *I<sub>f</sub>*. Then, a Sobel operator *Sob* is applied to each color component; it performs a 2-D spatial gradient measurement on the image and so emphasizes regions of high spatial frequency that correspond to edges. For each color component, two edge images are obtained, which represent the edges obtained from the horizontal and the vertical directions. The union of these two images forms the V-H (Vertical-Horizontal) edges of the image. Thus, an edge image is obtained for each color component. The final edge image is obtained by intersecting the points of the edges of the three images. The final edges *E* of the image are obtained as follows.

$$E(I_f^{c_i}) = Sob(I_f^{c_i})_{dx} \cup Sob(I_f^{c_i})_{dy} \tag{13}$$

$$E(T_f) = E(I_f^{c_1}) \cap E(I_f^{c_2}) \cap E(I_f^{c_3}) \tag{14}$$

The final map, which contains only moving and stationary objects, is obtained by merging *E*(*T<sub>f</sub>*) and another map *S*(*T<sub>f</sub>*). The *S*(*T<sub>f</sub>*) map is obtained by a color segmentation process applied to the filtered image.
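A short sketch of the edge-based postprocessing of Equations 13 and 14 using per-channel Sobel filtering; the binarization threshold is an illustrative assumption, since the chapter does not specify one.

```python
import numpy as np
from scipy.ndimage import sobel

def final_edges(I_f, thresh=50.0):
    """Per-channel union of horizontal/vertical Sobel edges, then intersection."""
    edges = []
    for c in range(3):
        channel = I_f[:, :, c].astype(float)
        gx = np.abs(sobel(channel, axis=1))   # horizontal derivative: vertical edges
        gy = np.abs(sobel(channel, axis=0))   # vertical derivative: horizontal edges
        edges.append((gx > thresh) | (gy > thresh))   # union, Equation 13
    return edges[0] & edges[1] & edges[2]             # intersection, Equation 14
```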

#### **4. 3D localization of obstacles by stereo matching**

The use of more than one camera provides additional information, such as the depth of objects in a given scene. Dense or sparse stereovision techniques can be used to match points. When a point is imaged from two different viewpoints, its image projection undergoes a displacement from its position in the first image to that in the second image. Each disparity, determined for each pixel in the reference image, represents the coordinate difference of the projection of a real point in the 3-D space, i.e. scene point, in the left- and right-hand images of the cameras. A depth map is obtained from the two images and, for each disparity, the corresponding real 3-D coordinates are estimated according to the intrinsic and extrinsic parameters, such as the focal length and the baseline. The amount of displacement, alternatively called disparity, is inversely proportional to distance and may therefore be used to compute 3D geometry. Given a correspondence between imaged points from two known viewpoints, it is possible to compute depth by triangulation.
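The inverse relation between disparity and distance described above reduces, for a rectified pair, to the classic triangulation formula *Z* = *f* · *B* / *d*; the sketch below and its numeric example are generic, with an assumed focal length and baseline.

```python
def depth_from_disparity(d, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (f in pixels, B in meters)."""
    if d <= 0:
        raise ValueError("disparity must be positive for a visible scene point")
    return focal_px * baseline_m / d

# e.g. depth_from_disparity(d=32, focal_px=700, baseline_m=0.4) -> 8.75 meters
```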

Several well-known stereo algorithms compute an initial disparity map from a pair of images under a known camera configuration. These algorithms are based loosely on local methods, such as window correlation, which take into account only neighborhood points of the pixel to be matched. The obtained disparity map has a lot of noise and erroneous values. This noise concerns mostly the pixels belonging to occluded or textureless image regions. An iterative process is then applied to the initial disparity map in order to improve it. These methods use global primitives. Some research has used a graph-based method Foggia et al. (2007) and color segmentation based stereo methods Taguchi et al. (2008), which belong to what are called *global approaches*. Other approaches have been proposed: they are based on a probabilistic framework optimization, such as belief propagation Lee & Ho (2008). These methods aim to obtain high-quality and accurate results, but are very expensive in terms of processing time. It is a real challenge to evaluate stereo methods in the case of noise, depth discontinuity, occlusions and non-textured image regions.
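For concreteness, here is a minimal sketch of the local window-correlation matching mentioned above, using a plain sum-of-squared-differences (SSD) cost over grayscale images; it illustrates the classic local method discussed in this paragraph, not the WACD dissimilarity function used later in the chapter.

```python
import numpy as np

def ssd_disparity(left, right, max_disp=32, half_win=3):
    """Winner-takes-all SSD matching over a (2*half_win+1)^2 window."""
    left, right = left.astype(np.float64), right.astype(np.float64)
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for i in range(half_win, h - half_win):
        for j in range(half_win + max_disp, w - half_win):
            ref = left[i - half_win:i + half_win + 1, j - half_win:j + half_win + 1]
            costs = [np.sum((ref - right[i - half_win:i + half_win + 1,
                                         j - d - half_win:j - d + half_win + 1]) ** 2)
                     for d in range(max_disp)]
            disp[i, j] = int(np.argmin(costs))  # best-scoring candidate wins
    return disp
```

Such purely local estimates are exactly the noisy initial maps that the global refinement discussed above is designed to correct.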

In our context, the proposed stereo matching algorithm is applied only on moving and stationary zones automatically detected using the technique discussed in the previous section. However, costs and processing time are decreasing at a steady pace, and it is becoming realistic to believe that such a thing will be commonplace soon. The stereo algorithm presented further springs from the relaxation and energy minimization fields. Our approach aims to provide a novel framework to improve color dense stereo matching. As a first step, the disparity map volume is initialized by applying a new local correlation function. After that, a confidence measure is attributed to all the pairs of matched pixels, which are classified into two classes: well-matched pixels and badly-matched pixels. Then, the disparity value of all badly-matched pixels is updated based only on stable pixels classified as well-matched in a homogeneous color region. The well-matched pixels are used as input into disparity re-estimation modules to update the remaining pixels. The confidence measure is based on a set of original local parameters related to the correlation function used in the first step of our algorithm. This chapter will focus on this point. The main goal of our study is to take into account both quality and speed. The global scheme of the stereo matching algorithm we propose is given by Figure 4.

Fig. 4. Global scheme of the stereo matching algorithm for 3D localization.

The two-frame stereo matching approaches allow computing disparities and detecting occlusions, assuming that each pixel in the input image corresponds to a unique depth value. The stereo algorithm described in this section stems from the inference principle based on hierarchical belief propagation and energy minimization. It takes advantage of local methods to reduce the complexity of the Belief Propagation method, which leads to an improvement in the quality of the results. A Hierarchical Belief Propagation (HBP) based on a Confidence Measure technique is proposed: first, the data term (detailed in Section 4.1) is computed using the WACD dissimilarity function. The obtained disparity volume allows initializing the Belief Propagation graph by attributing a set of possible labels (i.e. disparities) to each node (i.e. pixel). The originality is to consider a subset of nodes among all the nodes to begin the inference algorithm. This subset is obtained thanks to a confidence measure computed at each node of a graph of connected pixels. Second, the propagation of messages between nodes is performed hierarchically, from the nodes having the highest confidence measure to those having the lowest one. A message is a vector of parameters (e.g. possible disparities, (*x*, *y*) coordinates, etc.) that describes the state of a node. To begin with, the propagation is performed inside a homogeneous color region and then passed from one region to another. The set of regions is obtained by a color-based segmentation using the MeanShift method Comaniciu & Meer (2002). A summary of our algorithm is given in Algorithm 2:

**Algorithm 2** Hierarchical Belief Propagation.

**1)** Initialize the data cost for nodes in the graph using the method in Fakhfakh et al. (2010).
**2)** Compute a Confidence Measure *ψ*(*p<sup>x→x′,y</sup>*) for each node.
**3)** Repeat steps a, b, c and d for each node:
	- **a)** Select node (i.e. pixel) *Node<sub>i</sub>*, *i* being the node number, having a data term lower than a confidence threshold *ε*.
	- **b)** Select the k-nearest neighbor nodes within a cubic 3D support window that have a *ψ*(*p<sup>x→x′,y</sup>*) greater than *ε*.
	- **c)** Update the label of the current node.
	- **d)** Update the weight of the current node.
**4)** Repeat step 3) until reaching minimal energy.
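The control flow of Algorithm 2 can be sketched as below, under stated assumptions: the node layout, the visiting order (here, simply by decreasing weight), the majority-vote label update and the weight update are simplified placeholders for the message-passing rules of Sections 4.1 to 4.3, and every identifier is ours.

```python
# Schematic sketch of Algorithm 2, not the authors' implementation.
import numpy as np

def hierarchical_bp(data_cost, confidence, coords, k=8, eps=0.6, n_iters=10):
    """data_cost:  (N, L) initial cost per node and label (e.g. WACD scores).
    confidence:  (N,) confidence measure psi per node, in [0, 1].
    coords:      (N, 3) (x, y, d) position of each node in the support volume."""
    labels = data_cost.argmin(axis=1)          # step 1): initial labels from the data cost
    weight = confidence.copy()                 # step 2): psi used as node weight
    for _ in range(n_iters):                   # step 4): iterate until energy settles
        order = np.argsort(-weight)            # high-confidence nodes speak first
        for i in order:                        # step 3a): visit each node
            # step 3b): k nearest neighbours in the 3D window, kept only if confident
            d = np.linalg.norm(coords - coords[i], axis=1)
            nb = np.argsort(d)[1:k + 1]
            nb = nb[weight[nb] > eps]
            if nb.size == 0:
                continue
            # step 3c): label chosen by confidence-weighted vote of the neighbours
            votes = np.bincount(labels[nb], weights=weight[nb])
            labels[i] = votes.argmax()
            # step 3d): node weight raised toward its neighbours' mean confidence
            weight[i] = max(weight[i], weight[nb].mean())
    return labels
```

In the chapter's scheme the label update is driven by messages inside homogeneous MeanShift regions rather than by a plain vote; the sketch only mirrors the confidence-ordered, neighborhood-restricted structure of steps 1) to 4).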

#### **4.1 Global energy formulation**

The global energy to minimize is composed of two terms: a data cost and a smoothness constraint, denoted *f* and *f̂* respectively. The first term, *f*, evaluates the local matching of pixels, i.e. how well a label *l* attributed to a node (pixel *p*) of a graph G fits the observed data. The second term, *f̂*, evaluates the smoothness constraint between the labels of neighboring pixels. Some works Felzenszwalb & Huttenlocher (2006) Boykov et al. (2001) consider the smoothness term as the amount of difference between the disparities of neighboring pixels; this can be seen as the cost of assigning a label *l*′ to a node during the inference step, where L is the set of all the possible disparity values for a given pixel. The Global Energy Minimization function can be formulated as follows (Equation 15):

$$E(\mathcal{G}) = E\_{l \in \mathcal{L}}(f) + E\_{l' \in \mathcal{L}}(\hat{f}) \tag{15}$$

The minimization of this energy is performed iteratively by passing messages between all the neighboring nodes. These messages are updated at each iteration, until convergence. A node can be seen as a pixel carrying a vector of parameters, typically its possible labels (i.e. disparities). However, reducing the complexity of the inference algorithm leads in most cases to reduced matching quality. Other algorithm variants can be derived from this basic model by introducing additional parameters in the message to be passed; one of the important parameters is the spatio-colorimetric proximity between nodes Trinh (2008).
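For reference, a standard min-sum message update from the belief propagation literature (cf. Felzenszwalb & Huttenlocher (2006)) is reproduced below in the chapter's notation; the message symbol *m<sub>p→q</sub>* is ours, and the hierarchical variant of Section 4.3 changes which nodes send messages and in which order, not the update rule itself:

$$m^{t}\_{p \to q}(l) = \min\_{l' \in \mathcal{L}} \Big( f\_p(l') + \hat{f}(l', l) + \sum\_{s \in N(p) \setminus \{q\}} m^{t-1}\_{s \to p}(l') \Big)$$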

– The data term we propose can be defined as a local evaluation of attributing a label *l* to a node. It is given by Equation 16:

$$E\_{l \in \mathcal{L}}(f) = \sum\_{p} \alpha \, \phi^{x \rightarrow x', y}(z\_1) \tag{16}$$

Where *φ<sup>x→x′,y</sup>*(*z<sub>m</sub>*) is the *m*-th dissimilarity cost obtained for each matched pair of pixels (*p<sup>x′,y</sup>*, *p<sup>x,y</sup>*), *z*<sub>1</sub> being the rank of the best score. The parameter *α* is a fuzzy value within the [0, 1] interval, derived from the confidence of attributing a disparity value *d* to the pixel *p*; it is given by Equation 17:

$$\alpha = \begin{cases} \psi(p^{x \rightarrow x', y}) & \text{if } \psi(p^{x \rightarrow x', y}) > \epsilon \\ 0 & \text{otherwise} \end{cases} \tag{17}$$

Where *ψ*(*p<sup>x→x′,y</sup>*) is a confidence measure computed for each pair (*p<sup>x,y</sup>*, *p<sup>x′,y</sup>*) of matched pixels and *ε* is a confidence threshold. The way of computing this confidence measure is detailed in Section 4.2.

– The smoothness term is used to ensure that neighboring pixels have similar disparities.
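As a minimal illustration of Equations 16 and 17 (a sketch, not the authors' code), assuming the rank-1 WACD scores `phi_z1`, the confidence values `psi` and the threshold `eps` are given per pixel:

```python
# Confidence-gated data term of Eqs. 16-17; eps is a hypothetical threshold value.
import numpy as np

def data_term(phi_z1, psi, eps=0.6):
    alpha = np.where(psi > eps, psi, 0.0)  # Eq. 17: fuzzy weight in [0, 1]
    return float(np.sum(alpha * phi_z1))   # Eq. 16: only confident pixels contribute
```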


#### **4.2 Selective matching approach**

Using the WACD dissimilarity function allows initializing the set of labels. It represents a first estimate of the disparity map, which still contains matching errors. Then, each pair of matched pixels is evaluated using the Confidence Measure method described in Fakhfakh et al. (2009). The likelihood function used to initialize the disparity set is applied to each pixel of the image. Furthermore, for each matched pair of pixels a confidence measure is computed. It is termed *ψ*(*p<sub>l</sub><sup>x,y</sup>*, *p<sub>r</sub><sup>x′,y</sup>*) and represents the level of certainty of considering a label *l* as the best label for pixel *p*. This confidence measure function depends on several local parameters and is given by Equation 18:

$$\psi(p\_l^{x,y}, p\_r^{x',y}) = P(p\_r^{x',y} \mid p\_l^{x,y}, \tau, \min, \sigma, \omega) \tag{18}$$

The confidence measure with its parameters is given by Equation 19:

$$\psi(p\_l^{x,y}, p\_r^{x',y}) = \left(1 - \frac{\min}{\omega}\right)^{\tau^2 \log(\sigma)} \tag{19}$$

Where

– *Best Correlation Score (min):* The output of the dissimilarity function is a measure of the degree of similarity between two pixels. The candidate pixels are ranked in increasing order according to their corresponding scores, and the pair with the minimum score is considered the best match. The lower the score, the better the matching; the nearer the minimum score is to zero, the greater the chance that the candidate pixel is the actual correspondent.

– *Number of Potential Candidate Pixels (τ):* This parameter represents the number of potential candidate pixels having similar scores. *τ* has a strong influence because it reflects the behavior of the dissimilarity function. A high value of *τ* means that the first candidate pixel is located in a uniform color region of the frame. The lower the value of *τ*, the fewer the candidate pixels; with few candidates, the chosen candidate pixel has a greater chance of being the actual correspondent, as the pixel to be matched then belongs to a region with high variation of the color components. A very small value of *τ* together with a *min* score close to zero means that the pixel to be matched most probably belongs to a region of high color variation.

– *Disparity variation of the τ pixels (σ):* A disparity value is obtained for each candidate pixel. For the *τ* potential candidate pixels, we compute the standard deviation *σ* of the *τ* disparity values. A small *σ* means that the *τ* candidate pixels are spatially close; in this case, the true candidate pixel should belong to a particular region of the frame, such as an edge or a transition point, which increases the confidence measure. A large *σ* means that the *τ* candidate pixels taken into account are situated in a uniform color region.

– *Gap value (ω):* This parameter represents the difference between the *τ*-th and (*τ* + 1)-th scores given by the dissimilarity function. It is introduced to adjust the impact of the minimum score.

To ensure that function *ψ* takes a value between 0 and 1, a few constraints are introduced. The *min* parameter must not be higher than *ω*; if it is, *ω* is forced to *min* + 1. Moreover, the *log*(*σ*) term is used instead of *σ*, so as to reduce the impact of high values of *σ* and obtain coherent confidence measures. The number *τ* of potential candidate pixels is deduced from the L scores obtained with the WACD likelihood function. The main idea is to detect major differences between successive scores, called main gaps. Let *φ* denote a discrete function which represents all the scores given by the dissimilarity function in increasing order. We introduce a second function, denoted *η*, which represents the average growth rate of the *φ* function: *η* can be seen as the ratio of the difference between a given score and the first score to the difference between their ranks. This function is defined in Equation 20:

$$\eta(\phi^{x',y}) = \frac{\phi^{x',y}(z\_m) - \phi^{x',y}(z\_1)}{z\_m - z\_1} \qquad m \in \mathcal{L} \tag{20}$$

where *φ<sup>x′,y</sup>*(*z<sub>m</sub>*) is the *m*-th dissimilarity cost among the L scores obtained for the pair of matched pixels (*p<sup>x,y</sup>*, *p<sup>x′,y</sup>*), and *z<sub>m</sub>* is the rank of the *m*-th score. *η*(*φ<sup>x′,y</sup>*) is a discrete function that highlights the large gaps between scores. It is materialized by Equation 21:

$$\xi(\phi^{x',y}) = \begin{cases} \frac{\nabla \eta^{x',y}}{m^2} & \text{if } \nabla \eta^{x',y} \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{21}$$

This function (Equation 21) is used to characterize the major scores and is applied only where the gradient ∇*η<sup>x′,y</sup>* has a positive sign. The parameter *m*<sup>2</sup> is introduced in order to penalize the candidate pixels according to their rank. The number of candidate pixels is given by Equation 22:


$$\tau = \underset{m}{\arg\max}\ \xi(\phi^{x',y}) \tag{22}$$
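The following NumPy sketch strings Equations 19 to 22 together for a single pixel, under stated assumptions: `scores` are the L dissimilarity scores sorted in increasing order, `disparities` are the candidate disparities attached to each ranked score, and the boundary guards follow the constraints described above (any extra clamping is ours).

```python
# NumPy sketch of Eqs. 19-22 for one pixel; guard values are our assumptions.
import numpy as np

def confidence_measure(scores, disparities):
    scores = np.asarray(scores, dtype=float)
    z = np.arange(1, len(scores) + 1, dtype=float)     # ranks z_1 ... z_L
    # Eq. 20: average growth rate of the score curve, for m = 2 ... L
    eta = (scores[1:] - scores[0]) / (z[1:] - z[0])
    grad = np.diff(eta, prepend=eta[0])                # discrete gradient of eta
    m = z[1:]
    # Eq. 21: keep non-negative gradients, penalized by the squared rank
    xi = np.where(grad >= 0, grad / m**2, -1.0)
    tau = int(m[np.argmax(xi)])                        # Eq. 22: candidate count
    # Eq. 19 with its constraints: omega is the gap between the tau-th and
    # (tau+1)-th scores, forced above min; log(sigma) tempers large spreads
    mn = scores[0]
    omega = scores[tau] - scores[tau - 1] if tau < len(scores) else mn + 1.0
    if omega <= mn:                                    # also avoids division by zero
        omega = mn + 1.0
    sigma = max(np.std(disparities[:tau]), 1.0)        # keeps the exponent >= 0
    return (1.0 - mn / omega) ** (tau**2 * np.log(sigma))
```

For instance, a pixel whose best score is small relative to the following gap (small *min*, larger *ω*) and whose *τ* candidates carry nearly identical disparities receives a *ψ* close to 1.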

#### **4.3 Hierarchical belief propagation**

The inference algorithm based on a belief propagation method Felzenszwalb & Huttenlocher (2006) can be applied to reach the optimal solution that corresponds to the best disparity set. A set of messages is iteratively transmitted from each node to its neighbors until convergence. This global optimization is NP-hard and far from real time. In this basic framework, all the nodes have the same weight. The main drawback is that several erroneous messages might be passed across the graph, leading to an increase in the number of iterations without any guarantee of reaching the best solution. Several works have tried to decrease the number of iterations of the belief propagation method. The proposed HBP technique allows an improvement in both quality of results and processing time compared with the state of the art. The main ideas of the HBP are as follows:

– The confidence measure is used to assign a weight to each node in the graph. At each iteration, messages are passed hierarchically from nodes having a high confidence measure (i.e. high weight) to nodes having a low confidence measure (i.e. small weight). A high weight means a high certainty of the message to be passed. The weights of the nodes are updated after each iteration, so that a subset of nodes is activated to be able to send messages in the next iteration.

– The propagation is first performed inside a consistent color region, and then passed to the neighboring regions. The set of regions is obtained by a color-based segmentation using the MeanShift method Comaniciu & Meer (2002).

– In our framework, the messages are passed differently from the standard BP algorithm. Instead of considering the 4-connected nodes, the k-nearest neighboring nodes are considered. These k-nearest neighboring nodes belong to a 3D support window. We assume that the labels of nodes vary smoothly within a 3D support window centered on the node to be updated.
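The homogeneous color regions used above are, per the chapter, obtained with the MeanShift segmentation of Comaniciu & Meer (2002); as one illustrative (not prescribed) way to realize this step, OpenCV exposes a mean-shift filtering primitive, with the bandwidth values below chosen arbitrarily:

```python
# Illustrative region seeding with OpenCV's mean-shift filtering; the spatial
# bandwidth sp, the color bandwidth sr and the file name are example values.
import cv2

img = cv2.imread("left_view.png")
smoothed = cv2.pyrMeanShiftFiltering(img, sp=10, sr=20)
# Connected components of quasi-constant color in `smoothed` then provide the
# homogeneous regions inside which messages are propagated first.
```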
**5. Experimental results**

#### **5.1 Evaluation of the obstacle extraction module**

The first proposed module is evaluated on real-world data in outdoor environments under various weather conditions. The datasets come from video-surveillance setups in which a stationary camera monitors a real-world scene, such as a level crossing. The four datasets include: "Pontet" and "Chamberonne", two level crossings in Switzerland in cloudy weather, with test images of 384 × 288 pixels; "Pan", a level crossing in France in sunny weather, with test images of 720 × 576 pixels; and a dataset taken in snowy weather at EPFL, Switzerland, whose test images have the same size as those of "Pontet" and "Chamberonne". For quantitative evaluation purposes, 1000 foreground ground-truth images have been obtained by manual segmentation from the "Pontet" and "Chamberonne" datasets. This allows computing the recall and the precision of the detection. In the experiments, the proposed framework is compared with the Mixture Of Gaussians (MOG) and Codebook algorithms. Furthermore, the ICA model is evaluated on
