The evaluation is carried out using different color spaces and different color constancy methods. The obtained results are also compared with those obtained from gray-scale images. The algorithms are implemented in Visual Studio C++ 2008, using the OpenCV and IT++ libraries. The four datasets are shown in Figure 5.

Fig. 5. The four datasets used for the evaluation: (a) pontet: a level crossing in Lausanne, Switzerland; (b) chamberonne: a level crossing in Lausanne, Switzerland; (c) EPFL–Parking; (d) pan: a level crossing in France.

**5.1.1 Quantitative evaluation**

The performance of the proposed scheme is compared to two of the best-known techniques, the pixel-wise Gaussian Mixture Model (GMM) and Codebook. These algorithms are chosen because their algorithmic complexity is close to that of ours. The quantitative comparison is performed by comparing the segmentation results of each algorithm with a ground truth, established by manually segmenting the moving and stationary foreground objects in the image sequence. The evaluation criterion is expressed by the *Recall* and *Precision* measures, which describe how well the segmented image matches the corresponding ground truth. *Recall* measures the ability of an algorithm to detect the true foreground pixels, while *Precision* is an intrinsic criterion that indicates the accuracy of the detection. These measures are expressed in terms of true positives *Tp*, true negatives *Tn*, false positives *Fp* and false negatives *Fn*. *Recall* and *Precision* are obtained by Equations 23 and 24:

$$Recall = \frac{T_p}{T_p + F_n} \tag{23}$$

$$Precision = \frac{T_p}{T_p + F_p} \tag{24}$$

where *Tp* is the number of pixels correctly classified as foreground with respect to the ground truth, *Fn* is the number of pixels classified as background although they are actually foreground pixels in the ground truth, and *Fp* is the number of pixels classified as foreground although they are actually background pixels. *Tp* + *Fn* can be seen as the number of true foreground pixels given by the ground truth, while *Tp* + *Fp* is the number of foreground pixels classified by a given algorithm. The image samples used for computing these two measures are taken from the *Pontet* and *Chamberonne* datasets: five hundred images from each dataset are manually segmented into foreground objects, yielding a ground-truth dataset against which the different algorithms are evaluated. Table 1 shows the quantitative evaluation of the foreground extraction process, given by the *Recall* and *Precision* measures:


Table 1. Quantitative evaluation given by *Recall* and *Precision* measures.
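As an illustration, the following minimal sketch computes the two measures from a pair of binary masks with OpenCV. The function name and the 8-bit mask convention (255 = foreground) are assumptions for this sketch, not taken from the original implementation:

```cpp
// Minimal sketch of the Recall/Precision computation (Eqs. 23-24),
// assuming 8-bit binary masks (255 = foreground, 0 = background) and a
// ground truth that contains at least one foreground pixel.
#include <opencv2/core.hpp>

struct Scores { double recall; double precision; };

Scores evaluateSegmentation(const cv::Mat& segmented, const cv::Mat& groundTruth)
{
    CV_Assert(segmented.size() == groundTruth.size());
    CV_Assert(segmented.type() == CV_8UC1 && groundTruth.type() == CV_8UC1);

    // Tp: foreground in both masks; Fp: foreground only in the result;
    // Fn: foreground only in the ground truth.
    double tp = cv::countNonZero(segmented & groundTruth);
    double fp = cv::countNonZero(segmented & ~groundTruth);
    double fn = cv::countNonZero(~segmented & groundTruth);

    Scores s;
    s.recall    = tp / (tp + fn);   // Eq. 23
    s.precision = tp / (tp + fp);   // Eq. 24
    return s;
}
```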

A visual comparison of our method with two other methods from the literature is given in Figure 6.

Fig. 6. Visual comparison of our algorithm with other methods: (a) original images, (b) ground truth, (c) change detection obtained with our method, (d) with the MOG method and (e) with the Codebook method.

The proposed framework runs on a personal computer with an Intel 32-bit 3.1-GHz processor. On the Pontet dataset, the proposed algorithm runs at 13 fps (frames per second). The processing time of our algorithm is compared with those of the MOG and Codebook algorithms. Table 2 shows that our algorithm is faster than the other two.


Table 2. Processing time on the Pontet dataset.
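Per-frame figures of this kind are typically obtained by averaging the processing time over a sequence. Below is a minimal timing harness of that form, with a placeholder `processFrame()` standing in for the full pipeline; the frame count is illustrative:

```cpp
// Sketch of a per-frame timing harness used to obtain fps figures.
#include <chrono>
#include <cstdio>

void processFrame(int /*frameIndex*/)
{
    // Stands in for foreground extraction on one frame.
}

int main()
{
    const int numFrames = 500;  // illustrative sequence length
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numFrames; ++i)
        processFrame(i);
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("%.1f fps (%.2f ms/frame)\n",
                numFrames / seconds, 1000.0 * seconds / numFrames);
    return 0;
}
```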

#### **5.2 Evaluation of the 3D localization module**

The proposed depth estimation for 3D localization algorithm is first evaluated on the Middlebury stereo benchmark (http://www.middlebury.edu/stereo), using the Tsukuba, Venus, Teddy and Cones standard datasets. The evaluation concerns non-occluded regions (nonocc), all regions (all) and depth-discontinuity regions (disc). In the first step of our algorithm, the WACD likelihood function is performed on all the pixels. Applying the *winner-take-all* strategy, a label corresponding to the best estimated disparity is attributed to each pixel. The second step consists in selecting a subset of pixels according to their confidence measure. Indeed, the pixels having a low confidence measure belong to either occluded or textureless regions. The subset corresponding to the well-matched pixels is then taken as the starting point of the hierarchical belief propagation module.
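To make these two steps concrete, the sketch below applies a winner-take-all selection over a per-pixel likelihood and flags well-matched pixels. It is a minimal illustration, not the chapter's implementation: `likelihood` stands in for the WACD function (not reproduced here), and the margin between the best and second-best scores is one common confidence measure; the chapter's own confidence function may differ. The 60% threshold is the one used later in the text.

```cpp
// Winner-take-all disparity selection with a margin-based confidence.
// Assumes likelihood(x, y, d) >= 0 and maxDisparity >= 1.

struct PixelDisparity {
    int disparity;      // winner-take-all label
    double confidence;  // in [0, 1]
    bool wellMatched;   // kept as a seed for hierarchical belief propagation
};

PixelDisparity selectDisparity(int x, int y, int maxDisparity,
                               double (*likelihood)(int, int, int))
{
    double best = -1.0, secondBest = -1.0;
    int bestD = 0;
    for (int d = 0; d <= maxDisparity; ++d) {
        double score = likelihood(x, y, d);
        if (score > best) { secondBest = best; best = score; bestD = d; }
        else if (score > secondBest) { secondBest = score; }
    }
    if (secondBest < 0.0) secondBest = 0.0;

    PixelDisparity p;
    p.disparity = bestD;
    // A best score barely above the runner-up indicates ambiguity,
    // typical of occluded or textureless regions.
    p.confidence = (best > 0.0) ? 1.0 - secondBest / best : 0.0;
    p.wellMatched = p.confidence > 0.6;  // 60% threshold, as in the text
    return p;
}
```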

Quantitatively, our method was compared to several other methods from the literature. These methods are H-Cut Miyazaki et al. (2009), Max-Product Felzenszwalb & Huttenlocher (2006) and PhaseBased El-Etriby et al. (2007). Table 3 provides quantitative comparison results between all four methods. This table shows the percentage of pixels incorrectly matched for the non-occluded pixels (nonocc), the depth-discontinuity pixels (disc), and for all the matched pixels (all). More specifically, the proposed method is better for Tsukuba in "all" and "disc" pixels, in Venus for "disc" pixels and in Cones for "all" pixels.

| *Algorithm* | Tsukuba nonocc | Tsukuba all | Tsukuba disc | Venus nonocc | Venus all | Venus disc | Teddy nonocc | Teddy all | Teddy disc | Cones nonocc | Cones all | Cones disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H-Cut | 2.85 | 4.86 | 14.4 | 1.73 | 3.14 | 20.2 | 10.7 | 19.5 | 25.8 | 5.46 | 15.6 | 15.7 |
| **Proposed** | **4.87** | **5.04** | **8.47** | **3.42** | **3.99** | **10.5** | **17.5** | **20.8** | **28.0** | **7.46** | **12.5** | **13.3** |
| Max-Product | 1.88 | 3.78 | 10.1 | 1.31 | 2.34 | 15.7 | 24.6 | 32.4 | 34.7 | 21.2 | 28.5 | 30.1 |
| PhaseBased | 4.26 | 6.53 | 15.4 | 6.71 | 8.16 | 26.4 | 14.5 | 23.1 | 25.5 | 10.8 | 20.5 | 21.2 |

Table 3. Algorithm evaluation on the Middlebury dataset (percentage of incorrectly matched pixels).
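The figures in Table 3 follow the Middlebury convention: a pixel counts as incorrectly matched when its estimated disparity differs from the ground truth by more than a tolerance (commonly one disparity level), scored over a region mask (nonocc, all or disc). A minimal sketch, assuming CV_32FC1 disparity maps and a CV_8UC1 region mask; the function name is illustrative:

```cpp
// Percentage of bad pixels within a region mask (Middlebury-style metric).
#include <cmath>
#include <opencv2/core.hpp>

double badPixelPercentage(const cv::Mat& estimated, const cv::Mat& groundTruth,
                          const cv::Mat& regionMask, double tolerance = 1.0)
{
    CV_Assert(estimated.size() == groundTruth.size());
    CV_Assert(estimated.size() == regionMask.size());

    int bad = 0, total = 0;
    for (int y = 0; y < estimated.rows; ++y) {
        for (int x = 0; x < estimated.cols; ++x) {
            if (regionMask.at<uchar>(y, x) == 0)
                continue;  // pixel outside the evaluated region
            ++total;
            double diff = std::abs(estimated.at<float>(y, x) -
                                   groundTruth.at<float>(y, x));
            if (diff > tolerance)
                ++bad;
        }
    }
    return total > 0 ? 100.0 * bad / total : 0.0;
}
```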



Figure 7 illustrates an example of two objects extracted from the Cones and Teddy images, respectively. The face extracted from Cones corresponds to a non-occluded region, while the teddy bear corresponds to a depth discontinuity. This shows that the propagation of disparities preserves the discontinuity between regions and gives good accuracy in terms of matching pixels in the non-occluded regions.


Fig. 7. Different steps of our algorithm in different types of regions. (a) Left image (b) Segmented face and teddy bear extracted from the Cones and Teddy images, respectively, using Mean Shift (c) Dense disparity map obtained using WACD (d) Sparse disparity map corresponding to the well-matched pixels, with 60% confidence threshold (e) Dense disparity map after performing the HBP (f) Corresponding ground truth.

The disparity allows estimating the 3-D position and spatial occupancy rate of each segmented object. The transformation of an image plane point *p* = {*x*, *y*} into a 3-D reference system point *P* = {*X*,*Y*, *Z*} must be performed. The distance of an object point is calculated by triangulation, assuming parallel optical axes:

$$Z = \frac{b \cdot f}{d} \tag{25}$$

where

– *Z* is the depth, i.e. the distance between the sensor camera and the object point along the Z axis,

– *f* is the focal length, i.e. the distance between the lens and the sensor, assumed identical for both cameras,

– *b* is the baseline, i.e. the distance separating the cameras,

– *d* is the estimated disparity.
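As a worked illustration of Equation 25, the sketch below converts disparities to depths. The 0.4 m baseline matches the stereo rig described below; the focal length in pixels is an assumed placeholder, since the chapter gives the lens focal length (5.3 mm) but not the pixel pitch:

```cpp
// Triangulation with parallel optical axes (Eq. 25): Z = b * f / d.
#include <cstdio>

// Depth in meters from a baseline in meters, a focal length in pixels
// and a disparity in pixels.
double depthFromDisparity(double baselineMeters, double focalPixels,
                          double disparityPixels)
{
    return baselineMeters * focalPixels / disparityPixels;
}

int main()
{
    const double b = 0.4;    // baseline between optical axes (m), as in the text
    const double f = 800.0;  // assumed focal length (px), for illustration only
    for (double d : {8.0, 16.0, 32.0})  // example disparities (px)
        std::printf("d = %4.1f px -> Z = %5.1f m\n",
                    d, depthFromDisparity(b, f, d));
    return 0;
}
```

With these assumed values, a disparity of 16 pixels corresponds to a depth of 20 m, the order of magnitude of the distances involved in the level-crossing setup.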


The proposed 3D localization algorithm is evaluated on one-hour-long image sequences acquired at two real level crossings. We used a system composed of two Sony DXC-390/390P 3-CCD cameras, each fitted with a Cinegon 3 CCD 5.3 mm FL lens. The cameras are fixed on a metal support 1.5 meters high, and the distance between their optical axes is fixed at 0.4 meter. The whole setup is placed around 20 meters away from the dangerous zone. Figure 8 illustrates an example of two pedestrians extracted by the stICA algorithm from the left-hand image. Applying the WACD local stereo matching algorithm yields a first disparity map (image b), which contains many matching errors. Most of them are identified by applying the confidence measure function: only the pairs of matched pixels whose confidence measure is higher than a 60% threshold are kept (image c). These retained pixels are taken as a starting point for the belief propagation algorithm, which estimates the disparity of the remaining pixels (image d). This example shows the accuracy of the 3D localization in a case of occlusion, knowing that the two pedestrians are at two different distances from the cameras.

Fig. 8. 3D localization of two pedestrians partially occluded. (a) pedestrians extracted by stICA, (b) first disparity map obtained with the WACD algorithm, (c) sparse disparity map obtained after applying the confidence measure function, (d) final disparity map obtained with the selective belief propagation algorithm.
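The role of the well-matched seeds can be illustrated with a deliberately simplified stand-in for the selective belief propagation step: low-confidence pixels inherit the disparity of an already-known 4-neighbor until the map is dense. This only sketches the idea of growing from confident seeds; the chapter's hierarchical belief propagation instead passes messages over full disparity distributions:

```cpp
// Simplified seed-growing fill of a sparse disparity map (illustrative
// stand-in, NOT the chapter's hierarchical belief propagation).
#include <vector>

void propagateFromSeeds(std::vector<float>& disparity,  // row-major map
                        std::vector<bool>& known,       // seeds: confidence > 60%
                        int width, int height)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                int i = y * width + x;
                if (known[i]) continue;
                // Copy the disparity of any already-known 4-neighbor.
                const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
                for (int k = 0; k < 4; ++k) {
                    int nx = x + dx[k], ny = y + dy[k];
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                        continue;
                    int j = ny * width + nx;
                    if (known[j]) {
                        disparity[i] = disparity[j];
                        known[i] = true;
                        changed = true;
                        break;
                    }
                }
            }
        }
    }
}
```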

Figure 9 shows some examples of obstacles extracted and localized with the proposed algorithms (from (a) to (d)). The first column corresponds to the left-hand images acquired from the left camera. The middle column shows the first disparity map obtained with the WACD stereo matching algorithm. The last column corresponds to the final disparity map, which allows localizing all the obstacles in 3D space.


Fig. 9. 3D localization steps of a given scenario. (A) left-hand image, (B) dense disparity map obtained by applying WACD correlation function, (C) final disparity map obtained after applying the Selective Belief Propagation.

**References**

Cvejic, N., Bull, D. & Canagarajah, N. (2007). Improving fusion of surveillance images in sensor networks using independent component analysis, *IEEE Trans. on Consumer Electronics* 53(3): 1029–1035.

Delfosse, N. & Loubaton, P. (1995). Infomax and maximum likelihood for sources separation, *IEEE Letters on Signal Processing*, Vol. 45, pp. 59–83.

Dun, B. V., Wouters, J. & Moonen, M. (2007). Improving auditory steady-state response detection using independent component analysis on multichannel EEG data, *IEEE Trans. on Biomedical Engineering* 54(7): 1220–1230.

El-Etriby, S., Al-Hamadi, A. & Michaelis, B. (2007). Dense stereo correspondence with slanted surface using phase-based algorithm, *IEEE International Symposium on Industrial Electronics*.

Elgammal, A., Harwood, D. & Davis, L. (2000). Non-parametric model for background subtraction, *ECCV*.

Fakhfakh, N., Khoudour, L., El-Koursi, E., Bruyelle, J.-L., Dufaux, A. & Jacot, J. (2011). 3D objects localization using fuzzy approach and hierarchical belief propagation: application at level crossings, *EURASIP Journal on Image and Video Processing* 2011(4): 1–15.

Fakhfakh, N., Khoudour, L., El-Koursi, E., Jacot, J. & Dufaux, A. (2010). A video-based object detection system for improving safety at level crossings, *Open Transportation Journal* 5: 1–15.

Fakhfakh, N., Khoudour, L., El-Koursi, M., Jacot, J. & Dufaux, A. (2009). A new selective confidence measure-based approach for stereo matching, *International Conference on Knowledge-Based and Intelligent Information and Engineering Systems*, Springer-Verlag Berlin Heidelberg, Vol. 5711, Santiago, Chile, pp. 184–191.

Felzenszwalb, P. & Huttenlocher, D. (2006). Efficient belief propagation for early vision, *International Journal of Computer Vision (IJCV)* 70(1): 41–54.

Foggia, P., Jolion, J., Limongiello, A. & Vento, M. (2007). Stereo vision for obstacle detection: a graph-based approach, *Lecture Notes in Computer Science*, Springer Berlin/Heidelberg, pp. 37–48.

Foresti, G. (1998). A real-time system for video surveillance of unattended outdoor environments, *IEEE Transactions on Circuits and Systems for Video Technology* 8(6): 697–704.

Griffioen, E. (2004). Improving level crossings using findings from human behaviour studies, *Proc. of 8th Inter. Level Crossing Symposium*.

Kim, K., Chalidabhongse, T., Harwood, D. & Davis, L. (2005). Real-time foreground-background segmentation using codebook model, *Journal of Real-Time Imaging, Special Issue on Video Object Processing* 11(3): 172–185.

Lee, C. & Ho, Y. (2008). Disparity estimation using belief propagation for view interpolation, *ITC-CSCC*, Japan, pp. 21–24.

McKeown, M., Makeig, S., Brown, G., Jung, T., Kindermann, S., Bell, A. & Sejnowski, T. (1998). Analysis of fMRI data by blind separation into independent spatial components, *Human Brain Mapping* 6(3): 160–188.

Miyazaki, D., Matsushita, Y. & Ikeuchi, K. (2009). Interactive shadow removal from a single image using hierarchical graph cut, *Asian Conference on Computer Vision (ACCV)*.

Nelson, A. (2002). The UK approach to managing risk at passive level crossings, *Inter. Symposium on RailRoad-Highway Grade Crossing Research and Safety, 7th*.
