**3. Low-cost hardware system design and set-up for pavement data acquisition using Kinect v2.0**

The design and set-up of an optimal low-cost imaging system, comprising the hardware platform and peripherals, with an interface for Kinect-computer data acquisition, visualization and storage in both static and dynamic acquisition modes, is illustrated in **Figure 2**; the system is termed the integrated Mobile Mapping Sensor System (iMMSS). The iMMSS is implemented with two main sets of equipment: (i) the Kinect v2.0, for RGB, infrared (IR) and depth data capture, and (ii) a DC-AC power inverter (12 V DC input, 220 V AC/200 W output). The inverter plugs into the car charger port and powers the Kinect sensor in both the static and the continuous pavement data acquisition modes. The iMMSS hardware-software set-up is illustrated in the photo in **Figure 2**. The three main criteria in the field experimentation using the iMMSS are: the shooting angle (vertical or oblique), the shooting distance from the pavement, and the overall target positioning. The sensing device is housed in a sensor rack mounted on the exterior of the wagon. To improve the contrast of the Kinect's projected IR pattern on the road surface against IR radiation reflected from sunlight, an umbrella was used to shade the target area.

In terms of data acquisition in the static and dynamic modes (**Figure 2**), the Kinect sensor captures depth and color images simultaneously at frame rates of up to 30 fps. The integration of the depth and color data results in a colored point cloud.

#### **Figure 2.**

*iMMSS hardware-software set-up for road pavement data capture, visualization and storage using the Kinect sensor.*

In the Kinect v1.0, a known IR pattern is projected onto the scene and the observed pattern is compared with the reference pattern [19]. If there are any obstacles in the way, the IR pattern changes shape, from which the depth values can be deciphered. The Kinect v2.0, however, uses the ToF technique to acquire depth values: the sensor measures the time it takes for the modulated laser pulses from the IR projector to reach the object and return to the IR camera [13]. The RGB resolution of the Kinect v2.0 is 1920 × 1080 pixels, and the IR camera has a resolution of 512 × 424 pixels, with corresponding pixel sizes of 3.1 and 10 μm, respectively. The collection of the (*x*, *y*, *z*) points results in a 3D point cloud.
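The ToF principle described above reduces to a simple range computation. The sketch below is only illustrative: the two relations are the textbook pulsed and continuous-wave ToF formulas, and the 16 MHz modulation frequency in the usage example is a placeholder rather than the Kinect v2.0's actual setting.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def range_from_round_trip(t_seconds: float) -> float:
    """Range from a measured round-trip travel time (pulsed ToF)."""
    return C * t_seconds / 2.0

def range_from_phase(phase_rad: float, mod_freq_hz: float) -> float:
    """Range from the phase shift of amplitude-modulated IR light (CW ToF)."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

print(round(range_from_round_trip(10e-9), 3))   # a 10 ns round trip is ~1.5 m
print(round(range_from_phase(math.pi, 16e6), 2))
```

Note that in the continuous-wave form the phase is only unambiguous up to *c*/(2*f*), which is why multi-frequency modulation is used in practice.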

**(a)**

| Specifications | Microsoft Kinect v1.0 | SoftKinetic DS311 | SoftKinetic DS325 | SwissRanger SR4000 |
|---|---|---|---|---|
| Technology (depth sensor) | Light coding | DepthSense (CAPD ToF) | DepthSense (CAPD ToF) | Time of Flight (ToF) |
| Resolution (RGB) | 640 × 480 or 1280 × 960 | 640 × 480 | 1280 × 720 (HD) | N/A |
| Resolution (depth) | VGA (640 × 480) | QVGA (320 × 240) | QVGA (160 × 120) | 176 × 144 |
| Frame rate (RGB) | 30 | <25 | <25 | N/A |
| Frame rate (depth sensor) | 30 | 25–60 | 25–60 | 50 |
| Range (short) | N/A | N/A | 15 cm–1.5 m | N/A |
| Range (long) | 0.8–4 m | 1.5–4.5 m | N/A | 0.8–8 m |
| Field of view (RGB) | 57.3° × 42° × N/A | 50° × 40° × 60° | 63.2° × 49.3° × 75.2° | N/A |
| Field of view (H × V × D) | 57.5° × 43.5° × N/A | 57.3° × 42° × 73.8° | 74° × 58° × 87° | 43° × 34° × N/A |
| Size (W × H × D) | 27.94 × 7.62 × 7.62 cm | 24 × 5.8 × 4 cm | 10.5 × 3.1 × 2.7 cm | 6.5 × 6.5 × 6.8 cm |
| Power/data connection | USB 2.0 (1) | USB 2.0 (1) | USB 2.0 (1) | Lumberg M8 Male 3-pin |
| Price | \$99 | \$299 | \$249 | \$4295 |

**(b)**

| Parameter specification | Kinect v1.0 | Kinect v2.0 |
|---|---|---|
| Resolution of RGB camera (pixel) | 640 × 480 or 1280 × 960 | 1920 × 1080 |
| Resolution of IR and depth camera (pixel) | 640 × 480 | 512 × 424 |
| Field of view (FOV) of color camera | 62° × 48.6° | 84.1° × 53.8° |
| Field of view (FOV) of IR and depth image | 57.5° × 43.5° | 70.6° × 60° |
| Tilt motor | Yes | No |
| Maximum skeletal tracking | 2 | 6 |
| Method of depth measurement | Structured light | Time-of-Flight (ToF) |
| Depth distance working range | 0.8–4.0 m | 0.5–4.5 m |
| USB | 2.0 | 3.0 |
| Price | \$99 | \$200 |

**Table 1.**

*Comparative specifications of Kinect v1.0 and Kinect v2.0 and other low-cost sensors.*

*Geographic Information Systems in Geospatial Intelligence*

The colored point cloud contains about 300,000 points in every frame. By registering consecutive depth images it is possible to increase the point density and to create a complete point cloud. To realize the full potential of the sensor for mapping applications, an analysis of the systematic and random errors of the data is necessary. The correction of systematic errors is a prerequisite for the alignment of the depth and color data, and relies on the identification of the mathematical model of depth measurement and the calibration parameters involved. The characterization of random errors is useful in the further processing of the depth data, for example in weighting the point pairs or planes in the registration algorithm [20].
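As a minimal illustration of how the (*x*, *y*, *z*) points are obtained, the sketch below back-projects a depth image through an assumed pinhole model of the IR camera; the intrinsics `fx`, `fy`, `cx`, `cy` are placeholder values that would in practice come from calibration, and the tiny 2 × 2 depth "image" stands in for a full 512 × 424 frame.

```python
# Back-projection of a depth image into a 3D point cloud via the
# pinhole model. Intrinsics here are placeholders, not calibrated values.

def depth_to_points(depth, fx, fy, cx, cy):
    """depth: rows of depth values in meters; returns a list of (x, y, z)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:          # z == 0 marks a missing ("hole") pixel
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Tiny 2x2 example: the one invalid pixel is dropped
depth = [[1.0, 0.0],
         [1.0, 2.0]]
pts = depth_to_points(depth, fx=365.0, fy=365.0, cx=0.5, cy=0.5)
print(len(pts))  # 3 valid points
```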

#### **1. Pothole detection and the bias field effect**

Under perfect conditions, potholes tend to have two visual properties: (i) low-intensity areas that are darker than the nearby pavement because of road surface irregularity [21], and (ii) a texture inside the pothole that is coarser than that of the nearby pavement [1, 22]. However, as illustrated in [8, 23], the pothole area is not always darker than the nearby pavement. Furthermore, the irregularity of the road surface produces shadows at pothole boundaries, which are darker than the nearby pavement. These conditions result in the lower accuracy of pothole detection using visual 2D techniques, as reported in [8]. In RGB imagery, pothole detection is influenced by the spill-in and spill-out phenomenon [1, 8], which is typically characterized by similarities between the defect and non-defect features and regions. This results in the corruption of the defect regions on the pavement with a smoothly varying intensity inhomogeneity called the bias field. Bias is inherent to pavement imaging, and is associated with the limitations of the imaging equipment and with pavement surface noise [1, 2].

The bias field in pothole detection can be modeled as a multiplicative component of an observed image that varies spatially because of inhomogeneities, as in Eq. (4).

$$Y\_j = B\_j X\_j + n \tag{4}$$


*On the Use of Low-Cost RGB-D Sensors for Autonomous Pothole Detection with Spatial…*


*DOI: http://dx.doi.org/10.5772/intechopen.88877*


where *Yj* is the measured image at voxel *j*; *Xj* is the true image signal to be restored; *Bj* is an unknown noise or bias field, and *n* is the additive zero-mean Gaussian noise. If Eq. (4) is converted to an additive model by applying a logarithmic transformation, a simplified form is obtained:

$$y\_j = \mathbf{x}\_j + b\_j \tag{5}$$

where *xj* and *yj* are the true and observed log transformed intensities at the *j*th voxel, respectively, and *bj* is the noise or bias field at the *j*th voxel.
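The effect of the logarithmic transformation in Eqs. (4) and (5) can be checked numerically. The sketch below omits the additive noise term *n* for clarity, and the intensity and bias values are arbitrary example numbers.

```python
import math

# Eq. (4): a multiplicative bias field corrupts the true intensity;
# Eq. (5): after a log transform the corruption becomes additive.

X = 120.0   # true intensity at voxel j
B = 1.15    # smoothly varying bias factor at voxel j

Y = B * X                       # Eq. (4): observed intensity (noise omitted)
y = math.log(Y)                 # log-transformed observation
x, b = math.log(X), math.log(B)

print(abs(y - (x + b)) < 1e-12)  # Eq. (5): y_j = x_j + b_j holds exactly
```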

Bias or noise can be corrected using prospective and retrospective methods. Prospective methods for noise minimization aim at avoiding intensity inhomogeneities in the image acquisition process. While prospective methods are capable of correcting intensity inhomogeneity induced by the imaging devices, they are not able to remove object-induced effects. Retrospective methods, in contrast, rely only on the information in the acquired images, and can thus remove intensity inhomogeneities regardless of their sources. The obvious choice for noise minimization is therefore the retrospective methods, which include filtering, surface fitting, histogram, and segmentation-based approaches. Among the retrospective methods, segmentation-based approaches are particularly attractive, as they unify the tasks of segmentation and bias correction in a single framework.

When an observed pixel *yj* is identified as noisy, the neighboring pixels can be used to correct it, since the pixel is expected to be similar to its surrounding pixels. That is, data points with similar feature vectors can be grouped into a single cluster, while data points with dissimilar feature vectors are assigned to different clusters. By using a pre-segmentation clustering algorithm, the Euclidean distance between neighboring pixels is computed and used for the *a priori* clustering. Pixels that produce the lowest distance values to their neighbors are categorized as nearly similar. Two pixels with similar neighboring values are expected to be close to each other, and hence can be clustered together. One way of minimizing noise through clustering is with the *k*-means algorithm, whereby the distance between every point *z*(*j*) and the cluster center *vj* is optimized by calculating the squared Euclidean distance ‖*z*(*j*) − *vj*‖<sup>2</sup>. The value of this distance measure is an indicator of the proximity of the *n* data points to their cluster prototypes. Once the pre-clustering is carried out, a more robust segmentation approach can then be applied to cluster the smoothed pavement image.
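The *k*-means pre-clustering step can be sketched as follows. For brevity the example clusters scalar intensities rather than full feature vectors, and the sample pixel values and initial centers are illustrative.

```python
# Minimal k-means sketch for the pre-clustering step: each pixel is
# assigned to the nearest cluster prototype by squared Euclidean
# distance, then prototypes are re-estimated as cluster means.

def kmeans(values, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for z in values:
            i = min(range(len(centers)), key=lambda i: (z - centers[i]) ** 2)
            clusters[i].append(z)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Dark pothole-like pixels vs. brighter pavement pixels
pixels = [18, 22, 25, 20, 180, 175, 190, 185]
print(sorted(kmeans(pixels, centers=[0.0, 255.0])))  # [21.25, 182.5]
```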

Image segmentation can be performed using different techniques, such as thresholding, clustering, transform-based and texture-based methods [24]. Histogram-based thresholding is the simplest and most often used approach [25]. Many global and local thresholding methods have been developed. While global thresholding segments the entire image with a single threshold derived from the gray-level histogram, local thresholding partitions the image into a number of sub-images and selects a threshold for each sub-image. The global thresholding methods select the threshold based on different criteria, such as Otsu's method [24], minimum error thresholding [26], and the entropic method [27]. These one-dimensional (1D) histogram thresholding methods work well when the gray levels of the two classes are distinct. However, none of the 1D thresholding techniques combine the spatial information and the gray-level information of the pixels in the segmentation process. Thresholding will therefore lead to misclassifications in inherently correlated imagery that is already corrupted by noise and other artifacts.
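As an example of a global method, a compact pure-Python sketch of Otsu's thresholding [24] selects the threshold that maximizes the between-class variance of the gray-level histogram; the sample pixel values are illustrative.

```python
# Otsu's global threshold: sweep all gray levels and keep the one that
# maximizes the between-class variance w0 * w1 * (m0 - m1)^2.

def otsu_threshold(pixels, levels=256):
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0.0
    for t in range(levels):
        w0 += hist[t]             # pixels at or below threshold t
        if w0 == 0:
            continue
        w1 = total - w0           # pixels above threshold t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1   # class means
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal sample: dark defect pixels vs. bright pavement
sample = [10, 12, 14, 11, 200, 205, 198, 202]
print(otsu_threshold(sample))  # falls between the two modes
```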

Real-world images are often ambiguous, with indistinguishable histograms. As such, it is difficult for the classical thresholding techniques to find a criterion of similarity or closeness for optimal thresholding. This ambiguity in image segmentation can be resolved by using fuzzy set theory as a probabilistic global image segmentation approach. In the conventional FCM formulation, each class is assumed to have a uniform value given by its centroid. Each data point is also assumed to be independent of every other data point, and spatial interaction between data points is not considered. However, in image data there is strong correlation between neighboring pixels and, due to intensity non-uniformity artifacts, the data in a class no longer have a uniform value. Thus, to realize meaningful segmentation results, the conventional FCM algorithm has to be modified to take into account both the local spatial continuity between neighboring data and the compensation of intensity non-uniformity artifacts. This chapter illustrates the use of spatial fuzzy *c*-means (SFCM), which incorporates the spatial neighboring information into the standard fuzzy *c*-means, for pothole detection on pavement surfaces.

#### **3.1 Fuzzy** *c-***means clustering with spatial constraints**

FCM is an unsupervised fuzzy clustering algorithm. Conventional clustering algorithms determine a "hard partition" of a given dataset based on criteria that evaluate the goodness of the partition, such that each datum belongs to exactly one cluster. Soft clustering, on the other hand, finds a "soft partition" of the dataset, in which a datum can partially belong to multiple clusters. Soft clustering algorithms generate a soft partition that also forms a fuzzy partition. A type of soft clustering of special interest is one that ensures that the membership degrees of a point *xj* in all clusters add up to one (Eq. (6)), and thus satisfies the constrained soft partition condition.

$$\sum\_{i} \mu\_{c\_i}\left(\mathbf{x}\_j\right) = 1, \quad \forall \mathbf{x}\_j \in \mathbf{X} \tag{6}$$


The fuzzy *c*-means is a clustering method which allows one piece of data to belong to two or more clusters [28, 29]. The standard FCM algorithm treats clustering as an optimization problem in which an objective function must be minimized, and assigns pixels to each category using fuzzy memberships. If *I* = {*xj* ∈ *R<sup>d</sup>*}<sub>*j*=1,…,*N*</sub> is a *p* × *N* data matrix, where *p* represents the dimension of each feature vector *xj* and *N* represents the number of feature vectors (the number of pixels in the image), then the FCM algorithm iteratively minimizes the objective function with respect to the fuzzy membership *U* and the set of cluster centroids *V*, as in Eq. (7).

$$J\_{\rm FCM} = \sum\_{j=1}^{N} \sum\_{i=1}^{c} u\_{ij}^{m} \cdot \left\| \mathbf{x}\_{j} - \mathbf{v}\_{i} \right\|^{2} \tag{7}$$

where *uij* represents the fuzzy membership of pixel *xj* in the *i*th cluster; *v* = (*v*<sub>1</sub>, *v*<sub>2</sub>, …, *v<sub>c</sub>*) is the set of cluster centers; *C* is the number of clusters; *vi* is the *i*th cluster center; ‖·‖ is the Euclidean distance (norm) metric, and *m* is the fuzziness exponent. The parameter *m* controls the fuzziness of the resulting partition, and *m* = 2 is used in this study.

The cost function is minimized when pixels close to the centroid of their cluster are assigned high membership values, and pixels far from the centroid are assigned low membership values. The membership function represents the probability that a pixel belongs to a specific cluster; in the FCM algorithm, this probability depends solely on the distance between the pixel and each individual cluster center in the feature domain. By taking the first derivatives of Eq. (7) with respect to *uij* and *vi*, setting them to zero, and applying the Lagrange multiplier method, the membership functions and cluster centers are updated by the solutions for *uij* and the fuzzy centers *vi*:

$$u\_{ij} = \frac{1}{\sum\_{k=1}^{c} \left( \frac{\left\| \mathbf{x}\_{j} - \mathbf{v}\_{i} \right\|}{\left\| \mathbf{x}\_{j} - \mathbf{v}\_{k} \right\|} \right)^{2/(m-1)}} \tag{8}$$

and

$$\begin{aligned} \boldsymbol{\nu}\_{i} &= \frac{\sum\_{j=1}^{N} \boldsymbol{u}\_{ij}^{m} \boldsymbol{x}\_{j}}{\sum\_{j=1}^{N} \boldsymbol{u}\_{ij}^{m}} \end{aligned} \tag{9}$$
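The alternating updates of Eqs. (8) and (9) can be sketched as follows on scalar intensities, with *m* = 2 as used in the chapter. The sample data, initial centers, and fixed iteration count (used here in place of the center-change stopping rule) are illustrative only.

```python
# Standard FCM iteration: Eq. (8) updates memberships from distances to
# the current centers; Eq. (9) re-estimates each center as the
# membership-weighted mean of the data.

def fcm(data, centers, m=2.0, iters=20):
    c, n = len(centers), len(data)
    for _ in range(iters):
        # Eq. (8): membership of each pixel in each cluster
        u = []
        for x in data:
            d = [abs(x - v) or 1e-12 for v in centers]  # avoid division by zero
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c))
                      for i in range(c)])
        # Eq. (9): centers as membership-weighted means
        centers = [sum(u[j][i] ** m * data[j] for j in range(n)) /
                   sum(u[j][i] ** m for j in range(n))
                   for i in range(c)]
    return centers, u

data = [20, 22, 25, 180, 178, 182]
centers, u = fcm(data, centers=[0.0, 255.0])
print(sorted(round(v, 2) for v in centers))
print(all(abs(sum(row) - 1.0) < 1e-9 for row in u))  # Eq. (6) holds
```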

Starting with an initial guess for each cluster center, the FCM converges to a solution for *vi* representing the local minimum or a saddle point of the cost function. Convergence can be detected by comparing the changes in the membership


function or the cluster center at two successive iteration steps. In an image, as illustrated in [1], neighboring pixels are normally highly correlated: they possess similar feature values, and the probability that they belong to the same cluster is high. The introduction of spatial information is an important cue in resolving the *mixel* problem within a pavement pothole *voxel*. While this spatial relationship is important in clustering, it is not utilized in the standard FCM algorithm. To overcome the effect of noise in the segmentation process, [30] proposed the spatial FCM algorithm, in which spatial information is incorporated into the fuzzy membership functions directly through a spatial function. The spatial information is introduced while updating the membership function *uij* in the iterative FCM algorithm, because the neighborhood pixels possess the same properties as the center pixel. To exploit the spatial information, the spatial function *hij* is defined in Eq. (10).

$$h\_{ij} = \sum\_{k \in NB\left(x\_j\right)} u\_{ik} \tag{10}$$

where *NB*(*xj*) is a local square window centered on pixel *xj* in the spatial domain; in this illustration, a 5 × 5 window is used.

Like the membership function, the spatial function *hij* represents the probability that pixel *xj* belongs to the *i*th cluster. The spatial function of a pixel for a cluster is large if the majority of the pixel's neighborhood belongs to that cluster. The spatial function is used to update the membership function again, and is incorporated into the membership function as presented in Eq. (11) [30].

$$\boldsymbol{u}'\_{ij} = \frac{\boldsymbol{u}^p\_{ij}\boldsymbol{h}^q\_{ij}}{\sum\_{k=1}^c \boldsymbol{u}^p\_{kj}\boldsymbol{h}^q\_{kj}}\tag{11}$$

where *p* and *q* are two parameters that control the relative importance of the membership and spatial functions, respectively.

In a homogeneous region of an image, the spatial functions strengthen the original memberships, and the clustering result remains unchanged. For a noisy pixel, however, this formula reduces the weighting of a noisy cluster by the labels of its neighboring pixels; as a result, misclassified pixels from noisy regions or spurious blobs can easily be corrected. The spatial FCM with parameters *p* and *q* is denoted *SFCM<sub>p,q</sub>*. For *p* = 1 and *q* = 0, *SFCM*<sub>1,0</sub> is identical to the conventional (standard) FCM. In *SFCM<sub>p,q</sub>*, the objective function is not changed; instead, the membership function is updated twice. The first update is the same as in standard FCM, calculating the membership function in the spectral domain. In the second phase, the membership information of each pixel is mapped to the spatial domain, and the spatial function is computed from it: it is defined as the sum of the membership values over the entire neighborhood around the pixel under consideration. The FCM iteration then proceeds with the new membership that incorporates the spatial function. The iteration stops when the maximum difference between the cluster centers at two successive iterations is less than a threshold (= 0.02). After convergence, defuzzification is applied to assign each pixel to the cluster for which its membership is maximal. *SFCM<sub>p,q</sub>* works well for both high- and low-density noise, and can be applied to single- and multiple-feature data. Compared to other FCM-based methods, *SFCM<sub>p,q</sub>* gives superior results without boundary leakage even at high noise densities, provided the *q* value is carefully selected [31].
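A minimal sketch of the spatial update in Eqs. (10) and (11) is given below, assuming *p* = *q* = 1; a 3 × 3 window is used only to keep the toy image small, whereas the chapter uses a 5 × 5 window, and the membership values are illustrative.

```python
# SFCM_{p,q} spatial update: Eq. (10) sums the memberships of cluster i
# over a square window around pixel j; Eq. (11) renormalizes the
# combination u^p * h^q.

def sfcm_update(u, rows, cols, p=1, q=1, win=1):
    c = len(u[0])
    # Eq. (10): spatial function over the (2*win+1)^2 neighborhood
    h = [[0.0] * c for _ in range(rows * cols)]
    for r in range(rows):
        for s in range(cols):
            j = r * cols + s
            for dr in range(-win, win + 1):
                for ds in range(-win, win + 1):
                    rr, ss = r + dr, s + ds
                    if 0 <= rr < rows and 0 <= ss < cols:
                        for i in range(c):
                            h[j][i] += u[rr * cols + ss][i]
    # Eq. (11): renormalized combination of membership and spatial term
    u_new = []
    for j in range(rows * cols):
        num = [u[j][i] ** p * h[j][i] ** q for i in range(c)]
        z = sum(num)
        u_new.append([v / z for v in num])
    return u_new

# 3x3 image: the center pixel is a noisy outlier whose membership is
# pulled toward the label of its homogeneous neighborhood
u = [[0.9, 0.1]] * 4 + [[0.2, 0.8]] + [[0.9, 0.1]] * 4
fixed = sfcm_update(u, rows=3, cols=3)
print(fixed[4][0] > 0.5)   # noisy pixel reassigned to the majority cluster
```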

"soft partition" of a given dataset. And in "soft partition", the datum can partially belong to multiple clusters. Soft clustering algorithms do generate a soft partition that also forms fuzzy partition. A type of soft clustering of special interest is one that ensures membership degree of point *xj* in all clusters adding up to one (Eq. (6)), and also satisfies the constrained soft partition condition.

The fuzzy *c*-means is a clustering method which allows one piece of data to belong to two or more clusters [28, 29]. The standard FCM algorithm considers the clustering as an optimization problem where an objective function must be minimized, and assigns pixels to each category by using fuzzy memberships. If *I* ¼

"feature" vector, and *N* represents the number of feature vectors (pixel numbers in the image), then the FCM algorithm is an iterative optimization that iteratively

> X*c i*¼1 *um*

where *uij* represents the fuzzy membership of pixel *xj* in the *i*th cluster and *u* ¼

cluster center; k k� is a Euclidean distance or the norm metric, and *m* is a constant for fuzziness exponent. The parameter *m* controls the fuzziness of the resulting partition or the fuzziness of the consequential partition, and *m* ¼ 2 is used in this study. The cost function is minimized when pixels close to the centroid of their clusters are assigned high membership values, and low membership values are assigned to pixels with data far from the centroid. The membership function represents the probability that a pixel belongs to a specific cluster. In the FCM algorithm, the probability is dependent solely on the distance between the pixel and each individual cluster center in the feature domain. By minimizing Eq. (7) using the first derivatives with respect to *uij* and *vi* then setting them to zero using the Lagrange method, the membership functions and cluster centers are updated by solutions of

*<sup>j</sup>*¼1*,*…*,N* is a *<sup>p</sup>* � *<sup>N</sup>* data matrix, where, *<sup>p</sup>* represents the dimension of each *xj*

*ij* � *xj* � *vi* � � � �

� � <sup>¼</sup> <sup>1</sup>*,* <sup>∀</sup>*xj* <sup>∈</sup>*<sup>X</sup>* (6)

*U*0

<sup>2</sup> (7)

*C*<sup>0</sup> is the number of clusters; *vi*is the *i*th

� �<sup>2</sup>*=*ð Þ *<sup>m</sup>*�<sup>1</sup> (8)

, and set of

(9)

X *i*

*Geographic Information Systems in Geospatial Intelligence*

*xj* ∈*R<sup>d</sup>* � �

cluster centroids, <sup>0</sup>

*V*0

ð Þ *u*1*; u*2*;* …*; uc* are the set of cluster centers; <sup>0</sup>

*uij* and the fuzzy centers *vi*:

and

**152**

as in Eq. (7).

*μci xj*

minimizes the objective function, with respect to fuzzy membership <sup>0</sup>

*j*¼1

*uij* <sup>¼</sup> <sup>1</sup> P*c k*¼1

*vi* ¼

Convergence can be detected by comparing the changes in the membership

k k *xj*�*vi* k k *xj*�*vk*

> P *N j*¼1 *um ij xj*

> > P *N j*¼1 *um ij*

Starting with an initial guess for each cluster center, the FCM converges to a solution for *vi* representing the local minimum or a saddle point of the cost function.

*JFCM* <sup>¼</sup> <sup>X</sup> *N*
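The alternating minimization of Eqs. (7)–(9) can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation; the quantile-based initial centers are an assumed starting guess:

```python
import numpy as np

def fcm(x, c=2, m=2.0, max_iter=100, tol=0.02):
    """Standard FCM on an (N, p) feature matrix `x`.
    Alternates Eq. (8) (membership update) and Eq. (9) (center
    update) until the centers change by less than `tol`."""
    v = np.quantile(x, np.linspace(0.1, 0.9, c), axis=0)  # assumed init
    for _ in range(max_iter):
        # distances ||x_j - v_i||, shape (c, N); epsilon avoids /0
        d = np.linalg.norm(x[None, :, :] - v[:, None, :], axis=2) + 1e-12
        # Eq. (8), written in the equivalent normalized form
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=0, keepdims=True)
        # Eq. (9): centers as membership-weighted means
        um = u ** m
        v_new = (um @ x) / um.sum(axis=1, keepdims=True)
        if np.abs(v_new - v).max() < tol:
            v = v_new
            break
        v = v_new
    return u, v
```

For image segmentation, `x` would be the N × p matrix of per-pixel feature vectors (e.g., intensities), and defuzzification assigns each pixel to `u.argmax(axis=0)`.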

#### **3.2 Depth image data smoothing and hole-filling**

To correctly analyze and potentially combine the RGB image with the depth data, the spatial alignment of the RGB and depth camera outputs is necessary. Additionally, the raw depth data are very noisy, and many pixels in the image may have no depth due to multiple reflections, transparent objects, or scattering on certain nearby surfaces. As such, the inaccurate and/or missing depth data (holes) need to be recovered prior to data processing. The recovery is conducted through application-specific camera recalibration and/or depth data filtering. This section deals with the depth data filtering first; camera calibration is discussed in the next subsection. By enhancing the depth image using the color image, the following issues are addressed: (i) due to various environmental factors, specular reflections, or simply the device range, there are regions of missing data in the depth map; (ii) the accuracy of the pixel values in the depth image is low and the noise level is high, mostly along depth edges and object boundaries, which is exactly where such information is most valuable; (iii) despite the calibration, the depth and color images are still not aligned well enough. They are acquired by two close, but not identical, sensors that may also differ in their internal camera properties (e.g., focal length). This misalignment leads to small projection differences, and again these small errors are most noticeable along edges; and (iv) the depth image usually has a lower resolution than the color image, and therefore it should be up-sampled in a consistent manner.


Because of the limitations in the depth measuring principle and in object surface properties, the depth image from the Kinect inevitably contains optical noise and unmatched edges, together with holes or invalid pixels, which make it unsuitable for direct application [32]. To remove noise from the depth image, the joint bilateral filter is preferred, since it has the advantage of preserving edges while removing noise; it analyzes every image pixel and replaces each pixel with the median of the pixels in the corresponding filter region *R*. This process can be expressed according to Eq. (12).

$$I'(u, v) \to \operatorname{median}\{I(u+i, v+j) | (i, j) \in R\}\tag{12}$$

where (*u*, *v*) is the position of the image pixel and (*i*, *j*) ranges over the neighborhood region *R*, specified as a two-element numeric vector of positive integers. By using median filtering, each output pixel contains the median value in the *i* × *j* neighborhood around the corresponding pixel in the input image.
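Eq. (12) can be rendered directly as a short sketch; in practice an optimized routine such as `scipy.ndimage.median_filter` would normally be used, but a plain NumPy version makes the mechanics explicit:

```python
import numpy as np

def median_filter(img, half=1):
    """Eq. (12): replace each pixel (u, v) by the median of the
    (2*half+1) x (2*half+1) region R centered on it (edge-padded)."""
    H, W = img.shape
    p = np.pad(img, half, mode="edge")
    # one shifted copy of the image per offset (i, j) in R
    stack = [p[half + i: half + i + H, half + j: half + j + W]
             for i in range(-half, half + 1)
             for j in range(-half, half + 1)]
    return np.median(np.stack(stack), axis=0)
```

A single invalid spike in an otherwise smooth depth region is removed outright, which is the robustness-to-outliers property exploited later in the hole-filling discussion.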

In filling holes in depth images: (i) [33] used a bilateral filter and a median filter in the temporal domain; (ii) [34] proposed a joint bilateral filter and a Kalman filter for depth map smoothing and for reducing random fluctuations in the time domain. Jung [35] proposed a modified version of the joint trilateral filter (JTF) that uses both depth and color pixels to estimate a filter kernel, assuming that no holes are present. Liu et al. [36] employed an energy minimization method with a regularization term to fill the depth holes and remove the noise in depth images; the linear regression model utilized was based on both depth values and pixel colors. From these studies, it is noted that the methods are primarily based on different types of filters that smooth noise in depth images and fill holes by using color images to guide the process.

Introduced by [37], the bilateral filter is a robust edge-preserving filter with two filter kernels: a spatial filter kernel and a range filter kernel, which are traditionally based on a Gaussian distribution, for measuring the spatial and range distance between the center pixel and its neighbors, respectively [38].

*On the Use of Low-Cost RGB-D Sensors for Autonomous Pothole Detection with Spatial… DOI: http://dx.doi.org/10.5772/intechopen.88877*

By letting $I\_{\mathbf{x}}$ be the color at pixel **x**, and $I\_X^I$ be the filtered value, it is desired for $I\_X^I$ to be:

$$I\_X^I = \frac{\sum\_{\mathbf{y} \in N(\mathbf{x})} f\_S(\mathbf{x}, \mathbf{y}) \cdot f\_R\left(I\_{\mathbf{x}}, I\_{\mathbf{y}}\right) \cdot I\_{\mathbf{y}}}{\sum\_{\mathbf{y} \in N(\mathbf{x})} f\_S(\mathbf{x}, \mathbf{y}) \cdot f\_R\left(I\_{\mathbf{x}}, I\_{\mathbf{y}}\right)}\tag{13}$$

where **y** is a pixel in the neighborhood *N(***x***)* of pixel **x**, and

$$f\_S(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma\_S^2}\right), \qquad f\_R\left(I\_{\mathbf{x}}, I\_{\mathbf{y}}\right) = \exp\left(-\frac{\|I\_{\mathbf{x}}-I\_{\mathbf{y}}\|^2}{2\sigma\_R^2}\right)$$

are the spatial and range filter kernels measuring the spatial and range/color similarities. The parameter *σ<sub>S</sub>* defines the size of the spatial neighborhood used to filter a pixel, and *σ<sub>R</sub>* controls how much an adjacent pixel is down-weighted because of the color difference.
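Eq. (13) can be written out directly as a (deliberately unoptimized) sketch for a grayscale image normalized to [0, 1]; the `sigma_s`, `sigma_r`, and `radius` values below are illustrative assumptions:

```python
import numpy as np

def bilateral_filter(img, sigma_s=2.0, sigma_r=0.1, radius=2):
    """Eq. (13): each output pixel is a weighted average of its
    neighbors, weighted by spatial closeness (f_S) and color
    similarity (f_R). `img` is assumed normalized to [0, 1]."""
    H, W = img.shape
    out = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            patch = img[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            f_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2)
                         / (2 * sigma_s ** 2))
            f_r = np.exp(-(patch - img[y, x]) ** 2 / (2 * sigma_r ** 2))
            w = f_s * f_r
            out[y, x] = (w * patch).sum() / w.sum()
    return out
```

A small *σ<sub>R</sub>* keeps pixels across a step edge from contributing to each other, which is the edge-preserving behavior the text describes.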

The limitation of the conventional bilateral filter is that it can interpret impulse noise spikes as forming an edge. A joint or cross bilateral filter [39, 40] is similar to the conventional bilateral filter, except that the range filter kernel *f<sub>R</sub>*(·) is computed from another image, called the *guidance image*. The guidance image *J* indicates where similar pixels are located in each neighborhood. With *J* as the guidance image, the joint bilateral filtered value at pixel **x** is determined as in Eq. (14).

$$I\_X^l = \frac{\sum\_{\mathcal{Y} \in N(\mathbf{x})} f\_S(\mathbf{x}, \boldsymbol{\mathcal{Y}}) f\_R\left(\boldsymbol{J}\_{\mathbf{x}}, \boldsymbol{J}\_{\mathbf{y}}\right) I\_{\mathbf{y}}}{\sum\_{\mathcal{Y} \in N(\mathbf{x})} f\_S(\mathbf{x}, \boldsymbol{\mathcal{Y}}) f\_R\left(\boldsymbol{J}\_{\mathbf{x}}, \boldsymbol{J}\_{\mathbf{y}}\right)}\tag{14}$$

It is important to note that the joint bilateral filter forces the texture of the filtered image $I^J$ to follow the texture of the guidance image *J*. In the implementation in this paper, the image intensity was normalized to the range [0, 1], and the image coordinates were normalized so that **x** and **y** also reside in [0, 1].

With this depth hole filling based on the bilateral filter, the depth value at each pixel in an image is replaced by a weighted average of depth values from nearby pixels. While the joint bilateral filter has been demonstrated to be very effective for color image upsampling, if it is directly applied to a depth image with a registered RGB color image as the guidance image, the texture of the guidance image (that is independent of the depth information) is likely to be introduced to the upsampled depth image, and the upsampling errors mainly reside in the texture transferring property of the joint bilateral filter [38]. Meanwhile, the median filtering operation minimizes the sum of the absolute error of the given data [41], and is much more robust to outliers than the bilateral filter. A possible solution to the "hole-filling" problem in depth imagery is to focus on the combination of the median operation with the bilateral filter so that the texture influence can be better suppressed while maintaining the edge-preserving property [42].
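As an illustration of how Eq. (14) supports hole filling, the sketch below computes the range kernel on a registered color guidance image and excludes invalid (zero) depth samples from the weighted average. This is an assumed minimal NumPy version for exposition, not the authors' implementation:

```python
import numpy as np

def joint_bilateral_fill(depth, guide, sigma_s=2.0, sigma_r=0.1, radius=3):
    """Eq. (14) with hole handling: the spatial kernel f_S comes from
    pixel positions, the range kernel f_R from the color guidance
    image, and only valid (non-zero) depth samples contribute."""
    H, W = depth.shape
    valid = depth > 0
    out = depth.astype(float).copy()
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            v = valid[y0:y1, x0:x1]
            if not v.any():
                continue  # no valid depth nearby; leave the hole
            d = depth[y0:y1, x0:x1].astype(float)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            f_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2)
                         / (2 * sigma_s ** 2))
            f_r = np.exp(-(guide[y0:y1, x0:x1] - guide[y, x]) ** 2
                         / (2 * sigma_r ** 2))
            w = f_s * f_r * v
            out[y, x] = (w * d).sum() / w.sum()
    return out
```

Replacing the weighted average with a weighted median of the valid samples would give the median-bilateral combination suggested in [42]; the averaging form is kept here for brevity.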

#### **3.3 Calibration of RGB and IR Kinect cameras**

Despite the fact that the Kinect, like other off-the-shelf sensors, is calibrated during manufacturing and the camera parameters are stored in the device's memory, this calibration information is not accurate enough for reconstructing 3D information from which a highly precise cloud of 3D points is to be obtained. Furthermore, the manufacturer's calibration does not correct the depth distortion, and is thus incapable of recovering the missing depth [43]. Using a 9 × 8
