In order to be able to produce this fusion, data from the deeper layers are upsampled using deconvolution. Notice that data from the shallower layers will be semantically coarser but contain more spatial information. Thus, up to three different levels can be processed through the FCN, depending on how many layers are deconvoluted and fused, as seen in **Figure 5**.

More information on the detailed workings of the different FCN models can be found in [25]. It is still worth noting that the more shallow layers are fused, the more accurate the model becomes, but, according to the literature, the gain from FCN16 to FCN8 is minimal (below 2%).

**Figure 4.**
*Diagram of the AlexNet architecture, showcasing its pioneering use of convolutional layers.*

**Figure 5.**
*Detail of the skip architectures (FCN32, FCN16, and FCN8) used to produce results with data from several layers to recover both deep features and spatial information from shallow layers (courtesy of [25]).*
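The fusion just described can be sketched compactly. The following PyTorch module is an illustrative reconstruction, not the original code from [25]; it assumes feature maps `f3`, `f4`, `f5` taken at strides 8, 16, and 32, with `c3`, `c4`, `c5` channels respectively:

```python
import torch.nn as nn

class SkipFusionHead(nn.Module):
    """Minimal FCN-8-style head: deep, spatially coarse score maps are
    upsampled by deconvolution and summed with scores computed from
    shallower, spatially finer feature maps."""

    def __init__(self, c3, c4, c5, num_classes=2):
        super().__init__()
        # 1x1 convolutions turn each feature map into per-class scores.
        self.score3 = nn.Conv2d(c3, num_classes, 1)
        self.score4 = nn.Conv2d(c4, num_classes, 1)
        self.score5 = nn.Conv2d(c5, num_classes, 1)
        # Learned 2x upsampling (deconvolution) between fusion steps.
        self.up5 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up4 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        # Final 8x upsampling back to the input resolution.
        self.up_out = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, f3, f4, f5):
        x = self.up5(self.score5(f5))   # stride 32 -> 16 (FCN32 path)
        x = x + self.score4(f4)         # fuse stride-16 scores (FCN16)
        x = self.up4(x)                 # stride 16 -> 8
        x = x + self.score3(f3)         # fuse stride-8 scores (FCN8)
        return self.up_out(x)           # stride 8 -> full resolution
```

Stopping after the first or second fusion yields the FCN32 and FCN16 variants, which is why the three architectures differ only in how many skip connections they exploit.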

## **4. Automated ground truth labeling for multimodal UAV perception dataset**

Classic methods using trained classifiers would pick hand-designed features (based on several metrics and detectors, as discussed earlier) to parametrize a given sample and assign a label. This allowed creating small, task-specific datasets, whose inferred knowledge could then be used to create bigger datasets in a posterior step. However, the high specificity of the chosen features with respect to the task (generally with expert domain knowledge applied implicitly) made them unsuitable for exporting learning to other domains.

By contrast, deep learning offers several transfer learning options. That is, as proven by Yosinski et al. [26], features trained on a distant-domain dataset are generally useful for different domains and usually better than training from an initial random state. Notice that the transferability of features decreases with the difference between the previously trained task and the target one, and transferring requires that the network architecture is the same at least up to the transferred layers.
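As a minimal illustration of reusing transferred layers (a sketch using torchvision's AlexNet; the layer index is illustrative, not taken from the chapter):

```python
from torchvision.models import alexnet

# Start from weights learned on a distant domain (ImageNet) rather than
# from a random initialization; the architecture must match at least up
# to the transferred layers.
model = alexnet(pretrained=True)

# Freeze the earliest convolutional blocks (the most transferable
# features) and fine-tune only the deeper, more task-specific layers.
for layer in list(model.features)[:6]:
    for param in layer.parameters():
        param.requires_grad = False
```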


With this concept in mind, we decided to build a dataset to train an outdoor industrial pipe detector with pixel-level annotation, able to determine the position of the pipe. Transfer learning allows us to skip building a dataset with several tens of thousands of images; therefore, the authors worked with a few thousand images, which were used to fine-tune the network. These orders of magnitude are still required even for a "shallow" deep network: for instance, AlexNet already presents 60 million parameters.

Capturing and labeling a dataset is a cumbersome task, so we also set out to automate it with minimal human supervision/interaction, exploiting the capabilities of the sensing architecture proposed in earlier works and described in Section 2. This framework, see **Figure 6**, uses the images captured by the UAV camera sensor, the data processed by the chosen localization approach (see Section 2) to obtain the UAV odometry, and pipe detection seeds from the RANSAC technique processing the LiDAR point cloud data. When a pipe (or, generally, a cylinder) is detected and segmented in the sensor data provided by the LiDAR, it is used to produce a label for the temporally near images, identifying the region of the image (the set of pixels) containing the detected pipe or cylinder and its pose w.r.t. the camera. Notice that, even with the perception part running, the camera works at a higher rate than the LiDAR, so the full odometric estimation is used to interpolate between pipe detections, estimating where the label should be projected in the in-between images (just as described for the pipe prediction in Section 2).

**Figure 6.**
*The framework proposed to automatically produce labeled datasets with the multimodal perception UAV.*
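A minimal sketch of the interpolation and label projection steps is shown below. This is an illustration, not the chapter's actual code: `K`, `T_cam_world`, and the cylinder parameters (`axis_pt`, `axis_dir`, `radius`, `length`) are hypothetical names for the camera intrinsics, the interpolated world-to-camera transform, and the RANSAC cylinder estimate, and the real pipeline additionally refines the apparent contour in the image:

```python
import numpy as np

def interpolate_pose(t, t0, p0, t1, p1):
    """Linearly interpolate the camera position between two odometry
    stamps (a full system would also slerp the orientation)."""
    a = (t - t0) / (t1 - t0)
    return (1.0 - a) * p0 + a * p1

def cylinder_mask(K, T_cam_world, axis_pt, axis_dir, radius, length,
                  img_shape, n_axis=50, n_ring=24):
    """Approximate binary label mask obtained by projecting points
    sampled on the cylinder surface with a pinhole camera model."""
    h, w = img_shape
    d = axis_dir / np.linalg.norm(axis_dir)
    # Two unit vectors orthogonal to the cylinder axis.
    tmp = np.array([1.0, 0.0, 0.0]) if abs(d[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(d, tmp); u /= np.linalg.norm(u)
    v = np.cross(d, u)
    mask = np.zeros((h, w), dtype=np.uint8)
    for s in np.linspace(0.0, length, n_axis):
        for th in np.linspace(0.0, 2 * np.pi, n_ring, endpoint=False):
            p_world = axis_pt + s * d + radius * (np.cos(th) * u + np.sin(th) * v)
            p_cam = T_cam_world[:3, :3] @ p_world + T_cam_world[:3, 3]
            if p_cam[2] <= 0.0:          # point behind the camera
                continue
            uvw = K @ p_cam
            px, py = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
            if 0 <= px < w and 0 <= py < h:
                mask[py, px] = 1
    return mask  # sparse hits would be closed with morphology in practice
```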

This methodology was used to create an initial labeled dataset with actual data captured in real industrial scenarios during test and development flights, as will be discussed in the next section.

## **5. Experimental evaluation**

To evaluate the viability of the proposed automated dataset generation methodology, we apply it to capture a dataset and train several semantic segmentation networks with it. To provide some quantitative quality measurement for the solutions produced, we use modified versions of the standard metrics for state-of-the-art deep learning, accounting for the fact that our problem deals with only one semantic class:

• **PA** (pixel accuracy): base metric, defined by the ratio between the properly classified pixels *TP* and the total number of pixels in an image, $pix_{total}$:

$$PA = \frac{TP}{pix_{total}} \tag{1}$$

Notice that usually, besides the PA, the mean pixel accuracy (MPA) is also provided, but in our case, it reduces to the same value as the PA, thus it will not be provided.

• **IoU** (intersection over union): standard metric in segmentation. The ratio is computed between the intersection and the union of two sets, namely the found segmentation and the labeled ground truth. Conceptually, it equals the ratio between the number of correct positives (i.e., the intersection of the sets) *TP*, over all the correct positives, spurious positives *FP*, and false negatives *FN* (i.e., the union of both the ground truth and the segmentation provided). Usually, it is used as the mean IoU (MIU), averaging the same ratio over all classes.

$$IoU = \frac{TP}{TP + FP + FN} \tag{2}$$

An additional metric usually computed along with the MIU is the frequency-weighted MIU, which just weighs the average IoU computed for the MIU according to the relative frequency of each class. The MIU (in our case, simply the IoU) is the most relevant metric and the most widely used when reporting segmentation results (semantic or otherwise).
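As a concrete reference, both metrics can be computed directly from binary masks. The following NumPy sketch assumes boolean arrays; note that the *TP* of Eq. (1) counts all correctly classified pixels (foreground and background) in this single-class setting:

```python
import numpy as np

def pa_and_iou(pred, gt):
    """Pixel accuracy (Eq. 1) and IoU (Eq. 2) for a single-class mask.

    pred, gt: boolean arrays of the same shape (True = pipe pixel).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # correct pipe pixels
    tn = np.logical_and(~pred, ~gt).sum()  # correct background pixels
    fp = np.logical_and(pred, ~gt).sum()   # spurious positives
    fn = np.logical_and(~pred, gt).sum()   # missed pipe pixels
    pa = (tp + tn) / pred.size             # properly classified / pix_total
    iou = tp / (tp + fp + fn)              # Eq. (2)
    return pa, iou
```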

### **5.1 Dataset generation results**

The proposed system was implemented on top of the ROS meta-operating system, just as in previous works [6], where the UAV system used to capture the data is described. A set of real flights was performed in simulated industry environments, flying around a pipe. During these flights, averaging ~240 s, an off-the-shelf (OTS) USB camera was used to capture images (at 640 × 480 resolution), achieving an average frame rate of around 17 fps. This translated into around 20,000 raw images captured, including the parts of the flights where no industry-like elements are present, which are thus of limited use.

Notice that, as per the method described, the pipe to be found can only be labeled automatically when the LiDAR sensor can detect it; thus, the number of images was further reduced by the range limitations of the LiDAR scanner. Other factors, such as vibrations and disruptions in the input or results of the required perceptual data, further reduced the number of images with accurate labels.

Around ~2100 images were automatically labeled with a mask assigning a ground truth for the pipe in the image. After an initial human inspection of the assigned labels, a further ~320 were rejected, obtaining a final set of 1750. The rejected images presented spurious ground truths/masks. Some of them had inconsistent data, where the reprojection of the cylinder detected through RANSAC in the LiDAR scans was not properly aligned (the error could be produced by spurious interpolation of poses, faulty synchronization data from the sensors, or deformation of the UAV frame, as it is impossible for it to be perfectly rigid). Another group presented partial detections (only one of the edges of the pipe visible in the image), making them useless for the apparent contour optimization. A third type of error was produced by the vision-based pipeline, where a spurious mask was generated; commonly, shadows/textures displace or distort the edge, or areas not pertaining to the pipe are assigned due to the similarity of the texture and the complexity of delimiting the areas.


A sample of the labeling process can be seen in **Figure 7**, with the original image, the segmented pipe image, and the approximations to the centroid and bounding box.

**Figure 7.**
*Left: dataset image. Middle: bounding box and centroid of the region detected. Right: segmentation mask image.*

Out of the several options available to test the validity of the dataset produced, the shallow architecture AlexNet was selected, as it could be easily trained and it would provide some insight into the performance that could realistically be expected from a CNN-based approach deployed on the limited hardware of a UAV.

Following previous literature, the dataset was divided into training, validation, and test splits at the standard ratio of 70, 15, and 15%, respectively.

To match the input of AlexNet, the images were resized to 256 × 256 resolution. This was mainly done to reduce the computational load, as the input size could easily be fitted by adjusting some parameters, like the stride. To train and test the network, the PyTorch library was used, which provides full support for its own implementation of AlexNet.
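A sketch of the corresponding data preparation could look as follows; the folder layout and the `PipeDataset` class are hypothetical, as the chapter does not publish its loading code:

```python
from pathlib import Path
from PIL import Image
import torch
from torch.utils.data import DataLoader, Dataset, random_split
from torchvision import transforms

class PipeDataset(Dataset):
    """Hypothetical wrapper over the auto-labeled (image, mask) pairs."""

    def __init__(self, root):
        self.images = sorted(Path(root, "images").glob("*.png"))
        self.masks = sorted(Path(root, "masks").glob("*.png"))
        self.tf = transforms.Compose([
            transforms.Resize((256, 256)),  # match the network input
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = self.tf(Image.open(self.images[i]).convert("RGB"))
        mask = self.tf(Image.open(self.masks[i]).convert("L"))
        return img, (mask > 0.5).long().squeeze(0)  # {0: background, 1: pipe}

ds = PipeDataset("pipe_dataset")                   # hypothetical folder layout
n_train, n_val = int(0.70 * len(ds)), int(0.15 * len(ds))
train_ds, val_ds, test_ds = random_split(
    ds, [n_train, n_val, len(ds) - n_train - n_val],
    generator=torch.Generator().manual_seed(0))    # reproducible 70/15/15 split
train_loader = DataLoader(train_ds, batch_size=20, shuffle=True)
```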

To produce some metrics relevant to the just-trained network architecture, a modified version of the technique used to label the dataset was used. Note that this approach, as described in previous sections, uses LiDAR, cameras, and odometry to: acquire an initial robust detection (from the LiDAR), track its projection and predict it in the camera image space (using odometric data), and finally determine its edges/contour in the image. The robustness of the LiDAR detection is mainly due to exploiting prior knowledge (in the form of the known radius of the pipe to detect) that cannot be introduced into the AlexNet architecture, so a direct comparison would not be meaningful. Hence, a modified method, referred to as NPMD (no-priors multimodal detector), was employed to estimate the accuracy of the earlier detector without priors. The main difference was modifying the LiDAR pipeline to detect several pipes with different radii (as the radius should be considered unknown). This led to the appearance of false positives and spurious measurements, which in turn weakened the results produced by the segmentation part of the visual pipeline.

Thus, the FCN with AlexNet classification was trained using a pre-trained model for AlexNet, with standard stochastic gradient descent (SGD) with a momentum of 0.9. A learning rate of $10^{-3}$ was used, according to known literature, with image batches of 20. The weight decay and the bias learning rate multiplier were set to the standard values of $5 \cdot 10^{-4}$ and 2, respectively. Without any prior data, and with no benefit from doing otherwise reported in previous works, the classifier layer was initialized to 0, and the dropout layers in AlexNet were left unmodified. This trained model produced the results found in **Table 1**.
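The optimizer setup described above can be reproduced with a short PyTorch sketch; this is an approximation, not the original training script, with the doubled bias learning rate implemented through parameter groups:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import alexnet

# Pretrained AlexNet backbone; the FCN segmentation head of Section 3
# would be attached on top of model.features.
model = alexnet(pretrained=True)

base_lr, weight_decay = 1e-3, 5e-4
weights = [p for name, p in model.named_parameters() if not name.endswith("bias")]
biases = [p for name, p in model.named_parameters() if name.endswith("bias")]

# Biases get twice the base learning rate and no weight decay, as is
# standard in FCN training setups.
optimizer = optim.SGD(
    [{"params": weights, "lr": base_lr, "weight_decay": weight_decay},
     {"params": biases, "lr": 2 * base_lr, "weight_decay": 0.0}],
    lr=base_lr, momentum=0.9)

criterion = nn.CrossEntropyLoss()  # per-pixel loss over {background, pipe}
```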



**Table 1.**

*Experimental results obtained by AlexNet-based FCN.*

It can be seen that eliminating the seed/prior data from the multimodal detector made it rather weak, with very low IoU values, signaling the presence of spurious detections and probably false positives. The FCN-based solution was around 1.5 times better at segmenting the pipe, making it the clear winner. This was to be expected, as we deliberately removed one of the key factors contributing to the robustness of the LiDAR-based RANSAC detection, the radius priors, leading to the appearance of spurious detections.

It is worth noting that, although the results are not that strong in terms of the metrics achieved for a single-class case, there are no other vision-only pipe detectors with better results in the literature, nor other approaches actually tested on real UAV platforms, such as the authors' previous works [6].
