**3. Semantic segmentation problem: classic approaches and deep learning (DL)**

In the context of computer vision, the semantic segmentation problem consists in determining which regions of an image contain an object of a given category; that is, a class or label is assigned to a given area (be it a pixel, a window, or a segmented region). The different granularities accepted reflect how the technique and its solutions evolved: for a long time it was completely unfeasible to produce pixel-wise solutions, so images were split according to different procedures, which added a layer of complexity to the problem.

Current off-the-shelf technologies have changed this paradigm: GPUs offer huge parallelization capabilities, while solid-state drives make fast, reliable storage cheap. These technical advancements have dramatically increased the performance, the affordable model complexity, and the memory available for data representation, especially for techniques that are inherently strong in highly parallelized environments. One of the fields where the impact has been most noticeable is the artificial intelligence community, where artificial neural networks (ANNs) have seen a resurgence thanks to the support this kind of hardware provides to otherwise computationally unfeasible techniques. The most impactful development in recent years has been the convolutional neural network (CNN), which has become the most popular computer vision approach for several of the classic problems and the default solution for semantic segmentation.

To understand the impact of deep learning on our proposed solution, we briefly discuss how the classic segmentation pipeline worked and how CNN-based classifiers evolved into the modern semantic segmentation techniques.

### **3.1 Classic semantic segmentation pipeline**

The classic semantic segmentation pipeline can be split into two generic blocks, namely image processing for feature extraction and feature-level classification. The first block generally includes any image preprocessing and resizing/resampling, splitting the image into regions/windows, defining the granularity level of the classification, and, finally, extracting the features themselves. The features can be of any type, and frequently the ones fed to the classification modules are a composition of several individual features from different detectors. The use of different window/region-based approaches helps build up higher-level features, and the classification can be refined at later stages with data from adjacent regions.
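As an illustration, the following is a minimal sketch of such a pipeline, assuming superpixels as regions, simple colour statistics as features, and an SVM as the feature-level classifier (these choices are illustrative only and are not the ones used in this chapter):

```python
# Hypothetical sketch of a classic segmentation pipeline:
# superpixels -> per-region handcrafted features -> per-region classifier.
import numpy as np
from skimage.segmentation import slic   # region/window generation
from sklearn.svm import SVC             # feature-level classifier

def region_features(image, labels, region_id):
    """Simple handcrafted descriptor: mean colour and colour variance of a region."""
    pixels = image[labels == region_id]
    return np.concatenate([pixels.mean(axis=0), pixels.var(axis=0)])

def extract_dataset(image, ground_truth_mask, n_segments=200):
    """Split the image into superpixels and build (feature, label) pairs."""
    labels = slic(image, n_segments=n_segments, compactness=10.0)
    feats, targets = [], []
    for rid in np.unique(labels):
        feats.append(region_features(image, labels, rid))
        # Region label by majority vote over the ground-truth mask.
        targets.append(int(ground_truth_mask[labels == rid].mean() > 0.5))
    return np.stack(feats), np.array(targets), labels

# Training (the ground-truth mask is an input of the training process):
# image: HxWx3 float array, ground_truth_mask: HxW binary array (assumed shapes).
# X, y, _ = extract_dataset(image, ground_truth_mask)
# clf = SVC(kernel="rbf").fit(X, y)
```

At inference time, the same `region_features` would be computed for each superpixel of a new image and fed to the trained classifier, producing one label per region.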

Notice that this kind of architecture generally relies on classifiers that require very accurate domain knowledge, or a dataset in which the classes to learn are specified for each input, so that they can be trained. **Figure 3** shows a block diagram of this classic semantic segmentation pipeline. Notice that, to train the classifier, the image mask or classification result also becomes an input of the training process.

**Figure 3.** *Block diagram of a classical architecture approach for semantic segmentation using computer vision.*

It can thus be seen that solving the semantic segmentation problem through classic pattern recognition methods requires acute insight into the specifics of the problem domain, as the features to be detected and extracted are built/designed specifically for it. This implies (as mentioned earlier) working from low-level features and explicitly deriving the higher-level features from them, which is a very complex problem in itself, as these features are affected by the input characteristics, by what is to be found/discriminated, and by which techniques will be used in the classification part of the pipeline.

### **3.2 The segmentation problem with deep learning**

Modern semantic segmentation techniques have evolved organically with the rise of the deep learning field to its current prominence. This evolution can be seen as a refinement in the scale of the inference produced, from very coarse (image-level probabilistic detection) to very fine (pixel-level classification). The earliest ANN examples made probabilistic predictions about the presence of an object of a given class, that is, detection of objects with an assigned probability. The next step, achieved thanks to increased parallelization and network depth, was to start tackling the localization problem, providing centroids and/or bounding boxes for the detected classes (the use of *classes* instead of *objects* here is deliberate, as the instance segmentation problem, separating adjacent objects of the same class, would be dealt with much later).

The first big breakthrough in the classification problem came with AlexNet [13] in 2012, when it won the ILSVRC challenge with a score of 84.6% in the top-5 accuracy test, while the next best score, based on classic techniques, was only 73.8%. AlexNet has since become a well-known standard and a default network architecture for testing problems, as it is actually not very deep or complex (see **Figure 4**). It presents five convolutional layers, with max-pooling after the first two, three fully connected layers, and ReLU activations to deal with non-linearities. This clear victory of the CNN-based approaches was confirmed two years later by Oxford's VGG16 [16], one of several architectures presented to the 2014 ILSVRC challenge, which achieved a 92.7% score. While several other networks with deeper architectures have been presented, the most relevant developments focused on introducing new types of structures into the networks. GoogLeNet [17], the 2014 ILSVRC winner, achieved victory thanks to the novel contribution of the inception module, which validated the concept that the layers of a CNN could be arranged in ways other than the classic sequential approach. Another relevant contribution produced by the technology giants was ResNet [18], which scored a win for Microsoft in 2015. The introduction of residual blocks allowed them to increase the depth to 152 layers while keeping the initial data meaningful for training the deeper layers. A residual block essentially forwards a copy of the inputs it receives; thus, later layers receive both the results and the inputs of prior layers and can learn from the residuals. More recently, the ReNet [19] architecture was used to extend recurrent neural networks (RNNs) to multidimensional inputs.
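To make the residual mechanism concrete, a minimal residual block could be sketched as follows (a simplified version in PyTorch; the actual ResNet blocks also include batch normalization and, where needed, a projection on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: the block input is added back to the
    output of the convolutions, so the layers only have to learn a residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # copy of the received input
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)   # element-wise sum with the input

# x = torch.randn(1, 64, 56, 56)
# y = ResidualBlock(64)(x)   # same shape as x
```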

The jump from the classification problem with some spatial data to pixel-level labeling (refining the inference from image/region level to pixel level) was presented by Long et al. [20] with the fully convolutional network (FCN). The method they proposed was based on using full classifiers (like the ones just discussed) as layers of a convolutional network architecture. The FCN architecture and its derivatives, such as U-Net [21], are among the best solutions to semantic segmentation for most domains. These derivatives may also incorporate classic methods, such as the conditional random fields [23] used by DeepLab [22], which reinforce the inference from spatially distant dependencies that are usually lost due to the spatial invariance of CNNs. The latest promising contributions to the semantic segmentation problem are based on encoder-decoder architectures, also known as autoencoders, such as SegNet [24].
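As a rough, hypothetical sketch (not SegNet itself, which additionally reuses the encoder's pooling indices during decoding), an encoder-decoder segmentation network reduces the spatial resolution in the encoder and restores it in the decoder:

```python
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder ('autoencoder-like') segmentation sketch."""
    def __init__(self, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # downsample x2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # downsample x2
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),   # upsample x2
            nn.ConvTranspose2d(32, n_classes, 2, stride=2),       # upsample x2
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))   # per-pixel class scores
```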

For the works discussed in this chapter, an FCN16 model with an AlexNet backbone was used as the semantic segmentation model. The main innovation introduced by the general FCN was exploiting the classification power of a common DL classification network via convolution while, at the same time, reversing the downsampling effect of the convolution operations themselves. Taking AlexNet as an example, as seen in **Figure 4**, convolutional layers apply a filter-like operation while reducing the size of the data forwarded to the next layer. This process produces more descriptive "deep features" but, at the same time, removes the high-level information describing the spatial relations between the features found. Thus, in order to exploit the features from the deep layers while keeping the spatial-relation information, data from multiple layers have to be fused (with element-wise summation).
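As a rough illustration of this downsampling effect, the following AlexNet-like convolutional stack (following the layer description above; the exact channel counts and sizes are only assumptions for this example) progressively shrinks the spatial size of the feature maps:

```python
import torch
import torch.nn as nn

# AlexNet-style feature extractor (simplified for illustration):
# five convolutional layers, with max-pooling after the first two.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 224, 224)
for layer in features:
    x = layer(x)
    if isinstance(layer, (nn.Conv2d, nn.MaxPool2d)):
        print(type(layer).__name__, tuple(x.shape))   # spatial size keeps shrinking
```

Running it on a 224 × 224 input shows the spatial resolution dropping from 55 × 55 after the first convolution to 13 × 13 at the deepest layer; this lost resolution is precisely what the FCN fusion tries to recover.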



**Figure 4.** *Diagram of the AlexNet architecture, showcasing its pioneering use of convolutional layers.*

**Figure 5.** *Detail of the skip architectures (FCN32, FCN16, and FCN8) used to produce results with data from several layers to recover both deep features and spatial information from shallow layers (courtesy of [25]).*

In order to produce this fusion, data from the deeper layers are upsampled using deconvolution (transposed convolution). Notice that features from shallow layers are coarser (less abstract) but retain more spatial information. Thus, three different fusion levels can be processed through the FCN (FCN32, FCN16, and FCN8), depending on how many layers are deconvolved and fused, as seen in **Figure 5**.
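A minimal sketch of this FCN16-style fusion step could look as follows (the channel counts, feature-map sizes, and layer names are assumptions for illustration; the reference FCN implementation [20, 25] additionally crops the maps to align them and initializes the deconvolutions as bilinear upsampling):

```python
import torch
import torch.nn as nn

n_classes = 2   # e.g. pipe / background

# 1x1 convolutions turn feature maps from two depths into class score maps.
score_deep    = nn.Conv2d(256, n_classes, kernel_size=1)   # deepest features (lowest resolution)
score_shallow = nn.Conv2d(192, n_classes, kernel_size=1)   # shallower features (higher resolution)

# Deconvolutions: x2 to match the shallower map, then x16 back towards input size.
upsample_x2  = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=4, stride=2, padding=1)
upsample_x16 = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=32, stride=16, padding=8)

deep    = torch.randn(1, 256, 7, 7)     # assumed deep feature map
shallow = torch.randn(1, 192, 14, 14)   # assumed shallow feature map

fused = upsample_x2(score_deep(deep)) + score_shallow(shallow)   # element-wise summation
pixel_scores = upsample_x16(fused)      # back to (approximately) input resolution
print(pixel_scores.shape)               # torch.Size([1, 2, 224, 224])
```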

More information on the detailed workings of the different FCN models can be found in [25]. It is worth noting that the more shallow layers are fused, the more accurate the model becomes; however, according to the literature, the gain from FCN16 to FCN8 is minimal (below 2%).
