**3. Example of concrete crack inspection using CNN**

#### **3.1 Textbook example of crack inspection using CNN**

This section provides an example of how convolution, pooling, fully-connected layers, and benchmarking are applied in real-world concrete crack inspection using CNN. The calculations were carried out in the Python programming language with the PyTorch package.

#### *3.1.1 Dataset*

In this example, the input images were gathered from Kaggle, a well-known data science community. Kaggle provides access to thousands of public datasets covering a wide range of topics, including medicine, agriculture, and construction [9]. Searching for "concrete crack" in the Kaggle datasets module returned 12 datasets. The "SDNET2018" dataset was chosen from among them since it comprises sufficient and clean concrete surface images with and without cracks [10]. The 56,096 images in "SDNET2018" were captured on the Utah State University campus using a 16-megapixel Nikon digital camera and cover 54 bridge decks, 72 walls, and 104 pavements. In this example, only the images of walls and pavements were used to demonstrate the comparison between manual inspection and CNN-based automatic inspection. Therefore, 42,472 images were used as the training and testing dataset. Among them, 6459 cracked concrete surfaces form the positive class, with cracks as narrow as 0.06 mm and as wide as 25 mm, while 36,013 uncracked concrete surfaces form the negative class. Images in this dataset contain a range of obstructions, such as shadows, surface roughness, scaling, edges, and holes. These diverse photographic backgrounds help to ensure the robustness of the designed CNN architecture. The cracked and uncracked concrete images were randomly separated into training and testing datasets at a ratio of 80/20. The input images were resized to 227 × 227 × 3 for AlexNet and 224 × 224 × 3 for VGG16. **Table 4** shows the details of the input images. **Figure 6** shows examples of the input images.
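The 80/20 random split described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the split is stratified per class (the text only states that the split was random); the helper name `stratified_split` and the seed are hypothetical, and the exact counts in Table 4 may differ by rounding.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Randomly split sample indices into train/test sets, class by class."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Class sizes taken from the text: 1 = cracked (positive), 0 = uncracked (negative)
labels = [1] * 6459 + [0] * 36013
train_idx, test_idx = stratified_split(labels)

train_cracked = sum(labels[i] for i in train_idx)
print("training images:", len(train_idx), "of which cracked:", train_cracked)
print("testing images:", len(test_idx))
```

Splitting each class separately keeps the ratio of cracked to uncracked images the same in both subsets, which matters here because the negative class is more than five times larger than the positive class.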

#### *3.1.2 CNN architecture*

In this section, two pre-trained CNN networks, AlexNet and VGG16, are introduced to illustrate the CNN computation process. AlexNet was designed as an eight-layer architecture. VGG16 is twice as deep as AlexNet. According to [11, 12], the depth of a CNN network has a significant impact on model performance.


#### **Table 4.** *Details of prepared dataset.*

*Computer Vision-Based Techniques for Quality Inspection of Concrete Building Structures DOI: http://dx.doi.org/10.5772/intechopen.104405*

**Figure 6.** *Examples of cracked and non-cracked surface.*

Therefore, training and testing AlexNet and VGG16 on the prepared dataset highlights how network depth affects prediction performance and computation cost.

#### 1. AlexNet architecture

The AlexNet architecture, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012, is considered one of the most influential CNN architectures [13]. AlexNet consists of five convolution layers and three fully-connected layers. Max-pooling layers follow the first, second, and fifth convolution layers. AlexNet was designed to classify images into 1000 object classes and was trained on 1.2 million input images with a size of 224 × 224 × 3 pixels. As a result, 60 million parameters and 650,000 neurons are included in the computation process. The details of the AlexNet architecture are shown in **Figure 7**.

In the first convolution stage, 96 convolution filters with a size of 11 × 11 are applied with a stride of four pixels. The pooling filters have a size of 3 × 3 and move with a stride of two. It is worth noting that the error rate can be reduced by this overlapping pooling technique (the stride is smaller than the size of the pooling filter). In the second convolution stage, the size of the convolution filters decreases from 11 × 11 to 5 × 5, while their number increases from 96 to 256. In the third and fourth convolution stages, the filter size decreases further from 5 × 5 to 3 × 3, while the number of filters increases from 256 to 384. In the last convolution stage, the filter size remains 3 × 3 and the number of filters returns to 256. The size and stride of the pooling filters remain the same in the second and fifth convolution stages. Finally, the first and second fully-connected layers each contain 4096 neurons. The final fully-connected layer contains 1000 neurons, activated by the softmax function, to output the probabilities of the 1000 classes.
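The filter sizes and strides above determine the spatial size of every feature map through the relation output = floor((input + 2 × padding − kernel) / stride) + 1. The sketch below traces this for a 227 × 227 × 3 input. The padding values and the stride of one for the second to fifth convolution stages are not stated in the text; they are assumptions taken from common AlexNet implementations, and the names `output_size`, `ALEXNET_STAGES`, and `trace_shapes` are illustrative only.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, output channels)
# Padding and the stride of 1 for conv2-conv5 are assumptions, not from the text.
ALEXNET_STAGES = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, 96),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, 256),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, 256),
]


def trace_shapes(input_size=227):
    """Return (name, height, width, channels) for every stage."""
    shapes = []
    size = input_size
    for name, kernel, stride, padding, channels in ALEXNET_STAGES:
        size = output_size(size, kernel, stride, padding)
        shapes.append((name, size, size, channels))
    return shapes


shapes = trace_shapes()
for name, height, width, channels in shapes:
    print(f"{name}: {height} x {width} x {channels}")

# Number of values fed into the first fully-connected layer
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("flattened features:", flattened)
```

Under these assumptions the feature maps shrink from 55 × 55 after the first convolution to 6 × 6 after the last pooling layer, so 6 × 6 × 256 = 9216 values enter the first fully-connected layer.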

The outputs of each convolution and fully-connected layer are activated by a nonlinear function, namely the Rectified Linear Unit (ReLU) [14]. The AlexNet authors showed that using ReLU instead of saturating activation functions markedly improves training efficiency, especially for larger architectures trained on larger datasets. The local response normalization (LRN) technique [15] is also applied after the ReLUs to reduce the error rate. Moreover, to avoid overfitting, the dropout technique [16] is applied in the first two fully-connected layers. The dropout probability was set to 0.5.

**Figure 7.** *Details of AlexNet architecture.*

AlexNet was trained using stochastic gradient descent (SGD). The batch size, momentum [17], and weight decay [18] were set to 128, 0.9, and 0.0005, respectively. The learning rate was initialized at 0.01 and reduced by a factor of 10 three times during training, ending at 0.00001. AlexNet was trained for roughly 90 epochs on two NVIDIA GTX 580 3GB GPUs. As a result, the top-1 and top-5 error rates of AlexNet on the test set reached 37.5% and 17.0%, roughly 10 percentage points lower than the best previously reported results.
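The role of the momentum and weight-decay settings can be illustrated with the update rule given in the AlexNet paper [13]: v ← 0.9·v − 0.0005·ε·w − ε·g, followed by w ← w + v, where ε is the learning rate and g is the gradient. The sketch below applies this rule to a hypothetical one-parameter quadratic loss rather than to a network; the loss function, the learning rate of 0.01, and the number of steps are assumptions chosen only to keep the example small.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay (AlexNet-style rule)."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v


def toy_gradient(w, target=3.0):
    """Gradient of the toy loss 0.5 * (w - target)**2 (an illustrative assumption)."""
    return w - target


w, v = 0.0, 0.0
for _ in range(1000):
    w, v = sgd_momentum_step(w, v, toy_gradient(w), lr=0.01)

print(f"w after training: {w:.4f}")
```

The weight settles slightly below the loss minimum of 3.0 because weight decay pulls it towards zero; with a decay of 0.0005 the shift is small, which is why such a value regularizes the model without dominating the gradient.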

#### 2. VGG16 architecture

VGG16, designed by Karen Simonyan and Andrew Zisserman in 2015, was developed to investigate the influence of convolution network depth on prediction accuracy in larger datasets [19]. Therefore, VGG16 was designed as a deep architecture with 16 weight layers, including 13 convolution layers and three fully-connected layers. Convolution layers in VGG16 are presented as five convolution blocks. The details of the VGG16 architecture are shown in **Figure 8**.


**Figure 8.** *Details of VGG16 architecture.*

As seen from **Figure 8**, the first two convolution blocks each contain two convolution layers, and the following three blocks each contain three convolution layers. All convolution filters have a uniform size of 3 × 3 and move with a stride of one. The number of convolution filters increases gradually from 64 to 128, 256, and 512 across the five convolution blocks. To preserve information about image boundaries as completely as possible, spatial padding is applied [20]. As with AlexNet, ReLU is applied as the non-linearity for the convolution and fully-connected outputs. However, unlike in AlexNet, LRN is not used in VGG16 because the authors found that LRN does not improve model performance while increasing memory consumption and computation time.

A max-pooling layer follows the last convolution layer in each of the five blocks. The max-pooling filters have a uniform size of 2 × 2 and a stride of two. As with AlexNet, the first two fully-connected layers have 4096 neurons each, and the final fully-connected layer has 1000 output neurons activated by softmax. To avoid overfitting, the dropout technique is also applied in the first two fully-connected layers, with a dropout ratio of 0.5. In summary, the most important novelties of VGG16 compared with AlexNet are (1) the deeper architecture and (2) the uniform, small convolution filters.
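The layer counts described above can be checked with a short calculation. The sketch below tallies the weight layers, the spatial size after the five pooling layers, and the number of trainable parameters for a 224 × 224 × 3 input. It assumes a padding of one pixel for every 3 × 3 convolution, so that convolutions preserve the spatial size, and a bias term in every layer; the names `VGG16_BLOCKS`, `FC_LAYERS`, and `summarize_vgg16` are illustrative only.

```python
# Output channels of the convolution layers in each of the five blocks
VGG16_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]
# Neurons in the three fully-connected layers
FC_LAYERS = [4096, 4096, 1000]


def summarize_vgg16(input_size=224, in_channels=3):
    """Count weight layers, final feature-map size, and parameters."""
    size, channels = input_size, in_channels
    params, conv_layers = 0, 0
    for block in VGG16_BLOCKS:
        for out_channels in block:
            # 3 x 3 filters, stride 1, padding 1 (assumed): spatial size unchanged
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
            conv_layers += 1
        size //= 2  # 2 x 2 max-pooling with stride 2 halves the spatial size

    features = size * size * channels
    for out_features in FC_LAYERS:
        params += features * out_features + out_features
        features = out_features

    return {
        "conv_layers": conv_layers,
        "weight_layers": conv_layers + len(FC_LAYERS),
        "final_feature_map": (size, size, channels),
        "parameters": params,
    }


summary = summarize_vgg16()
for key, value in summary.items():
    print(f"{key}: {value}")
```

Most of the parameters sit in the first fully-connected layer, which connects the 7 × 7 × 512 feature map to 4096 neurons; the 13 convolution layers together account for only about a tenth of the total.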

In the training process, the batch size, momentum, weight decay, and learning rate were set to 256, 0.9, 0.0005, and 0.0001, respectively. As a result, the top-1 and top-5 errors of VGG16 reached 24.4% and 7.2%, which are about 13 and 9.8 percentage points lower than those of AlexNet. The result indicated that the deep architecture and small convolution filters have a positive influence on CNN performance.

#### *3.1.3 Training and benchmarking*

Finally, the prepared dataset described in Section 3.1.1 was used to train and test AlexNet and VGG16. The training and testing process was conducted in Kaggle kernels [21]. A Kaggle kernel, provided by the Kaggle community, is a virtual environment equipped with an NVIDIA Tesla K80, a dual-GPU design with 24 GB of GDDR5 memory. This computing performance enables training and testing processes that are 5–10 times faster than on CPU-only devices. Both AlexNet and VGG16 were trained using SGD. The batch size and learning rate were set to 64 and 0.0001, respectively. To avoid overfitting, dropout was applied at the fully-connected stage with a dropout probability of 0.5.

Python was used to program the computing process, with the PyTorch library imported. The whole computation took roughly 2 h for AlexNet and 4 h for VGG16. The models' performance on the training and testing datasets is shown in **Figures 9** and **10**, respectively. The loss and accuracy values are represented on the vertical axis, while the training epochs are represented on the horizontal axis. Since the loss and accuracy showed little further variation after the 60th epoch, and overtraining the model could lead to overfitting [22], the number of training epochs was set to 60.

**Figure 9.** *Training loss and accuracy of AlexNet and VGG16.*

**Figure 10.** *Testing loss and accuracy of AlexNet and VGG16.*

As shown in **Figure 9**, both AlexNet and VGG16 converged successfully. The training loss of AlexNet decreased steadily from 0.43 to 0.05 by the 58th epoch and then remained constant in subsequent epochs. Similarly, AlexNet's training accuracy increased from 0.85 to 0.98 by the 58th epoch. The training loss of VGG16 dropped from 0.42 to 0.01 by the 35th epoch and subsequently stayed steady at approximately 0.008–0.01. The training accuracy of VGG16 increased from 0.85 to 0.99 by the 34th epoch and then remained at 0.99. The results revealed that VGG16 performed better during training. VGG16 converged nearly twice as fast as AlexNet, its minimum training loss is about 0.04 lower than AlexNet's, and its maximum accuracy is 0.01 higher. This suggests that deeper CNN designs converge faster on larger datasets, which contributes to producing more reliable weight and bias matrices. These results are in accordance with those reported in [23].

**Figure 10** shows the loss and accuracy variations of AlexNet and VGG16 on the testing dataset. The testing loss and accuracy are consistent with the trends of the training loss and accuracy, indicating that neither AlexNet nor VGG16 had overfitting or underfitting problems. VGG16 also outperformed AlexNet in the testing process. AlexNet and VGG16 have minimum testing losses of 0.01 and 0.00003, respectively. AlexNet's maximum accuracy was 0.98, and VGG16's was 0.99. On the testing dataset, VGG16 converged at the 34th epoch, nearly twice as fast as AlexNet.

The confusion matrices of AlexNet and VGG16 are shown in **Table 5**. The accuracy scores of AlexNet and VGG16 are nearly identical, indicating that the two models have similar overall prediction abilities for cracked and uncracked concrete surfaces. VGG16 has a precision and recall of 96.5% and 89.6%, respectively, which are nearly 1 and 5 percentage points higher than those of AlexNet. The higher precision shows that VGG16 outperforms AlexNet on samples predicted as positive (cracked surfaces), while the higher recall means that more of the actually cracked images are correctly identified by VGG16. AlexNet and VGG16 have F1-scores of 89.6% and 92.9%, respectively, indicating that the VGG16 model is more robust.
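For reference, the four scores discussed here follow directly from the entries of a binary confusion matrix. The sketch below is illustrative only: the counts passed to `classification_metrics` are hypothetical and are not the values of Table 5, and both function names are assumptions. The last lines check that the F1-score reported for VGG16 is consistent with its reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from a binary confusion matrix.

    tp: cracked images predicted as cracked
    fp: uncracked images predicted as cracked
    fn: cracked images predicted as uncracked
    tn: uncracked images predicted as uncracked
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT the values of Table 5
example = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in example.items():
    print(f"{name}: {value:.3f}")

# Consistency check using the precision and recall reported for VGG16
vgg16_f1 = f1_from(0.965, 0.896)
print(f"VGG16 F1 from reported precision and recall: {vgg16_f1:.3f}")
```

In the hypothetical example the accuracy is 0.96 even though a quarter of the cracked images are missed, which illustrates why accuracy alone is a weak criterion when uncracked images dominate the dataset.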


#### **Table 5.** *Confusion matrix of AlexNet and VGG16.*

In conclusion, VGG16 demonstrates better performance. Since it is important not to miss any cracked surface, the model with the higher recall and F1-score is preferable. Meanwhile, AlexNet remains a reasonable option when the numbers of cracked and uncracked images are balanced, because its accuracy is similar to that of VGG16 and its computation cost is lower.

#### **3.2 Comparison of CNN and manual inspection**

During the on-site construction quality management process, quality control managers (QCMs) or registered inspectors (RIs) are responsible for personally inspecting and reporting quality problems with forms, reports, and photocopies. According to the Mandatory Building Inspection Scheme (MBIS) and related contract regulations, QCMs and RIs are obliged to examine cracks and other defects in building components visually or with non-destructive equipment [24]. Examples include (1) cracks on structural components, e.g., structural beams and columns, (2) cracks on external finishes, e.g., tiling, rendering, and cladding, and (3) cracks on fins, grilles, windows, and curtain walls.

With computer-vision-based inspection techniques, by contrast, QCMs and RIs no longer need to conduct the aforementioned inspection tasks on-site. Instead, their primary responsibilities may switch to (1) taking photos or videos of building components, and (2) inputting the images and videos into pre-trained CNN models. To highlight the differences between manual and computer-vision-based crack inspection, an experiment was set up to calculate and compare inspection time and cost.

The layout of the experiment is shown in **Figure 11**. Suppose the experiment case is a 15 m × 15 m × 2 m residential building located in San Bernardino. The inspection items include cracks on the slab, internal walls, and external walls. According to Dohm [25], the total manual inspection time for a 1600–2600 ft<sup>2</sup> home in San Bernardino is around 13.65 h, including inspection items such as the building slab and shear walls. The manual inspection service cost is around \$85.9 per hour.

In the computer-vision-based inspection process described above, the total inspection time consists of the time for taking images or videos and the time for CNN processing. Assume that the input videos are recorded with handheld camera devices while the QCMs or RIs walk through the building. The time for taking videos can then be taken as the walking time.

**Figure 11.** *Layout of the experiment case.*


**Figure 12.** *Walking path of the inspectors.*


#### **Table 6.** *Time and cost of manual and computer-vision-based crack inspection.*

The average walking speed of people aged 20–49 is around 1.42 m/s [26]. To allow for the delays involved in recording video, the walking speed is taken as 0.1 m/s. Suppose the walking path follows an S-curve, as shown in **Figure 12**.

According to [27], the widely accepted frame rate is 24 frames per second (FPS). Suppose the inspector begins recording video when taking the first step, so that the duration of the captured video equals the walking time, here 2390 s. The number of input images converted from the captured video is then 2390 s × 24 FPS = 57,360. According to the testing time of the textbook example in Section 3.1 and the study outcomes of [28], the CNN processes around 100 images per second. The CNN processing time is therefore 57,360/100 = 573.6 s, and the cost of computer-vision-based crack inspection is (2390 s + 573.6 s) × (\$85.9/3600 s) = \$70.7.

**Table 6** summarizes the calculation of the time and cost of manual and computer-vision-based crack inspection. The CNN-based technique reduces both: the total inspection time decreases from 13.65 to 0.8 h, and the inspection cost decreases from \$1172.5 to \$70.7.
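The arithmetic behind these figures can be reproduced with the short script below. All input values are taken from the text; the constant names are illustrative. Following the cost formula above, the script assumes that the inspector's hourly rate is charged for the CNN processing time as well as for the walking time.

```python
# Inputs taken from the text
WALKING_TIME_S = 2390      # duration of the recorded video, in seconds
FRAME_RATE_FPS = 24        # frames per second
CNN_THROUGHPUT = 100       # images processed per second
HOURLY_RATE_USD = 85.9     # inspection service cost per hour
MANUAL_TIME_H = 13.65      # manual inspection time, in hours

# Computer-vision-based inspection
frames = WALKING_TIME_S * FRAME_RATE_FPS
cnn_time_s = frames / CNN_THROUGHPUT
cv_time_h = (WALKING_TIME_S + cnn_time_s) / 3600
cv_cost = cv_time_h * HOURLY_RATE_USD

# Manual inspection
manual_cost = MANUAL_TIME_H * HOURLY_RATE_USD

print(f"frames: {frames}")
print(f"CNN processing time: {cnn_time_s:.1f} s")
print(f"computer-vision-based: {cv_time_h:.1f} h, ${cv_cost:.1f}")
print(f"manual: {MANUAL_TIME_H} h, ${manual_cost:.1f}")
```

Most of the computer-vision-based time is spent recording the video rather than running the CNN, so under these assumptions the walking speed is the parameter with the largest effect on the result.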

### **4. Conclusion**

To facilitate automatic building quality inspection and management, this study introduced a computer-vision-based automated concrete crack inspection technique. In order to demonstrate the computing and benchmarking process, the mathematical basis of one of the most essential computer vision algorithms, the convolutional neural network (CNN), was first detailed.

The theoretical foundation was then explained using a textbook example. In this case, the input dataset "SDNET2018" was obtained from the Kaggle community. Its 56,096 images were acquired with a digital camera on the Utah State University campus. Two fundamental CNN architectures, AlexNet and VGG16, were chosen to be trained on the input images. The PyTorch library was used to carry out the training process in the Kaggle kernel. The models' performance was evaluated using a confusion matrix. The results revealed that the prediction accuracy of AlexNet and VGG16 is nearly identical. However, VGG16's precision and recall are higher than AlexNet's, indicating that VGG16 has a stronger capacity to identify cracked surfaces. VGG16's F1-score is also greater than AlexNet's, signifying that VGG16 is more robust. VGG16 is therefore the preferred model, since higher precision, recall, and F1-score are crucial when distinguishing cracked from uncracked surfaces. When the numbers of cracked and uncracked images are almost the same, however, AlexNet is a feasible alternative because of its high accuracy score and low computation cost. It is worth noting that deeper and broader CNN architectures outperform shallow ones on larger datasets.

Next, an experimental case was designed to compare manual and computer-vision-based crack inspection in terms of time and cost. The results showed that efficiency and cost-effectiveness can be improved by adopting computer-vision-based techniques. For the designed case, the inspection time decreases from 13.65 to about 0.8 h, and the inspection cost decreases from \$1172.5 to \$70.7.

The findings help to demonstrate the computer-vision-based quality inspection technique in both theory and practice. Although the recently developed computer-vision-based technology improves the efficiency, cost-effectiveness, and safety of human quality inspection, it still relies primarily on the quality of the collected images. Some concrete surface images are difficult to capture in real-life situations, for example on high-rise buildings, at component corners, and in buildings in extremely harsh environments. To address this issue, robotics techniques are growing rapidly as a means of upgrading computer-vision-based quality inspection [29]. Previous research has begun to use mobile robots, such as UAVs, to gather surface images [30–32]. Some studies have focused on exploring robotic inspection systems to raise the level of automation in quality inspection [33, 34]. Therefore, merging robotics and computer vision approaches may be considered a worthwhile future research direction to improve the efficiency and accuracy of manual quality control and management.

### **Acknowledgements**

The authors gratefully acknowledge the funding support of the full-time PhD research studentship under the auspices of the Department of Building and Real Estate, The Hong Kong Polytechnic University, Hong Kong. The authors would like to express their deepest gratitude to Prof. Heng Li for his guidance. Finally, the authors would like to acknowledge the research team members (Mr. King Chi Lo, Mr. Qi Kai) and everyone who provided help and comments to improve the content of this article.

#### **Conflict of interest**

All authors declare that they have no conflicts of interest.

