#### **1. Introduction**

Throughout the entire construction life cycle, quality assessment plays an important role in ensuring the safety, economy, and long-term viability of construction activities. Construction products that have been fully inspected and certified by quality inspectors are more likely to be chosen by developers and buyers. Typically, structural work is considered an essential aspect of quality assessment because structural problems directly influence the stability and integrity of a building. Among construction structural forms, concrete structures are the most common and basic. Therefore, exploring advanced technologies that enable effective concrete defect inspection can be deemed a worthwhile endeavor.

Normally, the types of concrete defects include blistering, delamination, dusting, etc. Among them, concrete cracks, usually caused by deformation, shrinkage, swelling, or hydraulic pressure, appear most frequently in concrete components. Concrete cracking is considered the first sign of deterioration. As reported by the BRE Group [1], cracks up to 5 mm in width only affect the appearance of the concrete and simply need to be re-decorated. However, cracks with a width of 5–25 mm can trigger structural damage to concrete structures [2]. A 40-year-old oceanfront condominium building collapsed on June 24, 2021, in Surfside, Florida, partly because of neglected cracks. Experienced engineers noted that cracked or crumbling concrete, interior cracks, and cracks at the corners of windows and doors were among the significant and earliest signs of this tragedy. Therefore, in order to prevent potential failures that may pose a loss to society, crack problems should be thoroughly examined and resolved.

In general, construction works are divided into two categories: new building works and existing building works. New building works refer to buildings constructed from scratch, whereas existing building works concern buildings that have stood for many years and are occupied by residents. In Hong Kong, quality assurance and control should be conducted by full-time quality managers on-site for both new and existing buildings. Normally, quality managers visually inspect build quality and assign a quality score in accordance with the Building Performance Assessment Scoring System (PASS) for new buildings, and the Mandatory Building Inspection Scheme (MBIS) and the Mandatory Window Inspection Scheme (MWIS) for existing buildings. Meanwhile, to ensure continuous and in-depth inspection, non-destructive testing (NDT) methods, e.g., eddy current testing and ultrasonic testing, are also commonly applied in the quality inspection process.

Quality managers are commonly obliged to work 8 hours per day, and their salary ranges from HKD 30,000 to HKD 50,000 per month. In PASS, more than 300 quality assessment items are related to cracking problems. Cracks in all building components, including floors, internal and external walls, ceilings, and others, are required to be strictly inspected during both the structural and architectural engineering stages. Therefore, both manual and NDT inspections are considered time-consuming, costly, and dangerous, especially for large-scale and high-rise structures. To tackle this issue, computer vision techniques are increasingly introduced for automated crack inspection. For example, various convolutional neural network (CNN) architectures have been developed and implemented to increase the efficiency of manual crack inspection [3, 4].

Considering the aforementioned context, computer-vision-based automated crack inspection techniques are introduced by the authors in this chapter. To achieve this, the theoretical background of CNNs is first explained in the context of the convolution, pooling, fully connected, and benchmarking processes. AlexNet and VGG16 models are then implemented and tested to detail and illustrate the calculation steps. Meanwhile, a practical case study is used to compare manual and computer-vision-based crack inspection. The future directions of combining robotics and computer vision for automated crack inspection are also discussed. This study provides a comprehensive overview of, and a solid foundation for, computer-vision-based automated crack inspection techniques that contribute to highly efficient, cost-effective, and low-risk quality assessment of buildings.


#### **2. Computer vision-based automated concrete crack inspection**

The term *computer vision* is defined as an interdisciplinary field that enables computers to recognize and interpret environments from digital images or videos [5]. Computer vision techniques are increasingly used to detect, locate, and quantify concrete defects and thus reduce the limitations of manual visual inspection. By automatically processing images and videos, computer-vision-based defect detection technologies enable efficient, accurate, and low-cost concrete quality inspection. Various techniques in the computer vision field, such as semantic segmentation and object detection, have been developed and applied to date [6]. Among them, image classification is considered the most basic computer vision technique and has been introduced most frequently to predict and target concrete defects.

The motivation of image classification is to identify the categories of input images. Different from human recognition, an image is first presented to a computer as a three-dimensional array of numbers. The value of each number ranges from 0 (black) to 255 (white). An example is shown in **Figure 1**. The crack image is 256 pixels wide, 256 pixels tall, and has three color channels (RGB: red, green, and blue). Therefore, this image generates 256 × 256 × 3 = 196,608 input numbers.

**Figure 1.** *An example of the input number array.*
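As an illustration, a minimal Python sketch of this representation is given below; it assumes the NumPy and Pillow libraries and a hypothetical local image file named `crack.jpg`.

```python
# A minimal sketch: reading an image as a three-dimensional number array.
# "crack.jpg" is a hypothetical placeholder file name.
import numpy as np
from PIL import Image

image = Image.open("crack.jpg").convert("RGB")  # force three color channels
array = np.asarray(image)                       # shape: (height, width, 3)

print(array.shape)               # e.g., (256, 256, 3) for the Figure 1 example
print(array.size)                # 256 * 256 * 3 = 196,608 input numbers
print(array.min(), array.max())  # every value lies in the range 0-255
```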

The input array is then computed using computer vision algorithms to transform the numbers into a specific label that belongs to an assigned set of categories. One such algorithm is the CNN, which has become dominant in image classification tasks [7]. A CNN is a deep learning model for processing grid-shaped data. The central idea of a CNN is to identify the image classification by capturing image features using filters. The features are then mapped to a specific classification by a trained weight and bias matrix.

There are three main modules in a CNN model: the convolution, pooling, and fully connected layers. The convolution and pooling layers are used to extract image features. The fully connected layer is used to determine the weight and bias matrix and to map the extracted features to specific labels.

The convolution layer is the first processing block in a CNN. During the convolution process, a set of convolution filters is used to compute the input array $A = \left(a_{ij}\right)_{m \times n}$, where $m$ and $n$ are the width and height of the image. After computing, a new image $A^* = \left(a^*_{ij}\right)_{n \times n}$ is output and passed to the next processing layers. The size of the output image can be calculated with Eq. (1), and the values of the output image pixels can be calculated with Eq. (2). The output images are known as convolution feature maps.

$$n = ((m - f + 2p)/s) + 1\tag{1}$$

Here: $n$ refers to the size of the output image, $m$ refers to the size of the input image, $f$ refers to the size of the convolution filter, $p$ refers to the size of the zero-padding, and $s$ refers to the stride of the convolution filter.

$$A_o^* = f\left(\sum_k W_o \times A_o + b_o\right) \tag{2}$$

Here: $A^*_o$ refers to the pixels of the output image, $f$ refers to an applied non-linear function, $W_o$ refers to the values of the convolution filter matrix, $k$ refers to the number of convolution filters, $A_o$ refers to the pixels of the input image, and $b_o$ is the bias, an arbitrary real number.

An example of a convolution process is shown in **Figure 2**. In this example, both the width and height of the input image are 5, and the pixels of the image are shown in **Figure 2**. The convolution filter has a shape of 3 × 3, and only one filter is used. The initial values of the convolution filter are set randomly; the filter matrix is adjusted and optimized in the subsequent backpropagation process. In this example, no non-linear function or padding is used, the bias value $b_o$ is set to 0, and the stride of the convolution filter is set to 1. The convolution filter moves from left to right and from top to bottom. The size and values of the output feature map can be computed using Eqs. (1) and (2): with an input image size of 5, a filter size of 3, a padding of 0, and a stride of 1, Eq. (1) gives an output size of $n = ((5 - 3 + 0)/1) + 1 = 3$. The detailed calculation process of the example feature map values and size is shown in **Table 1**.
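A minimal Python sketch of this convolution procedure is given below, under the example's assumptions (stride 1, no padding, no non-linear function, bias set to 0). The input pixels and filter values are illustrative placeholders, not the values shown in **Figure 2**.

```python
# A sketch of Eqs. (1) and (2) with stride s = 1, padding p = 0, bias b_o = 0,
# and no non-linear function. Input and filter values are illustrative only.
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    m, f = image.shape[0], kernel.shape[0]
    n = (m - f) // stride + 1      # Eq. (1) with p = 0: n = ((m - f + 2p)/s) + 1
    out = np.zeros((n, n))
    for i in range(n):             # the filter moves top to bottom ...
        for j in range(n):         # ... and left to right
            region = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(region * kernel)  # Eq. (2): element-wise product, summed
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a 5 x 5 input, as in the example
kernel = np.array([[1.0, 0.0, -1.0]] * 3)         # one 3 x 3 filter (random in practice)
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3): n = ((5 - 3 + 0)/1) + 1 = 3
```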

**Figure 2.** *An example of the convolution process.*

A pooling layer is used to refine the feature maps. After pooling, the dimensions of the feature maps are simplified. In doing so, the computation cost can be effectively decreased by reducing the number of learning parameters, whilst allowing only the essential information of the feature maps to be retained. Usually, pooling layers follow convolution layers. Average pooling and maximum pooling are the main pooling operations. Similar to convolution layers, pooling filters are used to refine feature maps. For maximum pooling, the maximum value of each region in the feature map covered by the pooling filter is extracted. For average pooling, the average value of each covered region is computed. The pooling filter slides over the feature map from top to bottom and from left to right. The output of the pooling process is new feature maps that contain the most prominent features or average features. An example of maximum pooling and average pooling is shown in **Figure 3**.

**Figure 3.** *An example of max pooling and average pooling.*
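A minimal Python sketch of both pooling operations is given below, assuming a 2 × 2 pooling filter with a stride of 2; the feature-map values are illustrative, not those of **Figure 3**.

```python
# A sketch of max and average pooling over a feature map.
import numpy as np

def pool2d(fm: np.ndarray, size: int = 2, stride: int = 2, mode: str = "max") -> np.ndarray:
    n = (fm.shape[0] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((n, n))
    for i in range(n):             # the pooling filter slides top to bottom ...
        for j in range(n):         # ... and left to right
            region = fm[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = reduce_fn(region)  # keep the max (or mean) of each region
    return out

fm = np.array([[1.0, 3.0, 2.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [3.0, 2.0, 1.0, 0.0],
               [1.0, 2.0, 3.0, 4.0]])
print(pool2d(fm, mode="max"))      # [[6. 8.] [3. 4.]]    - most prominent features
print(pool2d(fm, mode="average"))  # [[3.75 5.25] [2. 2.]] - average features
```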

After extracting image features, the fully connected layers are applied to map these features to classification labels. The relationship between input feature maps and output classifications is calculated using an artificial neural network (ANN). An ANN is structured into an input layer, hidden layers, and an output layer, each containing a group of neurons. The neurons connect to one another through a weight matrix, in which the weights represent the importance of the input feature maps to the classification labels. Therefore, the relationships between inputs and outputs can be obtained by calculating a weight matrix that connects image feature neurons and classification neurons.

To achieve this, the cube-shaped feature maps are first flattened into one-dimensional vectors. The values of the transformed vectors represent the values of the input neurons. Eq. (3) is then applied to calculate the values of the new neurons that connect with the input neurons. The initial weights and biases are chosen at random.

$$y_j(\mathbf{x}) = f\left(\sum_{i=1}^{n} w_{ji}\, x_i + b\right) \tag{3}$$

Here: $y_j$ refers to the value of output neuron $j$, $w_{ji}$ refers to the weight connecting input neuron $i$ to output neuron $j$, $x_i$ refers to the values of the input neurons, and $b$ refers to the bias.
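As an illustration, the following Python sketch evaluates Eq. (3) for a small layer. The flattened feature vector, weight matrix, biases, and the choice of ReLU as the non-linear function $f$ are all assumed values for illustration.

```python
# A sketch of Eq. (3): each output neuron is a weighted sum of the input
# neurons plus a bias, passed through a non-linear function f (here, ReLU).
import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

x = np.array([0.5, 1.0, 2.0])   # flattened feature map (input neurons)
W = np.array([[0.1, 0.2, 0.3],  # W[j, i] connects input neuron i to output neuron j
              [0.4, 0.5, 0.6]])
b = np.array([0.1, -0.2])       # biases (initially chosen at random)

y = relu(W @ x + b)             # Eq. (3) for all output neurons at once
print(y)                        # [0.95 1.7]
```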

A back-propagation (BP) algorithm is commonly used to train and modify the weights and biases. BP updates the weights and biases by computing the gradient of a loss function. In doing so, the optimal weight and bias matrices that minimize the loss between the model outputs and the actual values are identified. To date, various loss functions have been developed and applied. For example, the mean square error (MSE), shown in Eq. (4), is one of the most frequently used loss functions. Stochastic gradient descent (SGD) is then processed to determine the updated weights and biases using the gradient of the loss function, as shown in Eq. (5).

$$Loss = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{4}$$

Here: $Loss$ refers to the loss between the output neurons and the actual values, $n$ refers to the number of output neurons, $y_i$ refers to the actual value, and $\hat{y}_i$ refers to the value of one output neuron.
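A minimal Python sketch of Eq. (4), with illustrative actual and predicted values, is given below.

```python
# A sketch of the MSE loss in Eq. (4).
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))  # Eq. (4)

y_true = np.array([0.24, 1.00])  # actual values
y_pred = np.array([0.30, 0.90])  # output neuron values
print(mse(y_true, y_pred))       # (0.06^2 + 0.10^2) / 2 = 0.0068
```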


$$w' = w - \eta \frac{\partial L}{\partial w}, \qquad b' = b - \eta \frac{\partial L}{\partial b} \tag{5}$$

Here: $w'$, $b'$ refer to the updated weights and biases, $w$, $b$ refer to the former weights and biases, $\eta$ refers to the learning rate, and $\frac{\partial L}{\partial w}$, $\frac{\partial L}{\partial b}$ refer to the partial derivatives of the loss function with respect to the weights and biases, respectively.

An example of weight and bias updating using BP is explained next. **Figure 4** depicts an example of a fully connected process. The initial weights and biases in this process are determined randomly. Suppose the values of w11, w12, w21, w22, w5, and w6 are 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6, respectively, and the values of x1, x2, and the actual output are 5, 1, and 0.24, respectively. The detailed calculation of the updated weights and biases is shown in **Table 2**.

**Figure 4.** *An example of a fully connected process.*
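The following Python sketch performs one forward pass and one BP update on these example values. The wiring of the weights, the absence of biases and non-linear activations, and the learning rate of 0.01 are assumptions made for illustration, as the chapter does not fix them.

```python
# A numerical sketch of one BP update on the Figure 4 example, using Eqs. (4)
# and (5). Assumed: linear activations, no biases, h1 = w11*x1 + w12*x2,
# h2 = w21*x1 + w22*x2, and learning rate eta = 0.01.
w11, w12, w21, w22, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
x1, x2, target = 5.0, 1.0, 0.24
eta = 0.01

# Forward pass through the fully connected layers.
h1 = w11 * x1 + w12 * x2   # 0.7
h2 = w21 * x1 + w22 * x2   # 1.9
y = w5 * h1 + w6 * h2      # 1.49
loss = (y - target) ** 2   # Eq. (4) with one output: 1.5625

# Backward pass: gradients of the loss via the chain rule.
dy = 2.0 * (y - target)    # dL/dy = 2.5
dw5, dw6 = dy * h1, dy * h2   # gradients for the output-layer weights
dh1, dh2 = dy * w5, dy * w6   # gradients flowing back to the hidden neurons
dw11, dw12 = dh1 * x1, dh1 * x2
dw21, dw22 = dh2 * x1, dh2 * x2

# Eq. (5): w' = w - eta * dL/dw.
w5 -= eta * dw5            # 0.5 - 0.01 * 1.75 = 0.4825
w6 -= eta * dw6            # 0.6 - 0.01 * 4.75 = 0.5525
w11 -= eta * dw11          # 0.1 - 0.01 * 6.25 = 0.0375
w12 -= eta * dw12          # 0.2 - 0.01 * 1.25 = 0.1875
w21 -= eta * dw21          # 0.3 - 0.01 * 7.5  = 0.225
w22 -= eta * dw22          # 0.4 - 0.01 * 1.5  = 0.385
print(round(loss, 4), round(w5, 4), round(w6, 4))  # 1.5625 0.4825 0.5525
```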

In conclusion, during the convolution and pooling processes in a CNN, the features of the input image are extracted first. The pooled feature maps are then flattened and treated as input neurons in the fully connected process. After several training epochs, appropriate weights and biases can be determined using BP. The classifications of input images can then be predicted automatically and reliably using the optimal weights and biases.

A confusion matrix is a table structure that permits the viewing of CNN performance [8]. Each row of the matrix records the number of images from actual classes, while each column records the number of images from predicted classes. There are four types of indicators in the matrix: (1) true positive (TP) represents images correctly predicted as the positive class; (2) false positive (FP) represents images wrongly predicted as the positive class; (3) true negative (TN) represents images correctly predicted as the negative class; (4) false negative (FN) represents images wrongly predicted as the negative class. TP, FP, TN, and FN can be expressed in a 2 × 2 confusion matrix, shown in **Figure 5**.

Based on TP, FP, FN, and TN, four typical CNN performance evaluation indexes: accuracy, precision, recall, and F1-score, can be calculated using Eqs. (6)–(9). For the crack inspection problem, accuracy shows how many images are predicted correctly. Precision shows the percentage of actual cracked images among all images predicted as cracked; a CNN with a high precision score indicates a better inspection ability for cracked images. Recall shows the ratio of correctly predicted cracked images to all actual cracked images; a CNN with a high recall score indicates a better capacity to distinguish cracked from uncracked images. F1-score shows the comprehensive performance of precision and recall; a CNN with a high F1-score indicates stronger robustness.

**Figure 5.** *An example of a 2 × 2 confusion matrix.*

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\tag{6}$$

$$Precision = \frac{TP}{TP + FP} \times 100\tag{7}$$

$$Recall = \frac{TP}{TP + FN} \times 100\tag{8}$$

$$F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{9}$$

For example, suppose a prepared dataset contains 10,000 photos, with 3000 cracked surface images and 7000 uncracked surface images. After CNN processing, 2700 images are correctly predicted as cracked surfaces, and 300 images out of the 3000 real cracked surfaces are wrongly predicted as uncracked surfaces; 6500 images are correctly predicted as uncracked surfaces, and 500 images out of the 7000 uncracked surfaces are wrongly predicted as cracked surfaces. Then, based on the above-mentioned concepts, the values of TP, FN, FP, and TN are 2700, 300, 500, and 6500, respectively. **Table 3** shows the details of the accuracy, precision, recall, and F1-score calculations.
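These calculations can be verified with a short Python sketch that applies Eqs. (6)–(9) to the example counts.

```python
# Applying Eqs. (6)-(9) to the worked example: TP = 2700, FN = 300,
# FP = 500, TN = 6500.
tp, fn, fp, tn = 2700, 300, 500, 6500

accuracy = (tp + tn) / (tp + tn + fp + fn) * 100    # Eq. (6): 92.0%
precision = tp / (tp + fp) * 100                    # Eq. (7): 84.4%
recall = tp / (tp + fn) * 100                       # Eq. (8): 90.0%
f1 = 2 * precision * recall / (precision + recall)  # Eq. (9): 87.1%

print(f"accuracy={accuracy:.1f}%  precision={precision:.1f}%  "
      f"recall={recall:.1f}%  f1={f1:.1f}%")
```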


**Table 3.** *Detailed calculation process of accuracy, precision, recall, and F1-score.*
