**3. Case study**

#### **3.1 Case study on COVID-19 classification**

In the context of this chapter, chest X-ray images were selected for both model development and validation. This choice was driven by the cost-effectiveness and greater accessibility of chest X-rays compared to CT scans, particularly in communities with limited medical resources. Additionally, chest X-rays offer a swift imaging solution, making them particularly attractive for large-scale patient screening during the COVID-19 pandemic. Therefore, utilizing chest radiography for screening COVID-19 patients is considered a practical, efficient, and rapid approach [54, 55]. For validation purposes, this chapter utilized the comprehensive chest X-ray dataset "*COVIDx*" [23], which comprises 18,543 chest radiography images from 13,725 unique cases. It is important to note that the class distributions of the training and testing datasets differ substantially.

This chapter employs two types of deep learning models: Convolutional Neural Networks (CNNs) and ResNet. CNNs have played a significant role in advancing various visual processing tasks like image classification [56], object detection and tracking [57, 58], and semantic segmentation [59]. The progress of CNNs has been facilitated by large datasets like ImageNet [56] and YouTube-BoundingBoxes [60], which provide ample training data for building large-scale models. The general architecture of a CNN for image classification is depicted in **Figure 6**. State-of-the-art CNN architectures such as AlexNet [56], VGG [50], and GoogLeNet [61] have propelled advancements in image classification. These architectures leverage millions of annotated samples from large datasets to successfully estimate appropriate parameters. Furthermore, CNNs have been enhanced by combining them with other deep learning models. For example, Wang et al. [62] combined a CNN with Recurrent Neural Networks (RNNs) for multi-label image classification. Additionally, CNNs combined with autoencoders [63, 64] have demonstrated effectiveness in tasks like face detection. In this chapter, three small CNNs were trained and tested on a subset of the COVIDx dataset as a proof-of-concept experiment.
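To make the pipeline in **Figure 6** concrete, the following NumPy fragment (a didactic sketch, not one of the three CNNs trained in this chapter) runs a single convolution, ReLU, and max-pooling stage; note how a 3 × 3 kernel on a 32 × 32 input yields a 30 × 30 feature map:

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2D convolution of a single-channel image, the core
    feature-extraction operation in a CNN."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, which downsamples the feature map."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.random.default_rng(0).normal(size=(32, 32))
feat = np.maximum(conv2d(img, np.ones((3, 3)) / 9.0), 0.0)  # conv + ReLU
pooled = max_pool(feat)  # 30x30 feature map pooled down to 15x15
```

A real classifier would stack several such stages and end with fully connected layers and a softmax, as sketched in **Figure 6**.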

**Figure 6.** *Convolutional neural network architecture for image classification.*

*Effective Screening and Face Mask Detection for COVID Spread Mitigation Using Deep… DOI: http://dx.doi.org/10.5772/intechopen.113176*

**Figure 7.** *General architecture of the residual block of ResNet [51].*

ResNet [51] is an artificial neural network architecture inspired by the structure of pyramidal cells in the cerebral cortex. It introduces skip connections, or shortcuts, which allow the network to bypass certain layers. The concept behind ResNet is that training a network to learn a residual mapping is simpler than training it to directly learn the underlying mapping. This is achieved using residual blocks, as depicted in **Figure 7**. A crucial modification in ResNet compared to a standard CNN is the "skip connection" for identity mapping. This identity mapping has no parameters; it simply adds the output of the previous layer to the next layer. However, the dimensions of *x* and *F(x)* might differ. Since convolutions usually reduce spatial resolution, e.g., a 3 × 3 convolution on a 32 × 32 image results in a 30 × 30 image, the identity mapping is expanded using a linear projection *W* to match the channels of the residual. This allows the input *x* and *F(x)* to be combined as input to the subsequent layer. Given the effectiveness of ResNet, various ResNet architectures will be employed in this chapter to screen COVID-19 using the complete COVIDx dataset.
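The projection shortcut just described can be illustrated with a toy fully connected residual block. The NumPy sketch below is only an analogue (real residual blocks use convolutions and batch normalization); `w_proj` plays the role of the linear projection *W* that aligns the shortcut with *F(x)* when their dimensions differ:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w_f, w_proj=None):
    """Toy fully connected residual block: y = ReLU(F(x) + shortcut(x)).

    w_f    : weights of the residual branch F
    w_proj : optional linear projection W, used when the dimensions of
             x and F(x) differ (otherwise the shortcut is the identity)
    """
    fx = relu(w_f @ x)  # residual mapping F(x)
    shortcut = x if w_proj is None else w_proj @ x
    return relu(fx + shortcut)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w_f = rng.normal(size=(8, 4))     # F changes the dimension: 4 -> 8
w_proj = rng.normal(size=(8, 4))  # projection aligns the shortcut
y = residual_block(x, w_f, w_proj)
```

When the dimensions already match, `w_proj` is omitted and the shortcut is the parameter-free identity mapping described above.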

This chapter examines the proposed models from two distinct viewpoints. The first perspective involves assessing their capability to effectively identify COVID-19 cases from a limited dataset, using compact CNNs. The second perspective entails investigating whether these models can utilize the ResNet architecture to identify COVID-19 cases within an extensive dataset, all without relying on transfer learning techniques. For the small dataset scenario, a subset of 350 images was extracted from the original COVIDx dataset [23] for training and testing. Among these, 300 images were allocated for training, while the remaining 50 were reserved for testing. Three small CNNs were employed for this evaluation. The obtained training and testing accuracies for these three models are depicted in **Figure 8**. The results indicate that the initial shallow CNN (referred to as CNN1) encountered issues with under-fitting. However, as additional layers were incorporated to extract more intricate features, CNN3 exhibited superior accuracy performance.

In the context of a large dataset, all images encompassed within the COVIDx dataset were harnessed to establish a classifier for COVID-19 identification. The initial step encompassed data preprocessing, involving the compression of images. The

**Figure 8.** *Performance on training and testing accuracy for three small CNNs.*

original X-ray images within the dataset measured 1024 × 1024 × 3. To expedite the training process, the images were compressed to 64 × 64 × 3. Training was then carried out on this preprocessed dataset. The outcomes of the training and testing phases are depicted in **Figure 9**.
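The compression step can be approximated with simple block averaging; the sketch below is an illustrative stand-in for whatever resizing routine was actually used, reducing a 1024 × 1024 × 3 array to 64 × 64 × 3:

```python
import numpy as np

def downsample(img, factor):
    """Compress an image by averaging non-overlapping factor x factor blocks,
    e.g. 1024x1024x3 -> 64x64x3 with factor=16."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

xray = np.random.default_rng(0).random((1024, 1024, 3))  # synthetic stand-in image
small = downsample(xray, 16)  # 1024 / 16 = 64
```

Block averaging preserves the global mean intensity of the image, which makes it a reasonable cheap surrogate for bilinear resizing in this kind of preprocessing.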

The findings unveil the existence of an optimal configuration, specifically ResNet-34, which outperforms the alternative models. Beyond ResNet-34, a consistent decline in performance becomes apparent as the number of ResNet layers increases toward 152. This can be attributed to overfitting: as layers are progressively added, the available data becomes inadequate to effectively train the model. It is noteworthy that although the training time extended slightly with the inclusion of additional layers in the ResNet models, the performance gains were limited. Furthermore, the results underscore that this chapter achieved commendable performance by training ResNet models from scratch, without relying on transfer learning techniques.

While supervised deep learning demonstrates impressive performance in classifying COVID-19 images, its practical application is hindered by the need for a substantial volume of annotated medical images for training. Given the limitations in available COVID-19-related data resources and the significant costs associated with labeling medical images, this approach becomes less feasible, further exacerbated by labeling inaccuracies that may arise [65]. To address this challenge, the focus has shifted toward semi-supervised deep learning, which has garnered considerable attention due to its capacity to enhance model generalization by leveraging both labeled and unlabeled data [66–69]. This paradigm involves training deep neural networks through the simultaneous optimization of a standard supervised classification loss on labeled samples and an unsupervised loss on unlabeled data [67, 69]. Semi-supervised learning models aim to amplify the information derived from unlabeled data [70] or impose regularization on the network to enforce smoother and more consistent classification boundaries [68].

In the realm of COVID-19 research, particularly in tasks like COVID-19 image classification and image segmentation, semi-supervised learning has emerged as a solution to mitigate the scarcity of labeled data [71–76]. However, within the domain of COVID-19 image classification, the studies conducted by Zhou et al. [76], Calderon et al. [72], and Paticchio et al. [74] have not thoroughly examined model performance

**Figure 9.** *Performance comparison on training and testing.*


using large-scale X-ray image datasets such as *COVIDx* [23]. Furthermore, they have not conducted comprehensive comparisons against state-of-the-art methods, particularly in scenarios where labeled data is severely limited, constituting less than 10% of the dataset. In response, this chapter introduces a semi-supervised deep learning model for COVID-19 image classification, systematically evaluating its performance on the *COVIDx* dataset [23].

Using the ResNet architecture, this chapter devised a two-path semi-supervised learning ResNet (referred to as SSResNet), which comprises three key components: a shared ResNet, a supervised ResNet, and an unsupervised ResNet. The two paths are formed by coupling the shared ResNet with either the supervised ResNet or the unsupervised ResNet. Both labeled and unlabeled data are leveraged in computing the unsupervised loss, which uses the mean squared error loss (MSEL); only labeled data contributes to the supervised loss, which uses the cross-entropy loss (CEL). To counterbalance data imbalance, a weighted cross-entropy loss (WCEL) was designed, assigning greater weight to the COVID-19 class. Minimizing MSEL enhances image representation, while minimizing WCEL boosts classification performance. For a comprehensive outline of the methodology, refer to Algorithm 1. The efficacy of the proposed model was thoroughly assessed using the extensive X-ray image dataset *COVIDx*. The experimental findings establish that the proposed model excels at COVID-19 image classification; notably, even when trained with a very limited quantity of labeled X-ray images, the model shows remarkable performance.

#### **Algorithm 1**. Learning of Semi-supervised ResNet (SSResNet).

*Require:* training sample *xi*, the set of training samples *S*, label *yi* for *xi* (*i*∈ *S*)

1. **for** *t* in [1, num epochs] **do**

$$\text{2.} \qquad B \subset S \rhd \text{sample a minibatch}$$

$$\text{3.} \qquad z_{i \in B} \gets f_{\theta_{shared}}(x_{i \in B}) \rhd \text{shared representation}$$

$$\text{4.} \qquad z_{i \in B}^{sup} \gets f_{\theta_{sup}}(z_{i \in B}) \rhd \text{supervised representation}$$

$$\text{5.} \qquad z_{i \in B}^{unsup} \gets f_{\theta_{unsup}}(z_{i \in B}) \rhd \text{unsupervised representation}$$

$$\text{6.} \qquad l_{i \in B}^{\text{WCEL}} \gets -\frac{1}{|B|} \sum_{i \in B} \omega_{y_i} \log \phi\left(z_i^{sup}\right)\left[y_i\right] \rhd \text{supervised loss component}$$

$$\text{7.} \qquad l_{i \in B}^{\text{MSEL}} \gets \frac{1}{C|B|} \sum_{i \in B} \left\|z_i^{sup} - z_i^{unsup}\right\|^2 \rhd \text{unsupervised loss component}$$

$$\text{8.} \qquad \text{Loss} \gets l_{i \in B}^{\text{WCEL}} + \lambda \times l_{i \in B}^{\text{MSEL}} \rhd \text{total loss}$$

9. update *θshared*, *θsup*, *θunsup* using an optimizer, e.g., Adam

**return** *θshared*, *θsup*, *θunsup*
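The loss arithmetic of Algorithm 1 (lines 6–8) can be sketched numerically as follows. This NumPy fragment illustrates only the loss computation, not the full training loop; the logits, class weights, and λ below are made-up toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def wcel(z_sup, labels, weights):
    """Weighted cross-entropy (Algorithm 1, line 6) over labeled samples."""
    p = softmax(z_sup)
    return -np.mean(weights[labels] * np.log(p[np.arange(len(labels)), labels]))

def msel(z_sup, z_unsup):
    """Mean squared error between the two paths (Algorithm 1, line 7),
    normalized by the number of classes C and the batch size."""
    c = z_sup.shape[1]
    return np.sum((z_sup - z_unsup) ** 2) / (c * len(z_sup))

# toy batch of two labeled samples over 3 classes; class 0 (COVID-19) up-weighted
z_sup = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
z_unsup = z_sup + 0.1          # output of the unsupervised path
labels = np.array([0, 1])
weights = np.array([4.0, 1.0, 1.0])  # hypothetical class weights
lam = 1.0
total = wcel(z_sup, labels, weights) + lam * msel(z_sup, z_unsup)  # line 8
```

In the real model, `wcel` is evaluated only on labeled samples while `msel` is evaluated on both labeled and unlabeled samples, exactly as stated in the text.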

Upon analyzing the class distribution disparities between our training and testing datasets, a noteworthy distinction came to light. Consequently, we undertook the task of reconstructing the data structure. This involved partitioning the dataset into distinct training and testing subsets, ensuring that their class distributions closely aligned. Specifically, 70% of the data was allocated for the training dataset, while the remaining 30% constituted the testing dataset. For a comprehensive breakdown of the reconstructed dataset, including sample distribution details, please refer to **Table 2**.
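A stratified split of the kind described can be sketched in a few lines of plain Python; the class names and counts below are illustrative, not the actual COVIDx figures in **Table 2**:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split sample indices so each class keeps ~train_frac of its samples
    in the training set, aligning the class distributions of the two splits."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * train_frac))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test

# toy imbalanced label list
labels = ["covid"] * 10 + ["normal"] * 100 + ["pneumonia"] * 40
train, test = stratified_split(labels)
```

Because the split is performed per class, the minority class keeps the same 70/30 ratio as the majority classes, which is the property the reconstruction above was designed to guarantee.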

It is evident that the distribution of samples in our dataset is highly skewed, particularly for the COVID-19 class. This imbalance presents a significant hurdle to achieving a high-performing classifier. To surmount this issue, the proposed model employs a weighted cross-entropy loss, assigning greater weight to the minority class (COVID-19) throughout training. For a comprehensive description of this technique, please refer to Section 2.

In the experiment, the key hyperparameters for training the proposed model are: minibatch size 256, number of epochs 50, Adam optimizer, and an initial learning rate of 0.1. They were determined by trial and error. The details of the model architecture are illustrated in **Table 3**. We employ COVID-Net<sup>1</sup> as a baseline supervised model, representing the state-of-the-art performance in COVID-19 image classification, for comparison. Furthermore, we compared the proposed model with SRC-MT, a state-of-the-art semi-supervised learning method, since it outperformed the Π model and the mean teacher model in medical image classification.

**Table 4** showcases a comprehensive comparison of the classification performance of SRC-MT and the newly introduced model (SSResNet). On the whole, the overall accuracies attained by SRC-MT surpass those achieved by the proposed SSResNet. Nonetheless, an interesting observation emerges when merely 5% of labeled samples were utilized for training: in this scenario, the MacroF metric of SSResNet surpasses that of SRC-MT, indicating that the proposed model is more effective at identifying COVID-19 samples. Essentially, the unsupervised path in SSResNet appears to markedly enhance data representation, leading to a more pronounced improvement in COVID-19 classification performance compared to SRC-MT.

Furthermore, this study examined the performance for each class, elucidating the outcomes through confusion matrices, as depicted in **Figure 10**. A noteworthy observation emerges from this analysis: the proposed model outperforms SRC-MT in recognizing COVID-19 cases. This underscores the capability of SSResNet to glean more effective features from unlabeled data, resulting in a heightened ability to discern COVID-19 samples. Notably, as the ratio of labeled data is incrementally increased, there is a marked improvement in COVID-19 recognition accuracy. This supports the proposition that the unsupervised path elevates image representations and consequently bolsters classification performance. In essence, the integration of unlabeled data contributes substantially to COVID-19 classification performance, primarily through the improved image representations produced by the unsupervised path of the SSResNet.
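The per-class analysis behind **Figure 10** rests on confusion matrices, which can be computed as follows; this is a generic sketch with toy labels, unrelated to the actual SSResNet predictions:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """m[i][j] counts samples whose true class is i and predicted class is j."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# toy example: class 0 = COVID-19, 1 = normal, 2 = pneumonia
cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
# per-class recall is the diagonal entry divided by the row sum
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
```

Reading the COVID-19 row of such a matrix directly shows how many COVID-19 samples each model recognizes, which is the comparison drawn from **Figure 10**.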

<sup>1</sup> https://github.com/lindawangg/COVID-Net



#### **Table 2.**

*Sample distribution in different classes for training and testing datasets.*


#### **Table 3.**

*The proposed network architecture.*


**Table 4.**

*Comparing performance between SRC-MT and proposed model (semi-supervised ResNet (SSResNet)).*

#### **3.2 Face mask detection**

This chapter also introduces the process of constructing a mobile model designed for face mask detection on edge devices. The primary goal of this endeavor is to develop an intelligent Internet of Things (IoT) device capable of performing real-time video processing to ascertain whether an individual is wearing a face mask in public settings. This technology finds practical application in scenarios such as enforcing mask-wearing within facilities or buildings. In this context, a smart IoT camera functions as a vigilant observer, detecting instances where individuals are not complying with the mask mandate and triggering alarms as necessary. To achieve this, mobile devices like FPGAs, integrated into the camera system, execute the face mask

**Figure 10.**

*Comparison of confusion matrix generated with SRC-MT and SSResNets trained on different ratios of labeled data.*

detection models. For a visual representation of this concept, refer to **Figure 11**, which presents an illustrative diagram depicting the potential of employing face mask detection to control access to a door.

Instead of constructing models from scratch, the approach taken leverages pre-trained models that have been trained on more extensive datasets across broader classes. This pre-trained ensemble comprises four distinct convolutional neural networks (CNNs): MobileNet V2, Inception V3, VGG 16, and ResNet 50. Transfer learning from such models, as opposed to training from scratch, often leads to superior performance. These pre-trained models were trained on the comprehensive ImageNet dataset [56]. The salient features extracted by these pre-trained models are conveyed to a new classifier positioned at the terminal end of the network, and a mask detection classifier is then trained on top of the pre-trained model. A notable distinction among these models is their input size: Inception V3 adopts an image size of 299 × 299 × 3, while the other three models employ 224 × 224 × 3. The efficacy of these models is rigorously verified through inference runs on mobile devices, namely the NVIDIA Jetson TX2 and NVIDIA Jetson Nano platforms.
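The transfer-learning recipe, frozen backbone features feeding a small trainable classifier, can be illustrated framework-agnostically. In the NumPy sketch below, a fixed random projection stands in for the real ImageNet-pretrained backbone (MobileNet V2, Inception V3, VGG 16, or ResNet 50 in the chapter), and only a logistic head is trained, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the frozen, pre-trained backbone: a fixed random projection
# followed by ReLU. It is never updated during training.
W_backbone = rng.normal(size=(64, 16))

X = rng.normal(size=(200, 64))        # toy "images"
F = np.maximum(X @ W_backbone, 0.0)   # frozen features, computed once
y = (F[:, 0] > 1.0).astype(float)     # synthetic, linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train only the binary head (mask / no-mask) with gradient descent on
# binary cross-entropy, mirroring "a new classifier at the terminal end".
w, b, lr = np.zeros(16), 0.0, 0.1
for _ in range(300):
    p = sigmoid(F @ w + b)
    grad = p - y                      # d(BCE)/d(logit)
    w -= lr * (F.T @ grad) / len(X)
    b -= lr * grad.mean()

acc = np.mean((sigmoid(F @ w + b) > 0.5) == y)
```

Because the backbone is frozen, the features are computed once and only the small head is optimized, which is what makes transfer learning cheap on small face mask datasets.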

Dataset and experiment setup are presented below.

• Dataset: We utilized a publicly available dataset<sup>2</sup>, which consists of 3890 images categorized into two classes: with mask and without mask. After removing redundant entries, 1916 images featured individuals wearing masks, while 1930 images depicted individuals without masks. This dataset was deliberately structured to maintain a balanced distribution for our classifier. However, it does present some exceptional cases, such as images containing multiple faces or instances where faces are partially occluded by other body parts.

<sup>2</sup> https://github.com/chandrikadeb7/Face-Mask-Detection


#### **Figure 11.**

*A diagram of a face mask alarm system. A face mask alarm mounted on the door consists of a camera, a face mask detector, and a converter. The camera sends the person's image to the face mask detector, which consists of mobile GPUs and artificial neural networks. The detector then checks for a face mask in the image by running the neural networks on the mobile GPUs and sends back the detection result. Finally, the converter plays a spoken reminder such as "Face Mask Required" if the detection result indicates there is no face mask on the face.*

• Experiment setup: A learning rate of 0.0001 was employed, while the experiment consistently utilized a batch size of 10. This modest batch size was deliberately selected to facilitate extended training while operating within memory limitations. Furthermore, the experimentation spanned 100 epochs. The loss function of choice was binary cross-entropy.


In the model implementation, a diverse set of tools comes into play. TensorFlow [77], a widely adopted open-source software library, forms the core framework for machine learning applications, supporting operations across various processing units including CPUs, GPUs, and TPUs. The flexibility it offers, in terms of both architectural design and working with pre-trained models, contributes to its popularity. Keras [78], which runs atop TensorFlow, is another critical component: designed for expedited execution of deep learning models, it speeds up development. For this research, Keras is employed to load pre-trained deep learning models, which are then fine-tuned to create the face mask detection classifier through transfer learning.

This study delved into the robustness of the models by subjecting them to training on limited sample sizes. Traditionally, working with small training datasets results in diminished training and testing performance. However, given the unique context of face mask detection in the context of the COVID-19 outbreak, acquiring extensive data for training and testing becomes a challenge. Thus, developing models for face mask detection that excel with small training data becomes pivotal for creating effective applications. Additionally, these models must be locally deployable and capable of achieving tasks at an optimal speed.

To address these requirements, various ratios of training data were employed to assess model performance. The outcomes, presented in **Table 5**, were obtained by running these models on mobile NVIDIA GPUs. The experimental results yield noteworthy insights. MobileNet V2 emerges as the swiftest model, processing nearly 40 frames per second (FPS) on the Jetson TX2 platform. On the other hand, VGG 16 attains the highest accuracy among the models. Notably, when trained using just 1% of the available training data, Inception demonstrates superior performance.
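FPS figures of the kind reported in **Table 5** can be obtained with a simple timing harness. The sketch below is a generic measurement helper, not the actual benchmarking code used on the Jetson boards; the `infer` callable stands in for a real model's inference function:

```python
import time

def measure_fps(infer, frames, warmup=5):
    """Rough frames-per-second estimate for an inference callable.

    A few warm-up calls are issued first so that one-off initialization
    costs (model loading, kernel compilation) do not skew the timing.
    """
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# dummy "model" that takes roughly one millisecond per frame
fps = measure_fps(lambda frame: time.sleep(0.001), list(range(50)))
```

On a real device, `infer` would wrap the deployed network's forward pass, and the frames would come from the camera stream.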

Observations indicate that training accuracy surpasses 90% for each model; given the markedly lower testing scores, this is indicative of overfitting. As the proportion of training data is gradually increased to 5, 10, and 20%, the performance of all models improves, suggesting that additional samples help alleviate overfitting. When training on 20% of the data, VGG secures the highest accuracy while MobileNet attains the lowest. ResNet and Inception showcase comparable performance throughout this experimentation.

Additionally, it is observed in **Table 6** that the pretrained models performed better at recognizing images containing masks in terms of precision, recall, and Fscore. Moreover, these pretrained models achieve promising overall Fscore performance.
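The metrics in **Table 6** follow the standard definitions, sketched below for a single positive class (the labels are toy values, not the chapter's actual predictions):

```python
def precision_recall_fscore(y_true, y_pred, positive=1):
    """Precision, recall, and Fscore for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    return precision, recall, fscore

# 1 = "with mask", 0 = "without mask"
p, r, f = precision_recall_fscore([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Computing the metrics once per class (treating each class in turn as positive) yields exactly the per-class columns shown in the table.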

In summary, we leveraged pre-trained models such as MobileNet V2, ResNet 50, Inception V3, and VGG 16. These models were chosen due to their established performance in various applications. When considering model complexity, MobileNet V2 stands out for its efficiency, boasting the lowest complexity among the aforementioned


**Table 5.**

*Performance comparison on face mask detection on Jetson TX2 and Jetson Nano. FPS refers to the number of images processed per second. Rtr and Ntr denote the ratio of training data and the number of training images.*


#### **Table 6.**

*Comparison of precision, recall, and Fscore.*

models. On the other hand, VGG 16 adheres to a classical architecture characterized by a substantial number of parameters, leading to a higher overall complexity. Inception V3, with its emphasis on capturing multi-scale features, adopts a more intricate architecture than MobileNet V2. In terms of speed and accuracy, as showcased in **Table 5**, MobileNet V2 outperforms the others in speed, whereas VGG 16 exhibits the slowest processing, in line with the principle that more complex models tend to operate more slowly. Notably, VGG 16 achieves the best accuracy relative to the other models, driven by its expansive model capacity. This underscores the trade-off between complexity and performance: VGG 16's higher capacity yields superior accuracy at the cost of processing speed.
