Applications of Deep Learning and Reinforcement Learning

#### **Chapter 3**

## IoT Device Identification Using Device Fingerprint and Deep Learning

*Prashant Baral, Ning Yang and Ning Weng*

#### **Abstract**

The foundation of security in IoT devices lies in their identity. However, traditional identification parameters, such as the MAC address, IP address, and IMEI, are vulnerable to sniffing and spoofing attacks. To address this issue, this paper proposes a novel approach using device fingerprinting and deep learning for device identification. Device fingerprints are generated by analyzing the inter-arrival time (IAT), round trip time (RTT), or IAT/RTT outliers of packets used for communication in networks. We trained deep learning models, namely a convolutional neural network (CNN) and CNN + LSTM (long short-term memory), using device fingerprints generated from combined TCP, UDP, and ICMP packet types, the ICMP packet type alone, and their outliers. Our results show that the CNN model performs better than the CNN + LSTM model. Specifically, the CNN model achieves an accuracy of 0.97 using the IAT device fingerprint of the ICMP packet type, and 0.9648 using the IAT outlier device fingerprint of the ICMP packet type, on a publicly available dataset from the CRAWDAD repository.

**Keywords:** Internet of Things, deep learning, device identification, security, device fingerprinting

#### **1. Introduction**

IoT is used in varied industries, including automobile, manufacturing, agriculture, and medicine. With the increase in the usage of IoT devices in these fields, data transfer between edge devices over the network has also increased. While IoT bridges the gap between the digital and physical worlds, compromised IoT devices can bring dangerous consequences. Wireless networks are more at risk than wired networks: data frames are encrypted in wireless communication, but management and control frames are not encrypted under the IEEE 802.11 standard. This leaves the identity of a wireless device prone to spoofing and denial-of-service attacks. Node forgery, once adversaries get hold of the security credentials, can pose a major security threat.

Adversaries may use a compromised node to send incorrect data. For example, if an IoT device reporting temperature in an industrial plant is compromised, it can ruin the product, and the owner must bear a great loss. Many cryptographic techniques, such as WEP and WPA, can be easily compromised. The IP address, MAC address, or IMEI number could be used for device identification, but there are scenarios where these

addresses get spoofed [1]. The impact of an IoT breach on these varied fields is substantial, and we need an appropriate security mechanism to reduce the risk of data being compromised through IoT device forgery.

Different metrics can be used for device identification, such as the IP address, MAC address, and IMEI, as well as network parameters such as transmission time, transmission rate, inter-arrival time, and medium access time. A comparison of different metrics for device identification is given in **Table 1**. Parameters such as the MAC address and IP address are easy to spoof, so studies have sought parameters that can reliably distinguish devices. In [2], transmission time, transmission rate, inter-arrival time, and medium access time were compared; IAT and transmission time outperformed the other parameters for device identification.

In this paper, we take a deep learning approach to device identification. A device fingerprint is created from parameters extracted during the communication of a device with a router. This device fingerprint is used to train the deep learning model for device identification.

We fingerprint a device using IAT, RTT, and their outliers and feed them to deep learning models for device identification. These parameters are easy to extract and are hard to spoof once the device fingerprint has been created from them. Timestamps (from which IAT and transmission time are extracted) are generated at the receiver side, which makes them harder to sniff and spoof; adversaries would need to change their own behavior to manipulate these parameters. IAT and RTT vary between devices due to different CPU configurations and clock frequencies. They depend on cache configuration, data cache, instruction cache, clock frequency, buses, and the NIC, and these hardware configurations affect packet transfer rates. Attackers might try to emulate a signature using techniques such as (1) introducing delays in packets, (2) changing the data rate, and (3) running a customized operating system. Even with such techniques, an attacker does not succeed in emulating the device, since the attacker must spoof the target signature while also hiding its own.

We use deep learning to extract knowledge from the data, which allows us to better understand and simulate the system model. CNN learns the semantics and patterns in image graphs. Similarly, LSTM is recognized as a good algorithm for the classification of time series data. We use these two deep learning algorithms for the classification of devices. In earlier research, statistical tools such as the Mann-Whitney U-test were used, but these require much time invested in


#### **Table 1.**

*Comparison of parameters for fingerprinting.*

preprocessing of data. Machine learning approaches also require us to prepare structured data before feeding it to ML algorithms.

Deep learning algorithms avoid these drawbacks, as they learn from unstructured data while passing it through each layer. A key factor for using deep learning is prediction time: the model's parameters are computed during training, so when we provide a device fingerprint, the prediction is made quickly. This would take more time with classical ML or other mathematical tools.

We use a dataset generated from our own setup, as well as a publicly available dataset, for training the model. We use combined TCP, UDP, and ICMP packets, ICMP packets alone, and the outliers of those packets to create the device fingerprint. A new method [3] of device identification has been introduced that collects device information to generate a fingerprint of the hardware, which can then be used for device identification. It uses four different types of packets, namely probe request, ping, TCP, and UDP packets, to generate IAT graphs, and it achieves lower accuracy using CNN for classification.

In our work, we fingerprint two devices: a Samsung A20 and a Samsung J5 Prime. We plot the IAT and RTT of the packets (probe requests for IAT and pings for RTT) of each device, use those plots as datasets, and feed them to deep learning models for device identification. We also use the publicly available dataset from the CRAWDAD repository [4], introduced by Radhakrishnan et al. [5], to verify our results. This dataset provides IAT information collected actively and passively from different wireless devices using wire-side observations in a local network. They captured traffic from 30 different devices, including iPads, iPhones, netbooks, Google phones, IP cameras, Kindles, and IP printers, across various applications and protocols such as TCP, UDP, Skype, ICMP, SCP, and Iperf. Our main contribution consists of:


The remainder of the paper is organized as follows. Section 2 briefly discusses related work. Section 3 describes device fingerprinting; the setup, the methods for extracting data and creating image graphs, and the preparation of datasets are also discussed in that section. The resulting datasets are fed to the deep learning models described in Section 4 for device classification. Experimental results are presented in Section 5, and the paper is concluded in Section 6.

#### **2. Related work**

The use of the IP address, MAC address, or IMEI number for device identification brings significant risks that critical information, and the device itself, will be compromised. This has prompted researchers to produce flexible and effective techniques for

device identification [1, 6, 7]. For example, a new stack [1] for IoT identity has been proposed, since it differs from the traditional identity of network devices, along with a survey on attribute-based authentication for the identity of IoT devices.

Neumann et al. [2] survey different MAC-layer features, such as transmission rate, transmission time, and inter-arrival time, and evaluate them on two criteria for effectiveness: similarity of the same device's fingerprint at different times and dissimilarity of the fingerprints of two different devices. In [2], the authors use the IATs of packets from wireless devices to create digital fingerprints, building a histogram where each bin gives the frequency of IATs in a specified range. This histogram is the fingerprint used for classification and for identifying known and unknown devices against a database. The authors tested the scenario where a malicious user tries to emulate a known device by introducing delays into the packets and concluded that differences in software and hardware make such emulation difficult. While [2] uses a passive approach to fingerprinting, Radhakrishnan et al. [5] extended that work with an active approach. In the passive approach, we only observe the wireless communication to/from the device and use the important features of its packets; in the active approach, we inject a signal to elicit a response from the device and extract useful features from it. Sandhya et al. [8] used CNN but considered all types of packets flowing from devices to the AP for device classification. This might be practical, but the lower accuracy of 86% may be problematic from a security point of view.

In [5], the authors used a ping application to communicate with devices on a campus network. In [9], the authors used the IAT of probe requests to fingerprint a device and used the Mann-Whitney U-test to analyze whether two samples come from the same distribution. Miettinen et al. [10] used 23 features, such as ICMP, TCP, HTTP, and packet size, drawn from different layers (data link, network, transport, and application layers, etc.). That work collects the 23 features over 12 packets and uses a random forest algorithm for classification; the accuracy obtained was 95% for 17 of the 27 devices and 50% for the rest [10].

Robyns et al. [11] introduce the idea of noncooperative MAC-layer fingerprinting, which does not require cooperation with the device, as it uses adversary nodes at the monitoring station to capture and monitor the bits of MAC frames without the user's permission. This hampers the privacy of the user but provides security against outside attacks. The accuracy when classifying 50 to 100 devices was between 67% and 80%, but it decreased rapidly to 33–15% as the number of devices increased.

Kohno et al. [12] used clock skew for fingerprinting devices, measuring it from the time differences of timestamps in Tcpdump traces; the work considered scenarios where IP addresses changed during data collection. Maurice et al. [13] used probe requests and responses for fingerprinting, but the results were not promising for similar devices. Cunche et al. [14] used the probe request sent to an AP, whose response reveals the list of wireless networks a device has connected to, and exploited this vulnerability to identify people from those network lists. Francois et al. [15] made use of behavioral fingerprinting to automatically disconnect a device showing suspicious activity and ask it to reconnect based on its behavioral fingerprint. Sun et al. [16] use fingerprinting for indoor and outdoor localization of devices connected to a Wi-Fi AP.

Xu et al. [1] studied the challenges and opportunities in digital fingerprinting for both wired and wireless devices. The authors extracted features from the physical and MAC layers, such as clock skew, IAT, transmission time, SSID, and frequency. The

work concluded that IAT and transmission time are good parameters for device classification based on accuracy.

Kulin et al. [17] used different algorithms, such as k-NN, decision trees, logistic regression, and neural networks, for device classification on publicly available datasets. The performance of k-NN, decision trees, and logistic regression was good, but the neural networks performed worse than the other classification algorithms, with an average precision of 0.47 and recall of 0.46. The common expectation is that neural networks should outperform the others, but this was not the case in that work.

#### **3. Device fingerprinting**

We set up the devices in the lab to extract information about them. First, we set up a Raspberry Pi as a router. Next, we use a Samsung A20 and a Samsung J5 Prime as edge devices (target IoT devices). Wireless communication between the edge devices and the router was recorded: Wireshark, a sniffing application, captures the packets incoming to and outgoing from the Raspberry Pi. These captured packets are used to calculate the IATs/RTTs of packets and to plot IAT, RTT, and IAT outlier graphs. These graphs are used as datasets to train and test the model. A Python program is used to plot, label, and split the dataset. The training split trains the deep learning model, and the testing split validates it. Our overall methodology is depicted in **Figure 1** and explained in detail in the subsections of this section and in Section 4.

#### **3.1 Our setup**

Our setup has a Raspberry Pi as a router and phones as the edge devices. The Raspberry Pi (acting as a router) broadcasts an access point. The packets sent from the edge devices are captured at the router, which has a packet sniffing tool installed. Wireshark is installed on the Raspberry Pi, where it inspects, deciphers, and keeps track of all incoming and outgoing packets. As many packets may arrive at the router, we use a filter to find the required packets. We collected the data in two ways: 1. probe request and response, and 2. ping request and response.

Probe requests are packets broadcast by wireless devices that advertise their supported data rates and capabilities. The access point receives these requests and responds with packets consisting of the SSID, supported data rates, encryption type, etc. We used the sniffing tool Wireshark to passively sniff the packets at the router and used those packets to build IAT graphs.

Ping sends an ICMP echo request packet to a device on the network and waits for the response from the target device. In our setup, the router pings the edge device, and the edge device responds to the router. This ICMP packet exchange is passively observed and recorded by Wireshark. This data is used to build RTT graphs.

#### **3.2 Analysis of data and creation of image graphs**

The data collected by the sniffing tool must be processed to obtain IAT and RTT. We obtain the data using the sniffing tool on the Raspberry Pi; these data are the timestamps of incoming and outgoing packets. We process the timestamps to calculate the IAT and RTT of the packets.
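
The derivation of IAT and RTT from captured timestamps can be sketched as follows (a minimal example with made-up timestamp values; the actual capture files and field layouts differ):

```python
def inter_arrival_times(timestamps):
    """IAT: difference between consecutive packet arrival times."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def round_trip_times(request_ts, reply_ts):
    """RTT: time between an ICMP echo request and its matching reply."""
    return [reply - req for req, reply in zip(request_ts, reply_ts)]

# Synthetic timestamps in seconds
arrivals = [0.000, 0.102, 0.205, 0.301]
iats = inter_arrival_times(arrivals)                  # ~[0.102, 0.103, 0.096]
rtts = round_trip_times([0.0, 1.0], [0.021, 1.019])   # ~[0.021, 0.019]
```

One such list of IATs (or RTTs) per device is what gets windowed into graphs in the next step.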

After we obtain the IAT and RTT values of the packets, we write a Python program to plot the graphs and save them. Each IAT and RTT graph is plotted as a line graph of 100

**Figure 1.** *Methodology.*

IATs/RTTs. The plots of IAT and RTT are shown in **Figures 2**–**4**. We use IAT and RTT separately for device identification.
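
The plotting step can be sketched roughly as below (a minimal sketch using matplotlib; the window size of 100 matches the text, but the file name and figure size are illustrative assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

def save_iat_graph(iats, filename, window=100):
    """Plot one window of IATs as a line graph and save it as an image."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(iats[:window])
    ax.set_xlabel("packet index")
    ax.set_ylabel("IAT (s)")
    fig.savefig(filename)
    plt.close(fig)

save_iat_graph([0.1, 0.12, 0.09, 0.11] * 25, "device0_graph0.png")
```

Each saved image then becomes one sample in the training or testing dataset.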

#### **3.3 Preparation of data**

The images obtained by plotting the graphs must be labeled before we use them for training and testing the model on different metrics. We label the data using Python: for the two phones, 0 represents the Samsung A20 and 1 represents the Samsung J5 Prime. We split the total images into training data and testing data. For each of IAT and RTT, we use 75 images for training and 30 for testing per device (150 total for training and 60 for testing). After creating and labeling the images, we apply the CNN and CNN + LSTM algorithms for image classification.
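
The labeling and 75/30 split per device described above can be sketched as follows (the file naming and shuffling seed are illustrative assumptions):

```python
import random

def label_and_split(files_per_device, n_train=75, n_test=30, seed=0):
    """files_per_device: dict mapping integer label -> list of image paths.
    Returns (train, test) lists of (path, label) pairs."""
    rng = random.Random(seed)
    train, test = [], []
    for label, files in files_per_device.items():
        files = files[:]      # copy before shuffling
        rng.shuffle(files)
        train += [(f, label) for f in files[:n_train]]
        test += [(f, label) for f in files[n_train:n_train + n_test]]
    return train, test

# Two devices: 0 = Samsung A20, 1 = Samsung J5 Prime
dataset = {0: [f"a20_{i}.png" for i in range(105)],
           1: [f"j5_{i}.png" for i in range(105)]}
train, test = label_and_split(dataset)
# len(train) == 150, len(test) == 60
```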

**Figure 2.** *IAT graph from our setup.*

**Figure 3.** *IAT graph from verification dataset.*

We use the IAT dataset from CRAWDAD, which was developed by Uluagac et al. [4]. The dataset is a collection of IATs of different devices. We use four devices, two iPads and two Dell notebooks, for the verification of the models. First, we use ICMP packets to generate the IAT graphs. Since we are comparing classification using a single packet type, multiple packet types, and outliers, we also use TCP, UDP, and ICMP packets together to generate IAT graphs and outliers. We plot each graph using 100

**Figure 4.** *RTT graph from our setup.*

IATs. As in our setup, we label the devices, zero for Dell notebook 1, one for Dell notebook 2, two for iPad 1, and three for iPad 2, and split the images into training and testing datasets.

#### **4. Deep learning model for classification**

We use a convolutional neural network (CNN) and a combination of a convolutional neural network and long short-term memory (CNN + LSTM) for device classification. Since we are using images of time series data, we consider CNN due to its large breakthroughs in image recognition. Moreover, CNN is very cost-effective due to its reduced number of parameters without losing quality. Furthermore, given LSTM's reputation for time series data and our approach of converting time series data to images for classification, we test whether the combination of CNN + LSTM gives better results than CNN alone.

#### **4.1 Convolutional neural network for device classification**

The created images are colored, but for this classification problem, we convert them to grayscale and reduce the image size from the initial 800 \* 800 to 256 \* 256. Then we split the labeled data into training and testing datasets and use the training set to train the CNN model. Our CNN model's first convolution layer has 32 filters and a kernel size of 5 \* 5; the input size of this layer is set to 256 \* 256 \* 1. Next, we use max-pooling with stride length 2, which helps reduce the parameters by selecting the maximum of each group of four inputs (2 in the x-direction and 2 in the y-direction). The next convolution layer in our model has 64 filters and a kernel size of 3 \* 3. The input

*IoT Device Identification Using Device Fingerprint and Deep Learning DOI: http://dx.doi.org/10.5772/intechopen.111554*


#### **Figure 5.**

*CNN model summary.*

to this layer is set by Keras. We again use max-pooling with stride length 2. The third convolution layer has 128 filters and a kernel size of 2 \* 2, and we max-pool with a stride length of 2 for this layer as well. For all these convolution layers, we use the Rectified Linear Unit (ReLU) as the activation function. Next, we use a flatten layer and two dense layers with 128 and 64 nodes, followed by a dense layer with four nodes with softmax as the activation function. **Figure 5** shows the model summary of the CNN. The model is compiled using categorical cross-entropy for the loss and Adam as the optimizer. We use both IAT and RTT data for training the CNN model and check how good its classification is using different metrics. Furthermore, we use outliers of the IAT data for classification. When training on different datasets, the number of nodes and epochs is changed.
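
The convolution layer sizes above imply parameter counts that can be checked with a few lines of arithmetic (a sketch; it assumes one bias per filter, and padding does not affect parameter counts):

```python
def conv_params(filters, kernel, in_channels):
    """Trainable parameters of a 2D convolution layer:
    a kernel*kernel*in_channels weight tensor per filter, plus one bias each."""
    return filters * (kernel * kernel * in_channels + 1)

p1 = conv_params(32, 5, 1)    # first conv layer: 32 filters, 5*5, grayscale input
p2 = conv_params(64, 3, 32)   # second conv layer: 64 filters, 3*3
p3 = conv_params(128, 2, 64)  # third conv layer: 128 filters, 2*2

print(p1, p2, p3)  # 832 18496 32896
```

These counts are small relative to the dense layers that follow, which is the cost-effectiveness of CNNs mentioned above.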

#### **4.2 Combination of CNN and LSTM for device classification**

We combine CNN and LSTM using the concept of TimeDistributed layer. We provide n images at a time to the first TimeDistributed convolution layer; this applies

**Figure 6.** *CNN + LSTM model.*

the same filter to the n images. We use the same three CNN layers, but wrapped in TimeDistributed. This is illustrated in **Figure 6**. The input to the first layer is n \* 256 \* 256 \* 1; the other input sizes are managed by Keras. This model has an additional LSTM layer with 32 nodes after the CNN layers. The output of MaxPool2D is flattened to get a single vector, which is fed to the LSTM and then a dense layer. **Figure 7** shows the model


**Figure 7.** *CNN + LSTM model summary.*

summary of CNN + LSTM. The LSTM makes use of chronological data and previous frame data to find what is useful for prediction. The model is compiled using categorical cross-entropy for the loss and Adam as the optimizer. We use the combination of CNN and LSTM and observe how good a prediction the model can make. When training on different datasets, the number of nodes and epochs is changed.
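
To see what the LSTM actually receives per frame, the shapes can be traced with simple arithmetic (a sketch assuming 'valid' padding and non-overlapping 2 \* 2 pooling, which are plausible but not stated in the text):

```python
def conv_out(size, kernel):
    """Output width of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width of non-overlapping max pooling."""
    return size // pool

s = 256                       # each frame is 256 * 256 * 1
s = pool_out(conv_out(s, 5))  # conv 5*5 -> 252, pool -> 126
s = pool_out(conv_out(s, 3))  # conv 3*3 -> 124, pool -> 62
s = pool_out(conv_out(s, 2))  # conv 2*2 -> 61,  pool -> 30
flattened = s * s * 128       # features handed to the LSTM per frame
print(flattened)  # 115200
```

Under these assumptions, each time step hands the 32-node LSTM a 115,200-dimensional flattened vector, which is consistent with the later observation that a small LSTM output size may lose valuable information.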

#### **4.3 Metrics for model evaluation**

Evaluation of the model is an important task in data science. We need to make sure our model is not overfitted. Overfitting is a modeling error in statistics that occurs when the model aligns too closely with a limited set of data points. There are different techniques to prevent overfitting; the ones we use are reducing the learning rate and a dropout layer. While training the model, we can monitor the validation metrics, and if they do not improve for a certain number of epochs, we reduce the learning rate by a certain factor. Below is the snippet for reducing the learning rate, where we monitor the validation loss and reduce the learning rate by a factor of 0.1 when it has not improved for 3 consecutive epochs.

`tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, verbose=0, min_lr=1e-6)`
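
The rule this callback implements can be mimicked in plain Python (a simplified sketch of the plateau logic, ignoring Keras details such as cooldown and min_delta):

```python
def schedule_lr(val_losses, lr=1e-3, factor=0.1, patience=3, min_lr=1e-6):
    """Reduce lr by `factor` whenever validation loss fails to improve
    for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return lr

# Loss stalls after epoch 2, so the rate drops once: 1e-3 -> 1e-4
print(schedule_lr([0.9, 0.5, 0.4, 0.41, 0.42, 0.43]))
```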

Similarly, a dropout rate can be specified for a layer as the probability of setting each input to the layer to zero. Below is the code for adding the dropout layer; the rate is set to 0.3, which drops 30% of the input units.

`model.add(Dense(128, activation='relu'))`

`model.add(Dropout(0.3))`

The most common metric used for the evaluation of the algorithm is classification accuracy. Classification accuracy is equal to the number of correct predictions made divided by the total number of predictions made.

In our case, we use categorical cross-entropy for the calculation of loss, which makes use of the predicted probability of belonging to each class.

$$\text{Classification loss} = -\sum\_{i=1}^{output} y\_i \log f(s)\_i \tag{1}$$

where *y<sub>i</sub>* is the ground-truth indicator for class *i* and *f(s)<sub>i</sub>* is the predicted probability of belonging to that class. We also need to control the number of passes over the training data, called epochs. Too much training can result in the network overfitting to the training data. If, while training for a number of epochs, the validation error increases while the training loss decreases or remains constant, we can conclude that our model is overfitting, as shown in **Figure 8**.
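
Eq. (1) for a single sample can be computed directly (a small numeric check with made-up probabilities):

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Loss for one sample: -sum_i y_i * log(p_i)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

# One-hot label: the sample belongs to class 2 of 4 devices
loss = categorical_cross_entropy([0, 0, 1, 0], [0.1, 0.2, 0.6, 0.1])
print(round(loss, 4))  # 0.5108
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the true class, so confident correct predictions drive the loss toward zero.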

#### **5. Results**

Our setup has the Samsung A20 and Samsung J5 Prime phones communicating with the Raspberry Pi. As described in Section 3.2, we created the IAT graphs using probe requests and responses from these devices to the Raspberry Pi, prepared the data for feeding to the CNN, and evaluated the model. We trained the CNN model as in Section 4.1 for 10 epochs and obtained an accuracy of 1.00 and a loss of 0.0021 on the training data; accuracy on the validation dataset was 1.00 with a loss of 0.0021. Using the IAT graph for classification with the CNN + LSTM model and running for 30 epochs, the accuracy and loss were 1 and

**Figure 8.** *Model loss.*

0.0015 on the training dataset and 1 and 0.0011 on the validation dataset. Similarly, we created the RTT graph using ping as in Section 3.2 and trained for 10 epochs with the CNN and 40 epochs with the CNN + LSTM, achieving 100% classification accuracy with both.

We used the IAT dataset from CRAWDAD, developed by Uluagac et al. [4], for verification. We used ICMP packets from two Dell notebooks and two iPads communicating in the local area network. Using CNN for classification and running for 10 epochs, we achieved an accuracy of 1 and a loss of 1.4 \* 10<sup>-4</sup> on the training dataset. We achieved an accuracy of 0.97 and a loss of 0.1326 on the validation dataset.

**Figure 9.** *Accuracy using IAT(ICMP) as parameter from verification dataset using CNN.*


**Figure 10.** *Loss using IAT(ICMP) as parameter from verification dataset using CNN.*

**Figures 9** and **10** show the learning curve of the CNN model. Using CNN + LSTM for classification and running for 35 epochs, we achieved an accuracy of 0.9463 and a loss of 0.1906 in the training dataset. We achieved an accuracy of 0.9060 and a loss of 0.3115 in the validation dataset. **Figures 11** and **12** show the learning curve of the CNN + LSTM model.

**Figure 11.** *Accuracy using IAT(ICMP) as parameter from verification dataset using CNN + LSTM.*

**Figure 12.** *Loss using IAT(ICMP) as parameter from verification dataset using CNN + LSTM.*

After analyzing the IAT graphs, we found a regular pattern of outliers and asked whether the outliers in the IAT graph can better classify a device using these deep learning algorithms. We utilized the outliers in the IATs of the verification dataset for four devices: two Dell notebooks and two iPads. Inter-burst latencies appear between bursts of IAT packets, and we utilize these for classification. We plotted the outlier graphs for the four devices, using a separate threshold for each, and used the CNN and CNN + LSTM algorithms for classification. We used the same CNN configuration (convolution layers, input size, activation function, number of layers, etc.) for classification using the IAT outlier graphs. We ran the model for 10 epochs and achieved a training accuracy and loss of 0.9981 and 0.0079 and a validation accuracy and loss of 0.9648 and 0.1397, respectively. **Figures 13** and **14** show the learning curve of the CNN model trained on the outlier dataset. We also used the same CNN + LSTM configuration (convolution layers, LSTM layer, activation function, number of layers, etc.) for classification using the IAT outlier graphs. We ran the model for 15 epochs and achieved a training accuracy and loss of 0.9870 and 0.0520 and a validation accuracy and loss of 0.9574 and 0.1422, respectively. **Figures 15** and **16** show the learning curve of the CNN + LSTM model trained on the outlier dataset.
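
The outlier-extraction step can be sketched as follows (the text does not specify how each device's threshold is chosen; a fixed multiple of the device's median IAT is an illustrative assumption):

```python
import statistics

def iat_outliers(iats, k=5.0):
    """Keep IATs larger than k times the device's median IAT,
    i.e., the inter-burst latencies that dominate the outlier graph."""
    threshold = k * statistics.median(iats)
    return [x for x in iats if x > threshold]

# Mostly ~0.1 s gaps within bursts, with occasional inter-burst latencies
iats = [0.1, 0.11, 0.09, 0.1, 2.5, 0.1, 0.12, 3.1, 0.1]
print(iat_outliers(iats))  # [2.5, 3.1]
```

The retained values are then plotted per device, in the same windowed line-graph form as the full IAT graphs.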

To validate the improvement of classification using single packet types (ICMP/probe request) in our work, we also classified the devices using the TCP, UDP, and ICMP packet types from the same CRAWDAD IAT dataset, as in [8]. The IAT graphs generated for these packet types were used together for training and testing the model. We trained the CNN model for 16 epochs and put a dropout layer after the flatten layer to prevent overfitting. We used 18,000 training images and 6000 testing images and obtained an accuracy of 0.9656 and a loss of 0.0894; the validation accuracy and validation loss were 0.9290 and 0.3073, respectively. **Figures 17** and **18**


**Figure 13.** *Accuracy using IAT(ICMP) outlier graph from verification dataset using CNN.*

**Figure 14.**

*Loss using IAT(ICMP) outlier graph from verification dataset using CNN.*

show the learning curve of CNN model using image graphs of IAT generated using TCP, UDP, and ICMP packet types from the verification dataset.

Again, for these different packet types, we considered the outliers and classified the devices using the outliers of IAT. We trained the CNN model for 20 epochs and put the

**Figure 15.** *Accuracy using IAT (ICMP) outlier graph from verification dataset using CNN + LSTM.*

**Figure 16.** *Loss using IAT (ICMP) outlier graph from verification dataset using CNN + LSTM.*

dropout layer after the flatten layer to prevent overfitting. We used 5440 training images and 1700 testing images and obtained an accuracy of 0.8888 and a loss of 0.2704; the validation accuracy and validation loss were 0.8504 and 0.4344, respectively.


**Figure 17.** *Accuracy using IAT (TCP, UDP, ICMP) as parameter from verification dataset using CNN.*

**Figure 18.** *Loss using IAT (TCP, UDP, and ICMP) as parameter from verification dataset using CNN.*

**Figures 19** and **20** show the learning curve of the CNN model using image outlier graphs of IAT generated using TCP, UDP, and ICMP packet types from the verification dataset.

**Figure 19.** *Accuracy using IAT(TCP, UDP, and ICMP) outlier graph from verification dataset using CNN.*

**Figure 20.**

*Loss using IAT(TCP, UDP, and ICMP) outlier graph from verification dataset using CNN.*

#### **5.1 Comparison of models and parameters for IAT outlier graphs and IAT graphs from verification dataset**

The summary of the models and parameters is shown in **Table 2**. When we used IAT graphs, the validation accuracy is 0.97 for CNN, which is better than


**Table 2.**

*Performance of models in terms of validation accuracy and validation loss using verification dataset.*

CNN + LSTM, for which the validation accuracy is 0.9060. When we used the IAT outlier graphs, the validation accuracy is 0.9648 for CNN and 0.9574 for CNN + LSTM. We observe that classification accuracy is similar for CNN regardless of whether the IAT graph or the IAT outlier graph is used, but for CNN + LSTM the accuracy is lower when using the IAT graph than when using the IAT outlier graph.

We noticed that the combination of CNN and LSTM cannot outperform the CNN-alone model. The first reason is that the input to the LSTM is a flattened version of the CNN's output rather than a specific time series; therefore, the time dependence captured by the LSTM may not reflect the relationship among the input images. The second reason is that the LSTM layer used in the experiments has a small output size, so some valuable information may be lost.

#### **6. Conclusion**

In this work, we classified devices using two parameters, namely inter-arrival time (IAT) and round-trip time (RTT), and two deep learning algorithms, namely CNN and a combination of CNN and LSTM. We used the IAT and RTT image graphs as device fingerprints and modeled them using the two deep learning algorithms. We captured the packets using the packet sniffing tool on the Raspberry Pi (router) for two different setups; IAT and RTT were recorded for each device by the sniffing tool in real time. The security threat posed by adversaries once they forge an IoT device makes device identification a fundamental problem. The dynamic parameters that we used depend on hardware and software (CPU cache, data cache, clock frequency, etc.), which makes it harder for intruders to recreate the fingerprint of a device. We used deep learning to extract knowledge from the data. The widespread recognition of CNN as a good algorithm for image classification encouraged us to use it. Moreover, as LSTM has made its name in the classification of time series data, we used a combination of CNN and LSTM because we were using image graphs of time series data for training the model. Our approach can be used to detect a malicious user if we store the fingerprint and match it against the fingerprint of any device trying to connect to the network before allowing it to connect. Our approach offers an alternative to using IMEI, IP and MAC addresses, cryptographic security, and digital certificates for device identification, which are prone to spoofing.

We used two different parameters and obtained good accuracy in our real setup. We also verified our model using the publicly available dataset for the single ICMP packet type and were able to achieve a validation accuracy of 0.97 for CNN and 0.9060 for CNN + LSTM. We compared two deep learning algorithms for device identification. Both models were good when we used the dataset generated from our setup, but on the CRAWDAD dataset, CNN was more accurate in classification than CNN + LSTM. We further used IAT outlier graphs for classification and achieved a validation accuracy of 0.9648 for CNN and 0.9574 for CNN + LSTM. To validate the improvement in classification accuracy using the ICMP packet type, we also classified the devices using the TCP, UDP, and ICMP packet types from the verification dataset, and we achieved better accuracy using the single ICMP packet type for classification.

We collected RTT data in our setup and achieved good accuracy in classification. In the future, we can collect RTT data in a real scenario with many devices and use it for classification.

#### **Acknowledgements**

This work is supported in part by the US National Science Foundation under Grant CC-2018919. Besides the NSF grant support, Dr. Yang's work is also supported in part by the new-hire startup fund from Southern Illinois University Carbondale.

### **Conflict of interests**

The authors declare that there are no conflicts of interest regarding the publication of this article.

### **Author details**

Prashant Baral<sup>1</sup>†, Ning Yang<sup>2</sup> and Ning Weng<sup>3</sup>\*

1 Advanced Micro Devices, Inc., Austin, TX, USA

2 Information Technology Program in the School of Computing, Southern Illinois University Carbondale, IL, USA

3 School of Electrical, Computer, and Biomedical Engineering, Southern Illinois University Carbondale, IL, USA

\*Address all correspondence to: nweng@siu.edu

† These authors contributed equally.

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*IoT Device Identification Using Device Fingerprint and Deep Learning DOI: http://dx.doi.org/10.5772/intechopen.111554*

#### **References**

[1] Xu Q, Zheng R, Saad W, Han Z. Device fingerprinting in wireless networks: Challenges and opportunities. IEEE Communications Surveys & Tutorials. 2015;**18**(1):94-104

[2] Neumann C, Heen O, Onno S. An empirical study of passive 802.11 device fingerprinting. In: 2012 32nd International Conference on Distributed Computing Systems Workshops. Macau, China: IEEE; 2012. pp. 593-602

[3] Bratus S, Cornelius C, Kotz D, Peebles D. Active behavioral fingerprinting of wireless devices. In: Proceedings of the First ACM Conference on Wireless Network Security. New York, NY, USA: ACM; 2008. pp. 56-61

[4] Uluagac AS. CRAWDAD dataset gatech/fingerprinting (v. 2014-06-09). 2014. Available from: https://crawdad.org/gatech/fingerprinting/20140609

[5] Uluagac AS, Radhakrishnan SV, Corbett C, Baca A, Beyah R. A passive technique for fingerprinting wireless devices with wired-side observations. In: 2013 IEEE Conference on Communications and Network Security (CNS). Washington, D.C., USA: IEEE; 2013. pp. 305-313

[6] Hamad SA, Zhang WE, Sheng QZ, Nepal S. Iot device identification via network-flow based fingerprinting and learning. In: 2019 18th IEEE International Conference on Trust, Security and Privacy In Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/ BigDataSE). Rotorua, New Zealand: IEEE; 2019. pp. 103-111

[7] Mazhar N, Salleh R, Zeeshan M, Hameed MM. Role of device identification and manufacturer usage description in iot security: A survey. IEEE Access. 2021;**9**:41757-41786

[8] Aneja S, Aneja N, Islam MS. Iot device fingerprint using deep learning. In: 2018 IEEE International Conference on Internet of Things and Intelligence System (IOTAIS). Bali, Indonesia: IEEE; 2018. pp. 174-179

[9] Desmond LCC, Yuan CC, Pheng TC, Lee RS. Identifying unique devices through wireless fingerprinting. In: Proceedings of the First ACM Conference on Wireless Network Security. New York, NY, USA: ACM; 2008. pp. 46-55

[10] Miettinen M, Marchal S, Hafeez I, Asokan N, Sadeghi A-R, Tarkoma S. Iot sentinel: Automated device-type identification for security enforcement in iot. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). Atlanta, USA: IEEE; 2017. pp. 2177-2184

[11] Robyns P, Bonné B, Quax P, Lamotte W. Noncooperative 802.11 mac layer fingerprinting and tracking of mobile devices. Security and Communication Networks. 2017;**2017**: 1-21

[12] Kohno T, Broido A, Claffy KC. Remote physical device fingerprinting. IEEE Transactions on Dependable and Secure Computing. 2005;**2**(2):93-108

[13] Maurice C, Onno S, Neumann C, Heen O, Francillon A. Improving 802.11 fingerprinting of similar devices by cooperative fingerprinting. In: 2013 International Conference on Security and Cryptography (SECRYPT). Reykjavik, Iceland: IEEE; 2013. pp. 1-8

[14] Cunche M. I know your mac address: Targeted tracking of individual using wi-fi. Journal of Computer Virology and Hacking Techniques. 2014;**10**(4):219-227

[15] François J, State R, Engel T, Festor O. Enforcing security with behavioral fingerprinting. In: 2011 7th International Conference on Network and Service Management. Paris, France: IEEE; 2011. pp. 1-9

[16] Sun L, Chen S, Zheng Z, Xu L. Mobile device passive localization based on ieee 802.11 probe request frames. Mobile Information Systems. 2017;**2017**: 1-10

[17] Kulin M, Fortuna C, De Poorter E, Deschrijver D, Moerman I. Data-driven design of intelligent wireless networks: An overview and tutorial. Sensors. 2016; **16**(6):790

#### **Chapter 4**

## MultiRes Attention Deep Learning Approach for Abdominal Fat Compartment Segmentation and Quantification

*Bhanu K.N. Prakash, Arvind Channarayapatna Srinivasa, Ling Yun Yeow, Wen Xiang Chen, Audrey Jing Ping Yeo, Wee Shiong Lim and Cher Heng Tan*

#### **Abstract**

Global increase in obesity has led to an alarming rise in co-morbidities and a deteriorated quality of life. Obesity phenotyping benefits profiling and management of the condition but warrants accurate quantification of fat compartments. Manual quantification from MR scans is time consuming and laborious. Hence, many studies rely on semi-/automatic methods for quantification of abdominal fat compartments. We propose a MultiRes-Attention U-Net with a hybrid loss function for segmentation of different abdominal fat compartments, namely (i) superficial subcutaneous adipose tissue (SSAT), (ii) deep subcutaneous adipose tissue (DSAT), and (iii) visceral adipose tissue (VAT), using abdominal MR scans. MultiRes block, ResAtt-Path, and attention gates can handle shape, scale, and heterogeneity in the data. The dataset comprised MR scans from 190 community-dwelling older adults (mainly Chinese, 69.5% female) with mean age 67.85 ± 7.90 years and BMI 23.75 ± 3.65 kg/m<sup>2</sup>. Twenty-six datasets were manually segmented to generate the ground truth. Data augmentations were performed using MR data acquisition variations. Training and validation were performed on 105 datasets, while testing was conducted on 25 datasets. Median Dice scores were 0.97 for SSAT & DSAT and 0.96 for VAT, and mean Hausdorff distance was <5 mm for all three fat compartments. Further, MultiRes-Attention U-Net was tested on 190 new datasets (unseen during training; upper and lower abdomen scans with different resolutions), which yielded accurate results. MultiRes-Attention U-Net significantly improved performance over MultiResUNet, showed excellent generalization, and holds promise for body profiling in large cohort studies.

**Keywords:** MultiRes attention, deep learning, fat compartments, abdomen, subcutaneous fat compartments, visceral fat

#### **1. Introduction**

Obesity is a globally growing epidemic that has affected more than 2 billion adults and many teens (18 years plus) who are overweight, of which 650 million are obese [1]. Anthropometric measurements such as waist-to-hip ratio, body mass index (BMI), and waist circumference do not explicitly distinguish fat mass or the quantity of fat present in the visceral and subcutaneous compartments. The literature highlights that accumulation of fat leads to insulin resistance and to oncologic and cardiovascular diseases [2–4], affecting quality of life. Hence, body composition analysis to determine the amount of adipose and muscle tissue is of medical importance for obesity risk analysis. Magnetic resonance imaging (MRI) and computed tomography (CT) can characterize fat and non-fat tissues [5]. Among the imaging modalities, MR is more efficient in tissue characterization than CT for quantification of body fat volume [6, 7]. By quantifying different fat compartments from the imaging scans, we can perform body composition analysis. Manual quantification of fat and muscle volumes from imaging scans is tedious and time-consuming, leading to loss of clinical man-hours.

Anatomically, the subcutaneous adipose tissue compartments (superficial: SSAT and deep: DSAT) are separated by a thin fascia, whereas the visceral adipose tissue (VAT) is found between the internal and external abdominal boundaries. VAT surrounds the internal organs and is discontinuous, whereas SAT (SSAT + DSAT) is continuous. Fat depots are irregular in shape, lack texture, and vary across the abdominal profile, as demonstrated in **Figure 1**, making this a challenging medical image segmentation task. Several semi-automated methodologies have been developed to reduce time and bias [8–12]. These methodologies are less reliable and offer low accuracy as they depend on expert knowledge for fine-tuning image parameters.

Deep learning for image segmentation [13] has found many applications in medical image analysis, one of which is abdominal fat compartment segmentation. Several fat quantification studies use single-contrast Dixon MR scans and 2D/3D U-Net architectures [14, 15] for SAT and VAT segmentation. Enhanced versions of the standard U-Net, such as the Competitive Dense Fully Convolutional Network (CDFNet), nnU-Net, and Dense Convolutional Network (DCNet), which can handle complex image features, have been used for adipose tissue segmentation [16–18]. The attention gate (AG) model in 2D and 3D U-Net [19] has gained popularity in the adipose tissue segmentation task because AGs focus on target structures of varying shapes and sizes by suppressing irrelevant regions and highlighting useful salient features [20, 21]. Ibtehaz et al. proposed a MultiRes block to address multiscale issues and a ResPath to reduce adverse learning of features, which might otherwise lead to false predictions through the skip connections of U-Net [22].

#### **Figure 1**

*Illustration of fat depots of SSAT (red), DSAT (green), and VAT (blue), varying in shape and size across the abdominal profile.*

*MultiRes Attention Deep Learning Approach for Abdominal Fat Compartment Segmentation… DOI: http://dx.doi.org/10.5772/intechopen.111555*

#### **1.1 Study proposition**

In our previous work on adipose fat depot segmentation, we proposed a patch-based 3D-ResUNet Attention [23] for fat depot segmentation. The patch-based framework failed to (i) handle different body compositions, such as lean and moderately obese, due to fixed patch sizes, and (ii) generalize to unseen abdominal region segmentation due to catastrophic forgetting of the network, anatomical differences, and class imbalance. **Figure 2** illustrates a few failed cases from our previous work. Hence, to overcome these drawbacks, we focused on enhancing MultiResUNet [22] by proposing a MultiRes-Attention U-Net architecture, with

i. a hybrid loss function to handle class imbalance, and

ii. attention gates for focused learning and improved prediction accuracy.

In this study, we also compare the performance of the proposed architecture against standard U-Net and MultiResUNet.

### **2. Materials and methods**

#### **2.1 MR data acquisition**

Data sets from 190 elderly Asians (aged >50 years, residing within the community) who participated in a study on the characterization of early sarcopenia to assess functional decline [24] were used in this work.

#### **Figure 2**

*Illustration of failed cases of our previous work on patch-based 3D-ResUNet attention vs. the proposed architecture.*

The MR abdominal scans were acquired using a 3D modified breath-hold T1-weighted Dixon sequence. Subjects were advised to hold their breath for 20 s during the scans. The scans were performed on a 3T Siemens Magnetom Trio MRI scanner with TR/TE/FA/bandwidth of 6.62 ms, 1.225 ms, 10°, and 849 Hz/pixel, respectively. The study group was mainly of Chinese (91.6%) ethnicity, with a mean age of 67.85 ± 7.90 years, BMI of 23.75 ± 3.65 kg/m<sup>2</sup>, and predominantly female (69.5%) subjects. As the study subjects were elderly, many had common comorbidities such as hypertension, diabetes, and hyperlipidemia. The National Healthcare Board reviewed the cohort study, and written consent was obtained from all subjects.

The data set can be considered heterogeneous as it included (i) subjects of different ages, (ii) scans covering different anatomical regions (thoracic, lumbar, and sacral), (iii) variations in fat accumulation in different compartments based on body composition, and (iv) acquisition variations such as image dimensions, slice thickness, and breathing/motion artifacts.

Manual ground truths were generated by radiology experts for 26 of the 190 data sets, covering the L1-L5 regions. The data with ground truths were subjected to MR-acquisition-based data augmentation to scale the number from 26 to 130 and create the training data sets.

#### **2.2 Fat segmentation**

A three-stage segmentation framework was envisaged to quantify abdominal fat depots: (i) a preprocessing stage, which included (a) arm region removal, (b) data augmentation to increase the number of data sets, and (c) conversion of 3D MR images into 2D slices; (ii) a segmentation stage, the "MultiRes-Attention U-Net" architecture for segmentation of abdominal regions into SSAT/DSAT/VAT (three-class) regions; and (iii) a postprocessing stage, with 2D-to-3D image reconstruction and fat depot quantification.

#### **2.3 Preprocessing**

All the training/testing data were subjected to a quality check to assess motion artifacts originating from breathing and fat-water swaps. An auto-check was developed to ensure that the training dataset slices matched the marked ground-truth slices. Arm region artifacts were removed automatically using the projection method [21]. Four different data augmentations were performed once before training to increase the total number of datasets: (i) random noise, (ii) random ghosting, (iii) random bias field, and (iv) blur augmentation [23]. Finally, the 3D MR scans were converted to 2D slices for training/testing the proposed deep learning architecture.
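As a concrete example of the augmentation step, the random-noise augmentation (the first of the four listed) might be sketched as below; the noise level, intensity range, and function name are illustrative assumptions, not the MR-specific transforms actually used [23].

```python
import numpy as np

def add_random_noise(slice_2d, std=0.02, rng=None):
    """Random Gaussian noise augmentation for a normalized 2D MR slice.
    `std` is an illustrative noise level, not the study's parameter."""
    rng = np.random.default_rng(rng)
    noisy = slice_2d + rng.normal(0.0, std, size=slice_2d.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep intensities in the normalized range

img = np.full((4, 4), 0.5)          # toy slice with uniform intensity
aug = add_random_noise(img, std=0.01, rng=0)
```

Each augmentation produces a new training sample with the same ground-truth mask, which is how the 26 annotated volumes were scaled to 130.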

#### **2.4 MultiRes-attention U-Net**

In a standard deep convolutional network, the input passes through multiple convolutions to obtain salient spatial features, which leads to the vanishing gradient problem. Architectures like ResNet [25] adopt summation of all preceding feature maps, which results in a memory-demanding network. DenseNet [26] introduces "dense connections," where each layer in the network is connected to every other layer instead of only to the previous layer as in a standard architecture, but it fails to handle the multi-scale issue. To handle the multi-scale nature of fat depots, which vary in shape and size, and to improve semantic segmentation in a memory-efficient manner, we propose MultiRes-Attention U-Net, a modified version of MultiResUNet with attention, which contains (i) a MultiRes block, (ii) a ResAtt-Path, and (iii) an attention gate model.

#### **2.5 MultiRes block**

The two sequential convolutional layers at each level of U-Net [24] are substituted with a proposed MultiRes block (similar to the dense block in DenseNet [26]) with a residual path (as in ResNet [25]), as shown in **Figure 3**. The MultiRes block contains Inception-like modules with parallel convolution filters of 3×3, 5×5, and 7×7 to capture spatial features at different scales. However, such modules are not memory efficient. To reduce memory, we factorized each large filter into a sequence of 3×3 filters with a gradual increase in the number of filters at each layer, as shown in **Figure 3**.
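The memory saving from factorizing large filters into chains of 3×3 convolutions can be checked with a quick parameter count (biases omitted; the channel count is an arbitrary example, not from the paper):

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a single k x k convolution (biases omitted)."""
    return k * k * c_in * c_out

c = 32  # illustrative channel count

# One 5x5 filter bank vs. a chain of two 3x3 banks (same receptive field)
p5 = conv_params(5, c, c)        # 25 * c * c weights
p33 = 2 * conv_params(3, c, c)   # 18 * c * c weights

# One 7x7 bank vs. a chain of three 3x3 banks
p7 = conv_params(7, c, c)        # 49 * c * c weights
p333 = 3 * conv_params(3, c, c)  # 27 * c * c weights
```

The 3×3 chains cover the same receptive fields with 28% and 45% fewer weights, respectively, which is the memory argument behind the factorization.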

#### **2.6 ResAtt-path**

The skip connections of the standard U-Net are modified into a ResAtt-Path by including non-linear convolution filters of 3×3 and a residual path with 1×1 filters. The number of 3×3 convolution filters reduces at each level of the encoding section of U-Net, as shown in **Figure 4**. The ResAtt-Path overcomes the drawback of U-Net's short connections, namely the direct merging of low-level and high-level features at the decoder.

#### **Figure 3.**

*Description of (a) MultiRes block, (b) ResAtt-path, and (c) attention gated block of the MultiRes-attention U-Net architecture.*

#### **Figure 4.**

*The ResAtt-Path connects the U-Net encoder at each level to the attention modules in the decoding section of U-Net.*

#### **2.7 Self-attention**

Soft attention gates (AGs), proposed by Oktay et al. [20], help the model focus on regions of interest by suppressing irrelevant location-based feature activations. AGs ensure that only salient spatial information is carried across the skip connection, which improves the network's performance in reducing false positives. A soft attention gate, as shown in **Figure 3**(c) and illustrated in Eq. (1), takes two inputs: (i) *Ip*, the lower-level block input, and (ii) *IR*, the ResAtt-Path input from the proposed skip connection layer. The *Ip* input is fed into a 1×1 convolution filter for upsampling to match the dimensions of the inputs, as illustrated in Eq. (2). The dimension-matched inputs *xattention* and *xupsampled* are combined and passed through ReLU and sigmoid activation functions to yield coefficients with values between 0 and 1.

Finally, these coefficients are upsampled through trilinear interpolation to generate the soft attention feature map, which is then multiplied by the ResAtt-Path's skip connection to produce the final output, as shown in Eq. (3).

$$x\_{attention} = \text{SoftAttention}\left(I\_p, I\_R\right) \tag{1}$$

$$x\_{upsampled} = \text{Upsample}\left(I\_p\right) \tag{2}$$

$$output = \text{ConvBlock}\left(\text{concat}\left(x\_{attention}, x\_{upsampled}\right)\right) \tag{3}$$
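Eqs. (1)–(3) can be sketched numerically as below. For clarity the 1×1 convolutions and trilinear upsampling are omitted and the gate is reduced to element-wise arithmetic on already dimension-matched inputs, so this is a toy illustration of the gating computation, not the trained layer:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_attention_gate(i_p, i_r):
    """Toy gate: combine the lower-level input i_p with the ResAtt-Path
    input i_r, squash to (0, 1), and use the result to weight i_r."""
    coeff = sigmoid(relu(i_p + i_r))  # attention coefficients in (0, 1]
    return coeff * i_r                # gated skip-connection features

i_p = np.array([0.5, -2.0, 1.0])   # lower-level block activations (toy values)
i_r = np.array([1.0, 1.0, 2.0])    # ResAtt-Path activations (toy values)
gated = soft_attention_gate(i_p, i_r)
```

Locations where the two inputs disagree (the middle element) are attenuated toward 0.5 or below, which is how irrelevant activations get suppressed before the decoder merge.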

#### **2.8 Loss function**

Segmentation model performance depends not only on the architecture of the network but also on the choice of the loss function [27], particularly in scenarios with high class imbalance. As we observed imbalance in the SSAT, DSAT, and VAT distributions, we identified the focal dice loss as an appropriate loss function for handling class imbalance. The focal dice loss incorporates the focal loss with *γ* = 0.5, Eq. (4), and the dice loss, Eq. (5), together, making it a robust loss function for imbalanced class problems. It uses weighted components for each class based on their representation.

$$\text{Focal loss} = -(1 - p\_t)^{\gamma} \log(p\_t) \tag{4}$$

$$\text{Dice loss} = 1 - \text{dice coefficient} = 1 - \frac{2\,|A \cap B|}{|A| + |B|} \tag{5}$$
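A minimal per-pixel sketch of how Eq. (4) and Eq. (5) might be combined is given below; the equal weighting of the two terms and the 0.5 binarization threshold are illustrative assumptions, not the paper's exact per-class weighting:

```python
import numpy as np

def focal_loss(p_t, gamma=0.5):
    """Eq. (4): down-weights well-classified pixels (p_t near 1)."""
    p_t = np.clip(p_t, 1e-7, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def dice_loss(pred, gt):
    """Eq. (5) on binary masks: 1 - 2|A intersect B| / (|A| + |B|)."""
    inter = np.sum(pred * gt)
    return 1.0 - 2.0 * inter / (np.sum(pred) + np.sum(gt) + 1e-7)

def focal_dice_loss(probs, gt, gamma=0.5):
    """Illustrative combination: mean focal term on the true-class
    probabilities plus the dice term on the thresholded masks."""
    hard = (probs > 0.5).astype(float)
    return focal_loss(probs, gamma).mean() + dice_loss(hard, gt)
```

The focal term keeps gradients flowing for hard, under-represented pixels, while the dice term scores region overlap regardless of class frequency, which is why the combination copes with the SSAT/DSAT/VAT imbalance.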

#### **2.9 Post processing**

Fat sub-region volumetric analysis and sub-region volume percentages are computed using Eqs. (6) and (7),

$$V\_r = \left(TP\_{ssat} + TP\_{dsat} + TP\_{vat}\right) \ast I\_r \ast 1000 \tag{6}$$

where *TPssat*, *TPdsat*, and *TPvat* correspond to the predicted voxel counts of the SSAT, DSAT, and VAT classes, and *Ir* corresponds to each subject's voxel resolution. Sub-region volume percentages are computed using Eq. (7), where *TPi* is the true positive volume of class *i*, and ∑*TPv* is the total volume of the fat region.


$$\%V\_c = \frac{TP\_i}{\sum TP\_v} \ast 100 \tag{7}$$
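Eqs. (6) and (7) amount to simple voxel-count arithmetic, sketched below; the factor 1000 is kept exactly as in Eq. (6), and the function names are illustrative:

```python
def fat_volume(tp_ssat, tp_dsat, tp_vat, voxel_res):
    """Eq. (6): V_r = (TP_ssat + TP_dsat + TP_vat) * I_r * 1000,
    where voxel_res (I_r) is the subject's voxel resolution and the
    factor 1000 performs the unit conversion used in the chapter."""
    return (tp_ssat + tp_dsat + tp_vat) * voxel_res * 1000

def volume_percentage(tp_i, tp_counts):
    """Eq. (7): share of one fat class in the total fat volume."""
    return 100.0 * tp_i / sum(tp_counts)
```

For example, with predicted counts of 10, 20, and 30 voxels and a voxel resolution of 0.001, Eq. (6) yields a total fat volume of 60, and a class holding 25 of 100 true-positive voxels occupies 25% of the fat region by Eq. (7).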

#### **2.10 Training parameters**

Single-contrast fat-only 3D MR Dixon scans were converted to 2D slices for training (approximately 8000 2D slices). Training was conducted on the Ubuntu 18.04 LTS operating system with an NVIDIA Titan X GPU card, with code written using the TensorFlow framework [28]. The hyperparameters of MultiRes-Attention U-Net are shown in **Table 1**.

#### **2.11 Performance analysis**

The multiclass Dice ratio (DR) and the Hausdorff distance were the two performance metrics used to evaluate the segmentation of the fat subregions comprising SSAT, DSAT, and VAT.

The similarity between predicted and ground truth segmentation results is assessed by measuring their overlap using the multiclass Dice score, as illustrated in Eq. (8).

$$\text{DSI}\_k = \frac{2\sum\left(I\_{pred}[I\_{gt} == k] == k\right)}{\sum\left(I\_{pred} == k\right) + \sum\left(I\_{gt} == k\right)} \tag{8}$$

where *DSIk* is the per-class DSI value ranging between 0 and 1, with 1 meaning complete overlap of the subregion, *Ipred* is the predicted output, *Igt* is the ground truth, and *k* is the class index.
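Eq. (8) reduces to counting overlapping labels, as in this small sketch on integer label maps (the convention of returning 1 when a class is absent from both maps is an assumption, not stated in the text):

```python
import numpy as np

def dice_per_class(pred, gt, k):
    """Eq. (8): Dice similarity index for class k on integer label maps."""
    pred_k = (pred == k)
    gt_k = (gt == k)
    denom = pred_k.sum() + gt_k.sum()
    if denom == 0:
        return 1.0  # class absent in both maps: treat as perfect overlap
    return 2.0 * np.logical_and(pred_k, gt_k).sum() / denom

# Toy 2x2 label maps (0: background, 1: SSAT, 2: DSAT, say)
pred = np.array([[0, 1], [2, 2]])
gt   = np.array([[0, 1], [1, 2]])
```

Here class 0 overlaps perfectly (DSI 1.0), while classes 1 and 2 each have one matching voxel out of three labeled, giving DSI 2/3.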

The Hausdorff distance (HD) is defined as a distance between two compact non-empty subsets of a metric space [30]. To measure the similarity between the predicted (Pred) and ground truth (GT) regions, the HD between two closed and bounded subsets of a given metric space M is defined as

$$HD(Pred, GT) = \max\left(h(Pred, GT), h(GT, Pred)\right) \tag{9}$$

$$h(Pred, GT) = \max\_{a \in Pred} dist(a, GT) \tag{10}$$

$$dist(a, GT) = \min\_{b \in GT} \mu(a, b) \tag{11}$$


#### **Table 1**

*Illustrating the hyperparameter values in training MultiRes-attention U-Net.*

where *HD*(*Pred*, *GT*) is the direct distance between the predicted region and the ground truth, *dist*(*a*, *GT*) is the distance from a point *a* to the region GT, and *μ*(*a*, *b*) is the point-to-point distance in the metric space. A smaller HD(*Pred*, *GT*) indicates better segmentation accuracy, i.e., a smaller mismatch area.
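Eqs. (9)–(11) can be computed directly on boundary point sets, assuming a Euclidean point metric *μ*; the brute-force sketch below is for illustration (production code would typically use an optimized routine such as `scipy.spatial.distance.directed_hausdorff`):

```python
import numpy as np

def directed_hd(a_pts, b_pts):
    """Eqs. (10)-(11): for each point in A, the distance to its nearest
    point in B (Euclidean metric); then take the worst case."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff(pred_pts, gt_pts):
    """Eq. (9): symmetric Hausdorff distance between boundary point sets."""
    return max(directed_hd(pred_pts, gt_pts), directed_hd(gt_pts, pred_pts))

# Toy boundary samples in 2D
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [3.0, 0.0]])
```

For these toy sets, the worst mismatch is the point (3, 0), which lies 2 units from its nearest counterpart, so HD(a, b) = 2; identical sets give HD = 0.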

#### **3. Results**

Accurate fat depot segmentation plays a significant role in evaluating fat distribution, which can serve as a biomarker for assessing metabolic syndrome and obesity. **Table 2** illustrates the training and testing Dice statistical index (DSI) (mean ± SD) for the 3-class (class 1: superficial subcutaneous fat, class 2: deep subcutaneous fat, class 3: visceral fat) segmentation accuracies of MultiRes-Attention U-Net, MultiResUNet, and the standard U-Net, trained with the focal dice loss function.

The Dice scores (**Table 2**) indicated that all the models show improved segmentation accuracy when trained with the focal dice loss function.

#### **4. Discussion**

The removal of the arm region is an important preprocessing step, as the arms contain SAT, which may interfere with automatic segmentation. MR-based data augmentation techniques were used to increase the number of training samples and improve the generalization of the model. In this study, we have proposed a MultiRes-Attention U-Net for the segmentation of three abdominal fat compartments, namely superficial subcutaneous fat, deep subcutaneous fat, and visceral fat. The algorithm took about 5 s to accurately segment and quantify all three fat compartments, reducing the required time significantly. This enables the use of our algorithm in clinical routines and large clinical trials.

Based on **Table 2**, the proposed algorithm performs better and provides a more accurate segmentation output than MultiResUNet due to the introduction of the AG module. The attention module improved the identification of significant features, such as the fascia boundary and smaller VAT components around the spine, and prevented the network from learning false positive information. The focal dice loss function was found to be more appropriate for improving the overall segmentation results than the cross-entropy (CE) and dice losses. Experimental results showed that the focal dice loss function could handle the inherent class imbalance (the amount of SSAT/DSAT/VAT in different slices) where the cross-entropy and dice loss functions failed. The mean DSI on the test dataset under the focal dice loss was about 97.81% for SSAT, 97.18% for DSAT, and 97.11% for VAT, improvements of 7%, 11%, and 23%, respectively, over the standard U-Net results. The AHD of the proposed architecture is slightly better than that of MultiResUNet and, compared to the standard U-Net, significantly better for all three classes (SSAT, DSAT, and VAT). In addition, the model was able to separate SAT into SSAT and DSAT in lean subjects (broken or invisible fascia) and obese subjects (multiple fasciae). As shown in **Figure 5**, the model was also able to differentiate between VAT and bones, especially in the spine and pelvic regions. Further, MultiRes-Attention U-Net was tested on 190 new data sets (unseen during training; upper and lower abdomen scans with different resolutions), as illustrated in **Figure 6**, which yielded accurate results for SSAT and DSAT but had a few false positives for VAT in the sacrum region.

#### **Table 2.**

*Performance comparison of models.*

#### **Figure 5.**

*Comparison of the predicted results of U-Net, MultiResUNet, and MultiRes-attention U-Net (loss function: focal dice) on low-, medium-, and high-fat subjects.*

#### **Figure 6.**

*Illustration of the predicted result of MultiRes-attention U-Net on a few selected samples of new 190 data sets (unseen during training; upper & lower abdomen scans with different resolution).*

#### **5. Conclusion**

In this study, we propose MultiRes-Attention U-Net with a hybrid loss function for the segmentation of superficial and deep subcutaneous adipose tissue (SSAT & DSAT) and visceral adipose tissue (VAT) from abdominal MR scans. The MultiRes block, ResAtt-Path, and attention gates can handle shape, scale, and heterogeneity in the abdominal data. Model performance also depends on the loss function, especially when the data are imbalanced; in this work, the focal dice loss function was found to be more appropriate than the cross-entropy (CE) and dice losses for improving the overall segmentation results. The proposed pipeline comprises pre-processing, data augmentation, automatic segmentation of fat compartments, and fat quantification. The algorithm takes less than 5 s for segmentation and quantification of the three fat compartments and provides generalizable results: the model was able to separate SAT into SSAT and DSAT in lean subjects (broken or invisible fascia) and obese subjects (multiple fasciae) and to differentiate small VAT tissue from bones, making it feasible for use in large clinical trials and clinical routine.

#### **Author details**

Bhanu K.N. Prakash<sup>1</sup>\*, Arvind Channarayapatna Srinivasa<sup>1</sup>, Ling Yun Yeow<sup>1</sup>, Wen Xiang Chen<sup>2</sup>, Audrey Jing Ping Yeo<sup>3</sup>, Wee Shiong Lim<sup>3</sup> and Cher Heng Tan<sup>2</sup>

1 Bioinformatics Institute (BII), Agency of Science, Technology and Research (A\*STAR), Singapore, Republic of Singapore

2 Department of Diagnostic Radiology, Tan Tock Seng Hospital, Singapore, Republic of Singapore

3 Department of Geriatric Medicine, Tan Tock Seng Hospital, Singapore, Republic of Singapore

\*Address all correspondence to: bhanu\_prakash@bii.a-star.edu.sg

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] World Health Organization. Obesity and overweight fact sheet. Available from: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight

[2] Tremmel M, Gerdtham UG, Nilsson PM, Saha S. Economic Burden of Obesity: A Systematic Literature Review. International Journal of Environmental Research and Public Health. 2017 Apr 19; **14**(4):435. DOI: 10.3390/ijerph14040435

[3] Brons C, Grunnet LG. Mechanisms in endocrinology: Skeletal muscle lipotoxicity in insulin resistance and type 2 diabetes: A causal mechanism or an innocent bystander? European Journal of Endocrinology. 2017;**176**:R67-R78. DOI: 10.1530/EJE-16-0488

[4] St-Pierre J, Lemieux I, Vohl MC, Perron P, Tremblay G, Despres JP, et al. Contribution of abdominal obesity and hypertriglyceridemia to impaired fasting glucose and coronary artery disease. The American Journal of Cardiology. 2002; **90**:15-18

[5] Chan JM, Rimm EB, Colditz GA, Stampfer MJ, Willett WC. Obesity, fat distribution, and weight gain as risk factors for clinical diabetes in men. Diabetes Care. 1994;**17**:961-969

[6] Seabolt LA, Welch EB, Silver HJ. Imaging methods for analyzing body composition in human obesity and cardiometabolic disease. Annals of the New York Academy of Sciences. 2015; **1353**:41-59. DOI: 10.1111/nyas. 12842

[7] Baum T, Cordes C, Dieckmeyer M, Ruschke S, Franz D, Hauner H, et al. MRbased assessment of body fat distribution and characteristics. European Journal of Radiology. 2016;**85**:1512-1518. DOI: 10.1016/j.ejrad.2016.02.013

[8] Schar M, Eggers H, Zwart NR, Chang Y, Bakhru A, Pipe JG. Dixon water-fat separation in PROPELLER MRI acquired with two interleaved echoes. Magnetic Resonance in Medicine. 2016;**75**:718-728. DOI: 10.1002/mrm.25656

[9] Positano V, Gastaldelli A, Sironi AM, Santarelli MF, Lombardi M, Landini L. An accurate and robust method for unsupervised assessment of abdominal fat by MRI. Journal of Magnetic Resonance Imaging. 2004;**20**:684-689. DOI: 10.1002/jmri.20167

[10] Demerath EW, Ritter KJ, Couch WA, Rogers NL, Moreno GM, Choh A, et al. Validity of a new automated software program for visceral adipose tissue estimation. International Journal of Obesity. 2007;**31**:285-291

[11] Kullberg J, Angelhed JE, Lonn L, Brandberg J, Ahlstrom H, Frimmel H, et al. Whole-body T1 mapping improves the definition of adipose tissue: Consequences for automated image analysis. Journal of Magnetic Resonance Imaging. 2006;**24**:394-401. DOI: 10.1002/jmri.20644

[12] Chew J, Yeo A, Yew S, Tan CN, Lim JP, Hafizah Ismail N, et al. Nutrition Mediates the Relationship between Osteosarcopenia and Frailty: A Pathway Analysis. Nutrients. 2020 Sep 27;**12**(10): 2957. DOI: 10.3390/nu12102957


#### **Chapter 5**

## Deep Learning for Natural Language Processing

*Yuan Wang, Zekun Li, Zhenyu Deng, Huiling Song and Jucheng Yang*

#### **Abstract**

With the constantly growing number of topical or sentiment-bearing texts and dialogs on the Web, the demand for automatic language and text analysis algorithms continues to expand. This chapter discusses advanced deep learning techniques for classical and currently active research directions in natural language processing, including text classification, sentiment analysis, and task-oriented dialog systems. In text classification, we focus on multi-label text classification and extreme multi-label text classification, which automatically annotate texts with their most relevant labels. In sentiment analysis, we look into aspect-based sentiment analysis, which extracts fine-grained sentiment information from texts, and multimodal sentiment analysis, which classifies people's opinions or attitudes from multimedia data through fusion techniques. In dialog systems, we introduce how deep learning techniques work in pipeline mode and end-to-end mode for task-oriented dialog systems. The chapter reviews the rapidly evolving state of research on these three topics, identifies trends in deep learning research for natural language processing, and discusses likely future advances.

**Keywords:** deep learning, text classification, sentiment analysis, task-oriented dialog system, tasks and models

#### **1. Introduction**

Deep learning has become increasingly important due to the rapid growth of Internet content and the urgent demands that big data places on natural language processing (NLP).

The text classification task is one of the most fundamental scenarios in natural language processing (NLP): the user enters a text, and the model assigns the input text to predefined categories. Text classification tasks can be divided into multi-class text classification, multi-label text classification, hierarchical text classification, and extreme multi-label text classification. In the multi-class setting, there are two or more label categories in the label set, and each sample has exactly one relevant label. In the multi-label text classification (MLTC) setting, a sample may have one or more relevant labels. Hierarchical text classification is a special multi-class or multi-label task in which the labels have a hierarchical relationship between them. The extreme multi-label text classification task (XMTC) annotates the most relevant labels for a text from a large label set with millions, or even billions, of labels. A limitation of traditional models is that words are treated as independent features out of context; deep learning methods have had great success in related fields by automatically extracting context-sensitive features from raw text. Text classification techniques can be applied to problem classification [1], topic classification [2], and emotion classification [3]. Text classification tasks can also be grouped by target domain, such as recommendation systems, the legal domain, and ad placement. In the field of recommendation systems, the task is to predict how much a user prefers a particular item. In the legal field, MLTC is used to predict the final outcome of bills. In ad placement, personalized ads are tailored to users by inferring their characteristics and personal interests from social media.

Sentiment analysis refers to mining people's opinions and emotional attitudes toward various matters through modal information such as texts and images. In the early days, sentiment analysis was mainly used to analyze user reviews of products sold online and thus to confirm user preferences when purchasing products. With the popularity of self-publishing nowadays, sentiment analysis is more often used to identify the sentiment of topic participants, to mine the value of topics, and to analyze related public opinion. Sentiment analysis has important application value for both society and individuals.

The dialog system relies on deep learning technology to act as an assistant that talks or chats with people. Task-oriented dialog systems are used to solve specific problems in specific fields, such as movie ticket reservation, restaurant table reservation, etc. Because of their huge commercial value, they have attracted more and more attention.

This chapter is organized as follows: Section 2 discusses advancement in text classification, Section 3 outlines the sentiment analysis, Section 4 presents the task-oriented dialog system, and finally, Section 5 concludes the chapter.

#### **2. Advancement in text classification**

#### **2.1 Multi-label text classification**

There are three problems in the MLTC setting. First, obtaining comprehensive supervisory information is time-consuming and labor-intensive. Second, the lack of theoretical support for the interpretability of deep learning also needs to be addressed. Third, modeling label dependencies is a major difficulty (**Figure 1**).

Multi-label text classification includes text pre-processing, text representation using feature engineering, and a classifier. Text pre-processing is a series of operations on the original text, including word segmentation, cleaning, normalization, and so on. Text representation processes words into vectors or matrices so that computers can process them. Feature engineering is divided into heuristic, machine learning-based, and deep learning-based methods. Deep learning-based approaches can be divided into text-based representations [4] and interactive representations [4] based on text and labels, depending on whether the model introduces label information to represent the text.

**Figure 1.** *Deep learning in multi-label text classification.*

#### *2.1.1 Text representation*

Deep learning-based approaches can be divided into text-based representations [4] and interactive representations [4]. Text-based representations focus on converting text into a machine-understandable form for subsequent natural language processing tasks. Interactive representations, on the other hand, focus on modeling dialog history and context to better understand the current dialog by considering different sentences in the dialog history and changes in user intent. It should be noted that text-based and interactive representations are not mutually exclusive and can be used in combination. In some tasks, text-based representations can first convert individual texts into representation vectors, which are then combined with interactive representations to take contextual information into account for more accurate and comprehensive text comprehension and processing. Among text-based representations, TextCNN [5] applies convolutional neural networks and uses multiple kernels of different sizes to extract key information from sentences. Among interactive representations, LEAM [6] establishes a semantic interaction matrix between texts and labels to obtain attention weights and thereby the most relevant labels.

#### *2.1.2 Deep learning models*

Deep learning-based text representation automatically acquires textual information using word vector models and neural network models.

Word vector models based on distributed representations map vectors from a high-dimensional space to a low-dimensional space, alleviating the problem of feature sparsity. Commonly used word vectors include the static word vectors word2vec [7] and global vectors for word representation (GloVe) [8], and dynamic word vector models such as embeddings from language models (ELMo) [9] and bidirectional encoder representations from transformers (BERT) [10]. Word2vec can be further subdivided into CBOW [7] and skip-gram. The input to CBOW [7] is the vectors of the words neighboring a central word, and the output is the vector of that central word. The input to the skip-gram model is the vector of the central word, and the output is the vector representation of the words surrounding that central word; skip-gram generally performs better than CBOW. GloVe [8] combines a statistical co-occurrence matrix with a sliding window, taking into account both local and global information: first, the co-occurrence matrix is constructed from the corpus, and second, the relationship between the word vectors and the co-occurrence matrix is modeled. ELMo [9] has a three-layer structure, with the first layer being word2vec or GloVe and the next two layers being bidirectional long short-term memory (Bi-LSTM) layers that extract contextual word features, effectively addressing the problem of words with multiple meanings. BERT uses the transformer as its main framework to capture bidirectional relations in utterances and, in terms of training tasks, constructs a masked language model and next-sentence prediction as targets for multi-task training.
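To make the input/output direction of the two word2vec variants concrete, the following sketch (a toy illustration of training-pair construction, not the word2vec training code itself; function names are ours) shows what each variant learns to predict:

```python
def cbow_pairs(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))            # (input context, output center)
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each surrounding word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (input center, output context)
    return pairs

tokens = "deep learning for natural language processing".split()
print(cbow_pairs(tokens, window=1)[0])       # (['learning'], 'deep')
print(skipgram_pairs(tokens, window=1)[0])   # ('deep', 'learning')
```

Note that skip-gram produces one training pair per context word, so it generates more (and more fine-grained) updates than CBOW from the same corpus, which is consistent with its generally better performance on rare words.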

Common neural network models include convolutional neural networks (CNN) [11], recurrent neural networks (RNN) [12], long short-term memory networks (LSTM) [1], and attention mechanisms [13]. CNN sets different convolutional kernels to extract local contextual information from the text and stacks multiple convolutional and pooling layers to capture deeper textual information. In detail, the input layer obtains low-dimensional word vectors, the convolution layer extracts local information from the text, and the pooling layer reduces the feature dimension and prevents overfitting. Finally, the text and label dimensions are unified by the fully connected layer, and the softmax layer normalizes the result to obtain probabilities. RNN uses time-series memory of history information to obtain a representation of the text content, accepting text sequences of arbitrary length and generating a fixed-length vector. However, gradient vanishing or explosion prevents RNN from effectively learning long-term dependencies and correlations. To solve this long-term dependency problem, LSTM adds forget gates, input gates, and output gates to RNN, avoiding gradient vanishing or explosion. The methods above assign the same weight to all words and cannot distinguish their importance. Inspired by human attention, the attention mechanism is introduced to focus on key information and content, making it easy for models to concentrate on the weighted parts and improve classification accuracy. Attention mechanisms are usually divided into three categories: local attention, global attention, and self-attention. Global attention considers all words in the text, assigning weights between 0 and 1 to obtain the text representation. Local attention assigns a weight of either 0 or 1 to each word, discarding some irrelevant items directly. Self-attention assigns weights based on the interaction of input words, which gives it the advantage of parallel computing in long text classification.
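The self-attention computation just described can be sketched in a few lines; this toy version (our own simplification) omits the learned query/key/value projections and the multiple heads of a full transformer layer:

```python
import numpy as np

def self_attention(X):
    """Each row of X is a word vector; attention weights come from
    pairwise word-word interactions (scaled dot products)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise interaction scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax: each row sums to 1
    return w @ X, w                                   # weighted mix of all word vectors

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three toy word vectors
out, weights = self_attention(X)                      # out.shape == (3, 2)
```

Because every word's output is computed from all words at once with matrix products, the whole sequence can be processed in parallel, which is the advantage over recurrent models noted above.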

In conclusion, both word vector models and neural network models are important components of deep learning-based text representation techniques, and they each have their own advantages and can be selected according to the needs of specific tasks. Word vector models focus more on the static representation of words, while neural network models are better able to capture the dynamic information of the context. Word vector models are relatively fast to train, while neural network models usually require larger computational resources and longer training time. Neural network models may perform better on some complex tasks, but for some simple tasks, word vector models are effective enough.

#### **2.2 Extreme multi-label text classification**

Extreme multi-label text classification learns a classifier that tags a document with the most relevant subset of labels from a very large label set. The main challenge is the scale: millions of labels, features, and training points. Current research architectures in extreme multi-label text classification can be divided into four main categories: one-vs-all models, embedding-based models, tree-based models, and deep learning models. Due to the high computational costs brought by large-scale label sets, existing MLTC techniques have difficulty solving the XMTC problem. The extreme-label setting is thus trapped in a large label space and feature space, leading to two pressing problems. The first is the power-law distribution of labels: long-tailed labels have very little associated data, making it difficult to obtain dependencies between labels and causing data sparsity and scalability issues in extreme text classification. The second is that computation is expensive, although comparable results can be obtained at lower cost using data augmentation techniques. One-vs-all models train a separate classifier for each label on the entire dataset. They usually classify well and with high accuracy; however, they assume that the individual labels are independent of each other

#### *Deep Learning for Natural Language Processing DOI: http://dx.doi.org/10.5772/intechopen.112550*

and uncorrelated, resulting in a cost that grows linearly with the number of labels. Embedding-based models typically use the relationships between labels to map them from a high-dimensional space to a low-dimensional space via a linear matrix mapping, reducing the total number of model parameters and the training time required. The limitation of the embedding approach is that it ignores the correlation between input and output, resulting in unaligned embeddings of the two. Tree-structured models are trained to produce instance or label trees to make predictions, such as decision trees, random forests, Huffman trees, etc. Traditional tree-based approaches can harm performance due to large tree heights and large cluster sizes.

All three types of models mentioned above are based on bag-of-words representations of text, where words are treated as independent features out of context and deep semantic information cannot be captured. In contrast, deep learning models can automatically extract implicit contextual features from raw text for extreme multi-label text classification.

Typical work such as XML-CNN [14] first explored the application of deep learning to XMTC, proposing a series of CNN models that use convolutional layers and a dynamic max-pooling layer to extract semantic features of text, and introducing a hidden bottleneck layer to reduce model parameters and accelerate training; however, XML-CNN [14] cannot capture the most important subtext for each label. AttentionXML [15] solves this problem with two techniques: first, a multi-label attention mechanism is introduced to capture the parts of the text most relevant to each label; second, a shallow and wide probabilistic label tree is built to handle millions of labels. LightXML [16] adopts BERT as a text encoder and obtains a better text representation, making it a state-of-the-art extreme multi-label text classification model. DeepXML [17] is a framework that decomposes XMTC into four subtasks; optimizing these subtasks with different components yields a series of algorithms, including Astec [17], DECAF [18], GalaXC [19], and ECLARE [20]. Astec [17] uses label clustering to obtain intermediate feature representations. DECAF [18] jointly learns model parameters and feature representations to exploit label metadata. GalaXC [19] introduces a label attention mechanism to make more accurate predictions based on multi-resolution node embeddings given by a graph. ECLARE [20] enables collaborative learning using label-label correlations.
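The multi-label attention idea behind AttentionXML can be sketched as follows (a minimal illustration with toy dimensions of our own choosing, not the paper's implementation): each label carries its own query vector and extracts its own label-specific summary of the token sequence, so different labels can focus on different parts of the same text:

```python
import numpy as np

def label_attention(H, label_queries):
    """H: (seq_len, d) token representations; label_queries: (L, d).
    Each label attends to the tokens most relevant to it."""
    scores = label_queries @ H.T                       # (L, seq_len) relevance per label
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # per-label softmax over tokens
    return w @ H                                       # (L, d) label-specific summaries

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))             # 6 tokens, hidden size 4
queries = rng.normal(size=(3, 4))       # 3 labels, one query vector each
reps = label_attention(H, queries)      # one text representation per label
```

Each row of `reps` would then feed a per-label binary classifier; the probabilistic label tree in AttentionXML exists to keep this tractable when L runs into the millions.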

In summary, one-vs-all models are simple and intuitive and can be used flexibly with a variety of binary classification algorithms, but they ignore the correlation between labels, which may lead to inaccurate classification. Embedding-based models capture semantic information but do not directly model label correlations. Tree-based models can handle high-dimensional, nonlinear data and capture correlations between nested features and labels. Deep learning models can learn complex feature representations and contextual correlations and are suitable for large-scale data and complex tasks.

#### **3. Advancement in sentiment analysis**

This section introduces aspect-based sentiment analysis (ABSA) and multimodal sentiment analysis, two tasks within sentiment analysis, a classical problem in natural language processing. We mainly cover deep learning techniques for sentiment analysis, since they outperform earlier machine learning methods and are the mainstream approaches in the field.

#### **3.1 Aspect-based sentiment analysis**

The concept of ABSA was first introduced in 2010 by Thet et al. [21], and Liu [22] further defined the field in 2012: sentiment analysis and opinion mining is the field of research that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. From 2014 to 2016, SemEval, an international semantic evaluation conference, included the ABSA task as one of its subtasks and provided a series of manually annotated benchmark datasets [23, 24]. In recent years, the aspect-based sentiment analysis task has received attention from many scholars, especially after the rapid application of deep learning and related technologies in data mining, information retrieval, and intelligent question answering. Research on deep learning-based aspect-based sentiment analysis has therefore continued to achieve breakthroughs [25–29], and the ABSA task has gradually become one of the popular research topics in NLP (**Figure 2**).

The main advantage of aspect-based sentiment analysis is that it is fine-grained. Coarse-grained sentiment analysis can often capture only a single, one-sided sentiment tendency and cannot analyze details at the level of each attribute. A review text often contains sentiment toward different evaluation objects, for example, "the service of this restaurant is good, but the taste is bad." This review evaluates the two aspects of "service" and "taste" separately, and document-level and sentence-level sentiment analysis cannot mine each aspect individually. Therefore, aspect-based sentiment analysis is needed for review texts that contain multiple aspects [30, 31].
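To make the restaurant example concrete, a deliberately naive rule-based sketch (toy lexicons of our own, not one of the deep models discussed below) pairs each aspect term with its nearest opinion word:

```python
# Toy lexicons for illustration only; real ABSA models learn these associations.
ASPECTS = {"service", "taste", "price"}
POLARITY = {"good": "positive", "great": "positive",
            "bad": "negative", "awful": "negative"}

def absa(sentence):
    """Assign each aspect term the polarity of its nearest opinion word."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    result = {}
    for i, tok in enumerate(tokens):
        if tok in ASPECTS:
            nearest = min((j for j, t in enumerate(tokens) if t in POLARITY),
                          key=lambda j: abs(j - i), default=None)
            if nearest is not None:
                result[tok] = POLARITY[tokens[nearest]]
    return result

print(absa("The service of this restaurant is good, but the taste is bad."))
# {'service': 'positive', 'taste': 'negative'}
```

Even this crude heuristic recovers the two opposing aspect-level sentiments that a single document-level polarity score would collapse into one label; the deep models below replace the distance heuristic with learned attention over context.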

Sentiment analysis methods based on deep learning can be divided into four main types: methods using a single neural network, methods using a hybrid neural network, methods introducing attention mechanisms, and methods using pre-trained models.

The main single-neural-network methods for sentiment analysis introduce a series of neural network models [32, 33] (e.g., CNN, RNN, etc.). CNN is mainly used to extract local features of text data, abstracting low-dimensional vectors into vector representations with high-level semantics through operations such as convolution and pooling, and then processing the encoded representations to output results. Lu et al. [34] made full use of syntactic relations and sentiment dependency information and proposed an aspect-gated graph convolutional network (AGGCN) for aspect-based sentiment analysis. Liang et al. [35] exploited dependency syntactic knowledge and designed a dependency-embedded graph convolutional network applied to end-to-end sentiment analysis. Wang et al. [36] proposed a new unified location-aware convolutional neural network (UP-CNN) to address the difficulty of fully utilizing aspect location information.

**Figure 2.** *The working effect of ABSA.*

In ABSA tasks, attention mechanisms have received considerable interest and have been actively used, because information in different parts of the text matters differently for aspect-based sentiment analysis and attention mechanisms can adaptively identify and emphasize key information [37–40]. Liao et al. [41] use the bidirectional transformer-based RoBERTa model to extract features from text and aspect-word strings and use a cross-attention mechanism to attend to the features most relevant to a given aspect category.

#### **3.2 Multimodal sentiment analysis**

With the rapid development of information and network technology and the widespread use of mobile terminals, the content people publish is gradually diversifying. Messages about different events and topics are no longer limited to a single text form; people tend to publish multimodal content combining text and images to express their feelings and opinions. This trend has drawn academic attention to multimodal sentiment analysis research: by analyzing the sentiment tendencies implied by multimodal data, it offers great application value in box office prediction, product marketing, political elections, product recommendation, mental health analysis, etc. Multimodal sentiment analysis has therefore become a hot research topic in recent years [42, 43]. Multimodal sentiment analysis combines documents that describe the same thing in different forms (e.g., sound, image, text, etc.) to enrich our perception of the thing and analyze the sentiment it expresses. In academic research, the term modality is generally associated with the sensory modalities that represent our primary communication and sensory channels; when a research question or dataset contains multiple modalities, it is characterized as a multimodal task or multimodal dataset. In general, academics have focused on (but are not limited to) three modalities: (1) natural language, both spoken and textual; (2) visual signals, often represented by images or videos; and (3) acoustic signals, such as intonation and audio. Multimodal learning is a dynamic multidisciplinary field that is breaking new ground in tasks such as multimodal sentiment analysis, cross-modal retrieval, image captioning, audiovisual speech recognition, and visual question answering (**Figure 3**).

**Figure 3.** *The working effect of MSA.*

Multimodal sentiment analysis makes full use of data from different modalities for accurate sentiment prediction. In 2016, a cross-modality consistent regression (CCR) model was proposed [44]. Its authors assumed that the overall sentiment of the text modality (including descriptions and captions of images) and the image modality, as well as of the multimodal whole, is the same, and learned visual features using CNNs, outperforming unimodal models. In the same year, a tree-structured recursive neural network (TreeLSTM) [45] was proposed that uses a tree structure and incorporates visual attention mechanisms. The system builds a structure based on sentence parsing, aimed at aligning text words and image regions for accurate analysis, and incorporates LSTM and attention mechanisms to learn a robust joint visual-text representation, achieving the best results at the time. In addition, image-text mismatch and defects in social media data, such as spoken words, misspellings, and missing punctuation, pose challenges for sentiment analysis of multimodal data. To address these challenges, in 2017, Xu et al. constructed several multimodal sentiment analysis networks, such as the hierarchical semantic attentional network (HSAN) [46] and the multimodal deep semantic network (MultiSentiNet) [47]. HSAN focuses on image captions and proposes a hierarchical semantic network model that uses image captions to extract visual semantic features as additional information for text. MultiSentiNet, on the other hand, extracts image features from both objects and scenes and proposes a visual-feature-guided attentional long short-term memory network to extract words that contribute to understanding text sentiment, aggregating these words with the visual semantic features of objects and scenes. In 2018, a novel co-memory network (CoMN) [48] was proposed, which models the interdependence between vision and text through memory networks to fully consider the interrelationship of multimodal data. In 2020, the multi-view attentional network (MVAN) [49] utilized a continuously updated memory network to obtain deep semantic features of images and texts.
The authors found that existing multimodal sentiment analysis datasets generally label only positive, negative, and neutral sentiment polarities and lack image-text multimodal datasets for more detailed sentiment classification, so they constructed a large-scale image-text multimodal dataset (TumEmo) from social media data. Cheema proposed a simple and effective multimodal neural network, the sentiment multi-layer neural network (Se-MLNN) [50], which uses RoBERTa to extract contextual text features and multiple high-level image features from several perspectives, fusing the features to accurately predict overall sentiment.
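The fusion step shared by many of the models above can be sketched minimally as feature-level concatenation (toy feature values standing in for assumed encoder outputs); real systems add attention or memory networks on top of this joint vector:

```python
import numpy as np

def fuse(text_feat, image_feat, audio_feat):
    """Naive feature-level fusion: concatenate per-modality feature vectors
    into one joint representation for a downstream sentiment classifier."""
    return np.concatenate([text_feat, image_feat, audio_feat])

text_feat = np.array([0.2, 0.9])    # toy output of a text encoder
image_feat = np.array([0.5, 0.1])   # toy output of an image CNN
audio_feat = np.array([0.3])        # toy acoustic feature (e.g., intonation)
joint = fuse(text_feat, image_feat, audio_feat)   # shape (5,)
```

The limitation of plain concatenation, which motivates the attention- and memory-based fusion schemes above, is that it treats modalities independently and cannot model their interdependence.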

#### **4. Advancement in task-oriented dialog system**

This section introduces the task-oriented dialog system, covering both pipeline mode and end-to-end mode (**Figure 4**).

**Figure 4.** *Task-oriented dialog system.*

#### **4.1 Pipeline mode**

A task-oriented dialog system aims to process user messages accurately and places fairly strict requirements on its responses. The pipeline approach therefore generates responses in a controllable way. It is mainly divided into four parts: natural language understanding, dialog state tracking, dialog policy learning, and natural language generation. The natural language understanding module converts the original user messages into semantic slots and classifies the domain and user intention. The dialog state tracking module iteratively calibrates the dialog state based on the current input and the dialog history; the dialog state includes relevant user actions and slot-value pairs. The dialog policy learning module takes the calibrated dialog state and decides the next action of the dialog agent. Finally, the natural language generation module converts the selected dialog action into natural language feedback for the user. For example, in a movie ticket reservation task, the agent interacts with a movie knowledge base to retrieve movie information under specific constraints [51], such as movie name, time, cinema, etc.
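The four pipeline stages can be chained end to end as in the following sketch; all names, slots, and rules here are hypothetical rule-based stand-ins, where a real system replaces each stub with a trained model:

```python
def nlu(utterance):
    """Natural language understanding: intent + slots from the user message."""
    slots = {}
    if "inception" in utterance.lower():
        slots["movie"] = "Inception"
    if "7pm" in utterance.lower():
        slots["time"] = "7pm"
    return {"intent": "book_ticket", "slots": slots}

def dst(state, nlu_out):
    """Dialog state tracking: merge this turn's slots into the running state."""
    state = dict(state)
    state.update(nlu_out["slots"])
    return state

def policy(state):
    """Dialog policy: request the next missing slot, or confirm the booking."""
    for slot in ("movie", "time", "cinema"):
        if slot not in state:
            return ("request", slot)
    return ("confirm", None)

def nlg(action):
    """Natural language generation: turn the chosen action into a response."""
    act, slot = action
    return f"Which {slot} would you like?" if act == "request" else "Booking confirmed!"

state = {}
state = dst(state, nlu("Two tickets for Inception at 7pm"))
print(nlg(policy(state)))   # asks for the still-missing cinema slot
```

The controllability of the pipeline comes from exactly this decomposition: each intermediate output (intent, state, action) is inspectable and constrained before a response is generated.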

#### *4.1.1 Natural language understanding*

Natural language understanding has a significant impact on the response quality of the whole system; it converts user-generated natural language messages into semantic slots and classifies them. Three tasks are involved: domain classification, intention detection, and slot filling. Domain classification determines to which particular domain or topic the user input belongs. It categorizes the user's text into predefined domains, such as hotel booking, flight inquiry, weather information, etc. By identifying the subject domain, the input can be passed to the appropriate processing module for further parsing. Intention detection determines the user's intent or purpose within a particular domain. It focuses on the purpose behind the user's input rather than just the input text itself. For example, in the hotel booking domain, a user may have different intentions, such as finding a hotel, booking a hotel, or canceling a booking. The goal of intent recognition is to identify the user's specific intent so that the system can take the appropriate action or provide the correct response. Slot filling identifies and extracts key information from the user input that is relevant to the specific domain. Slots are usually parameters or variables related to the intent, such as date, location, person's name, price, etc. Through slot filling, the system can capture and record the specific information the user provides in a particular domain. For example, in the hotel reservation domain, slots may include check-in date, check-out date, location, room type, etc.
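Putting the three subtasks together, the NLU output for one hotel-booking utterance might look like the following structured parse (domain, intent, and slot names are illustrative assumptions, not a fixed schema):

```python
# Hypothetical NLU result for a single utterance, one field per subtask.
parse = {
    "utterance": "Book a double room in Paris for Friday",
    "domain": "hotel_booking",   # domain classification
    "intent": "book_hotel",      # intention detection
    "slots": {                   # slot filling
        "room_type": "double",
        "location": "Paris",
        "check_in_date": "Friday",
    },
}
```

Downstream modules consume exactly this structure: the dialog state tracker merges `slots` into the running state, and the policy acts on `domain` and `intent`.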

Domain classification and intent detection are both classification tasks. Deep learning approaches to domain and intent classification of dialogs include building a deep convex network [52], which combines the prediction of a prior network with the current dialog as the overall input of the current network. To overcome the difficulty of using deep neural networks to predict domains and intentions, some scholars used restricted Boltzmann machines and deep belief networks to derive the parameters that initialize deep neural networks [53]. To exploit the advantages of recurrent neural networks (RNN) in sequence processing, other work used recurrent neural networks as dialog encoders to predict intentions and domain categories [54]. Some scholars have proposed a short-text intention classification model: because a single conversation turn lacks information, it is difficult to identify the intention of short phrases, so an RNN or CNN structure is used to fuse the dialog history and obtain context information as additional input for the current turn [55]. This model achieved good performance on intention classification tasks. Recently, BERT pre-trained on task-oriented dialogs has achieved high accuracy in intention detection and can effectively alleviate the problem of data shortage in specific domains.

Slot filling, also known as semantic tagging, is a sequence labeling problem in which the model must predict multiple targets at the same time. Deep belief networks show good ability in deep structure learning, and some scholars built a sequence tagger based on them; in addition to the named-entity-recognition input features used in traditional taggers, they also included part-of-speech and syntactic features as part of the input. Recurrent structures are beneficial to sequence labeling tasks because they can track information along past time steps to maximize the use of sequence information. Some scholars first proposed that RNN language models can be applied to sequence tagging rather than simply predicting words [56]: at the output end of the RNN, sequence labels corresponding to the input words are emitted instead of ordinary words. Some scholars further studied the impact of different recurrent structures on slot filling tasks and found that all RNN models are superior to the simple conditional random field method [57]. Because the shallow output representation of traditional semantic annotation lacks the ability to express structured dialog information, the slot filling task has also been framed as a template-based tree decoding process that iteratively generates and fills templates [58].
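The output side of slot filling is commonly a BIO tag sequence (B-slot, I-slot, O), one tag per token. In the sketch below the hand-written tags stand in for a neural tagger's predictions; the function shows how such a sequence is decoded into slot values.

```python
# Decode a BIO tag sequence into a slot dictionary. The token/tag
# example is invented; in practice the tags come from an RNN tagger.

def decode_bio(tokens, tags):
    """Collect B-/I- tagged token spans into a slot dictionary."""
    slots, name, span = {}, None, []
    # A sentinel ("", "O") pair flushes any span still open at the end.
    for tok, tag in list(zip(tokens, tags)) + [("", "O")]:
        if tag.startswith("B-") or tag == "O":
            if name:                       # close the previous span
                slots[name] = " ".join(span)
            name, span = (tag[2:], [tok]) if tag.startswith("B-") else (None, [])
        elif tag.startswith("I-") and name == tag[2:]:
            span.append(tok)               # continue the current span
    return slots

tokens = ["book", "a", "room", "in", "New", "York", "on", "May", "3"]
tags   = ["O", "O", "O", "O", "B-location", "I-location", "O", "B-date", "I-date"]
slots = decode_bio(tokens, tags)
```

Multi-token values such as "New York" are recovered by joining a B- tag with its following I- tags of the same slot name.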

#### *4.1.2 Dialog state tracking*

Dialog state tracking (DST) is the first module of the dialog manager. At each turn, it tracks the user's goals and relevant details over the entire dialog history, providing the policy learning module with the information needed for decision-making. Natural language understanding and dialog state tracking are closely related: both need to fill slots with dialog information [59]. However, they play two different roles. The natural language understanding module classifies the current user message, through intent recognition and domain recognition, and determines the slot to which each token of the message belongs.

The first line of work treats DST as a multi-class classification task: the tracker predicts the correct class for each slot from a set of candidate values. Some scholars used an RNN as a neural tracker to obtain a perception of the dialog context [60]; the tracker finally makes a binary prediction for the current slot-value pair based on the dialog history. The second line of work, neural trackers with unfixed slot names and values, attracts more attention because it not only reduces the model and time complexity of DST tasks but also helps to train task-oriented dialog systems end-to-end. Some scholars proposed the belief span, that is, the text span in the dialog context that corresponds to a specific slot [61]. They built a two-stage CopyNet that copies and stores slot values from the dialog history in preparation for the neural response. The belief span facilitates end-to-end training of the dialog system and improves out-of-vocabulary tracking accuracy. Building on this, some scholars proposed the minimal belief span, since generating belief state domains from scratch does not scale when the system interacts with APIs from different sources [62]. Some scholars proposed the TRADE model, which also applies the copy mechanism and uses a soft-gated pointer-generator to generate slot values from the encoded dialog context conditioned on domain-slot pairs [63].
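At its simplest, the belief state maintained by a tracker is a mapping from domain-slot pairs to values, updated turn by turn. The following plain-Python sketch, with hand-written per-turn NLU outputs in place of a neural model, illustrates this accumulation, including a later turn overriding an earlier value.

```python
# Minimal dialog state tracker: each turn's slot-value pairs (here
# invented stand-ins for neural NLU output) are merged into the
# belief state, with later turns overriding earlier ones.

def track(belief, turn_slots):
    """Merge one turn's slot-value pairs into the belief state."""
    updated = dict(belief)
    for slot, value in turn_slots.items():
        updated[slot] = value          # later turns override earlier ones
    return updated

belief = {}
turns = [
    {"hotel-location": "Paris"},
    {"hotel-checkin": "May 3", "hotel-rooms": "2"},
    {"hotel-location": "Lyon"},        # the user changed their mind
]
for turn in turns:
    belief = track(belief, turn)
```

The final belief state carries "Lyon" for the location, reflecting the correction in the third turn while retaining the other slots.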

#### *4.1.3 Natural language generation*

Natural language generation (NLG) is the last module in the pipeline mode of a task-oriented dialog system. It converts the dialog actions generated by the dialog manager into the final natural language representation. The standard natural language generation pipeline is composed of four components; its core components are content determination, sentence planning, and surface realization.

Deep learning has been applied to further enhance NLG performance, folding the pipeline into a single module. End-to-end natural language generation has made encouraging progress and is now the most popular way to implement NLG. Some scholars argued that natural language generation should be completely data-driven and not rely on any expert rules [64]. They proposed an RNN-based statistical language model that uses semantic constraints and syntax trees to learn response generation, with a CNN reranker to further select better answers. Similarly, some scholars used an LSTM model to learn sentence planning and surface realization at the same time, and some used GRUs to further improve generation quality across multiple domains [65]; the proposed generator consistently produces high-quality responses on multiple domains. To improve the domain adaptability of recurrent models, some scholars proposed to first train the recurrent language model on data synthesized from out-of-domain datasets and then fine-tune it on a relatively small in-domain dataset; this training strategy proved effective in human evaluation [66].
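Before neural generators, surface realization was often template-based. The sketch below shows this baseline: a dialog act with slot-value arguments is mapped to a response string. The acts and templates are invented for illustration.

```python
# Template-based surface realization: the dialog manager emits a
# dialog act plus slot-value arguments, and the generator renders it.
# Acts and templates are toy examples for this sketch.

TEMPLATES = {
    "inform": "{name} is a {price} hotel in {area}.",
    "request": "Which {slot} would you like?",
}

def generate(act, **args):
    """Render a dialog act into a natural language response."""
    return TEMPLATES[act].format(**args)

reply = generate("inform", name="Hotel Lux", price="moderate", area="Paris")
```

Neural NLG replaces the fixed templates with a learned decoder, trading hand-written coverage for fluency and generalization.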

#### **4.2 End-to-end mode**

Recent works often do not build end-to-end systems in a pipeline manner. Instead, they use complex neural models to implicitly represent the key functions and integrate all modules into one. Research on task-oriented end-to-end neural models mainly focuses on training methods and model architecture, which are key to the correctness and quality of responses [67]. Some scholars proposed an incremental learning framework to train their end-to-end task-oriented system [61]. The main idea is to establish an uncertainty estimation module that evaluates the confidence of the generated response: if the confidence score is higher than a threshold, the response is accepted; if it is low, a human response is introduced instead, and the agent can use online learning to learn from the human responses. Some scholars use model-agnostic meta-learning (MAML) to jointly improve adaptability and reliability [68], since real-life online service tasks offer only a few training samples. Similarly, some scholars also used MAML to train an end-to-end neural model for domain adaptation, allowing the model to train first on resource-rich tasks and then on the limited data of a new task [59]. Other scholars trained an inconsistent-order detection module in an unsupervised manner [63]; the module detects disordered utterances so as to generate more coherent responses.
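The confidence-gated acceptance used by the incremental learning framework can be sketched as follows. The confidence values and the threshold are illustrative stand-ins for a learned uncertainty estimation module.

```python
# Confidence-gated response acceptance: keep the generated response
# when the estimated confidence clears a threshold, otherwise hand
# over to a human and store the correction as a training signal.
# The scoring is a stand-in for a learned uncertainty module.

def respond(candidate, confidence, human_fallback, memory, threshold=0.8):
    """Return the accepted response; log human corrections for learning."""
    if confidence >= threshold:
        return candidate
    human_reply = human_fallback()          # hand over to a human agent
    memory.append((candidate, human_reply)) # stored for online learning
    return human_reply

memory = []
r1 = respond("Your room is booked.", 0.93, lambda: "(human)", memory)
r2 = respond("I think maybe yes?", 0.41,
             lambda: "Yes, breakfast is included.", memory)
```

The low-confidence candidate is replaced by the human reply, and the pair is logged so the agent can learn from it later.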

#### **5. Conclusions**

Most existing shallow and deep learning models have structures that can be used for text classification, including ensemble approaches. BERT learns a form of linguistic representation that can be fine-tuned for many downstream NLP tasks. The main approaches to obtaining better results are to add data, increase computational power, and design better training programs; the trade-off between data, computational resources, and predictive performance is worth investigating. Because data with full supervisory information cannot always be collected, multi-label text classification (MLTC) is gradually turning to the problem of classification with limited supervised information. Since the excellent performance of AlexNet in 2012, deep learning has shown great potential; how to leverage its powerful learning capabilities to better capture label dependencies is key to solving MLTC tasks.

With the application of deep learning technology to sentiment analysis tasks, the performance of sentiment analysis has greatly improved. However, some tasks and scenarios still need richer datasets to evaluate models more accurately.

Although deep learning has achieved remarkable results in dialog systems, accurate and fast recognition of user intent in the pipeline mode remains an industry demand, while the controllability and interpretability of the end-to-end mode need further study.

*Deep Learning for Natural Language Processing DOI: http://dx.doi.org/10.5772/intechopen.112550*

#### **Author details**

Yuan Wang\*, Zekun Li, Zhenyu Deng, Huiling Song and Jucheng Yang College of Artificial Intelligence, Tianjin University of Science and Technology, China

\*Address all correspondence to: wangyuan23@tust.edu.cn

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Graves A. Long short-term memory. In: Supervised sequence labelling with recurrent neural networks. Berlin: Springer; 2012. pp. 37-45

[2] Sakai Y, Matsuoka Y, Goto M. Purchasing behavior analysis model that considers the relationship between topic hierarchy and item categories. In: International Conference on Human-Computer Interaction. Cham: Springer; 2022. pp. 344-358

[3] Chen Z, Qian T. Transfer capsule network for aspect level sentiment classification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Washington: ACL; 2019. pp. 547-556

[4] Li Q, Peng H, Li J, Xia C, Yang R, Sun L, et al. A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST). 2022;**13**(2):1-41

[5] Chen Y. Convolutional neural network for sentence classification. [Master's thesis], University of Waterloo. 2015

[6] Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X et al. Joint embedding of words and labels for text classification. arXiv preprint arXiv: 1805.04174. 2018

[7] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013

[8] Pennington J, Socher R, Manning CD. Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing

(EMNLP). Toronto: ACL; 2014. pp. 1532-1543

[9] Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research. 2021;**304**:114135

[10] Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805

[11] Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. 2014

[12] Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. 2014

[13] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;**30**

[14] Liu J, Chang W-C, Wu Y, Yang Y. Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM; 2017. pp. 115-124

[15] You R, Zhang Z, Wang Z, Dai S, Mamitsuka H, Zhu S. Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. Advances in Neural Information Processing Systems. 2019;**32**

[16] Jiang T, Wang D, Sun L, Yang H, Zhao Z, Zhuang F. Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. Toronto: AAAI; 2021. pp. 7987-7994

[17] Dahiya K, Saini D, Mittal A, Shaw A, Dave K, Soni A, et al. Deepxml: A deep extreme multi-label learning framework applied to short text documents. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York: ACM; 2021. pp. 31-39

[18] Mittal A, Dahiya K, Agrawal S, Saini D, Agarwal S, Kar P, et al. Decaf: Deep extreme classification with label features. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York: ACM; 2021. pp. 49-57

[19] Saini D, Jain AK, Dave K, Jiao J, Singh A, Zhang R, et al. Galaxc: Graph neural networks with labelwise attention for extreme classification. In: Proceedings of the Web Conference 2021. New York: ACM; 2021. pp. 3733-3744

[20] Mittal A, Sachdeva N, Agrawal S, Agarwal S, Kar P, Varma M. Eclare: Extreme classification with label graph correlations. In: Proceedings of the Web Conference 2021. New York: ACM; 2021. pp. 3721-3732

[21] Thet TT, Na J-C, Khoo CSG. Aspectbased sentiment analysis of movie reviews on discussion boards. Journal of Information Science. 2010;**36**(6): 823-848

[22] Liu B, Zhang L. A survey of opinion mining and sentiment analysis. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Boston, MA: Springer; 2012

[23] Pontiki M, Galanis D, Papageorgiou H, Manandhar S, Androutsopoulos I. Semeval-2015 task 12: Aspect based sentiment analysis. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Toronto: ACL; 2015. pp. 486-495

[24] Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, Al-Smadi M, et al. Semeval-2016 task 5: Aspect based sentiment analysis. In: International Workshop on Semantic Evaluation. Toronto: ACL; 2016. pp. 19-30

[25] Do HH, Prasad PWC, Maag A, Alsadoon A. Deep learning for aspect-based sentiment analysis: A comparative review. Expert Systems with Applications. 2019;**118**: 272-299

[26] Akhtar MS, Gupta D, Ekbal A, Bhattacharyya P. Feature selection and ensemble construction: A two-step method for aspect based sentiment analysis. Knowledge-Based Systems. 2017;**125**:116-135

[27] Peng H, Ma Y, Li Y, Cambria E. Learning multi-grained aspect target sequence for chinese sentiment analysis. Knowledge-Based Systems. 2018;**148**: 167-176

[28] Tang F, Luoyi F, Yao B, Wenchao X. Aspect based fine-grained sentiment analysis for online reviews. Information Sciences. 2019;**488**:190-204

[29] Liu N, Shen B. Rememnn: A novel memory neural network for powerful interaction in aspect-based sentiment analysis. Neurocomputing. 2020;**395**: 66-77

[30] Xiao D, Ren F, Pang X, Cai M, Wang Q, He M, et al. A hierarchical and parallel framework for end-to-end aspect-based sentiment analysis. Neurocomputing. 2021;**465**:549-560

[31] Zhou J, Zhao J, Huang JX, Qinmin Vivian H, He L. Masad: A large-scale dataset for multimodal aspect-based sentiment analysis. Neurocomputing. 2021;**455**:47-58

[32] Khasanah IN. Sentiment classification using fasttext embedding and deep learning model. Procedia Computer Science. 2021;**189**: 343-350

[33] Basiri ME, Nemati S, Abdar M, Cambria E, Rajendra U, Acharya. Abcdm: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Generation Computer Systems. 2021;**115**:279-294

[34] Qiang L, Zhu Z, Zhang G, Kang S, Liu P. Aspect-gated graph convolutional networks for aspect-based sentiment analysis. Applied Intelligence. 2021; **51**(7):4408-4419

[35] Liang Y, Meng F, Zhang J, Chen Y, Jinan X, Zhou J. A dependency syntactic knowledge augmented interactive architecture for end-to-end aspect-based sentiment analysis. Neurocomputing. 2021;**454**:291-302

[36] Wang X, Li F, Zhang Z, Guangluan X, Zhang J, Sun X. A unified position-aware convolutional neural network for aspect based sentiment analysis. Neurocomputing. 2021;**450**: 91-103

[37] Li Z, Li L, Zhou A, Hongbin L. Jtsg: A joint term-sentiment generator for aspect-based sentiment analysis. Neurocomputing. 2021;**459**:1-9

[38] Qiannan X, Zhu L, Dai T, Yan C. Aspect-based sentiment classification with multi-attention network. Neurocomputing. 2020;**388**:135-143

[39] Chen Y, Zhuang T, Guo K. Memory network with hierarchical multi-head attention for aspect-based sentiment analysis. Applied Intelligence. 2021; **51**(7):4287-4304

[40] Yuming Lin YF, Li Y, Cai G, Zhou A. Aspect-based sentiment analysis for online reviews with hybrid attention networks. World Wide Web. 2021; **24**(4):1215-1233

[41] Liao W, Zeng B, Yin X, Wei P. An improved aspect-category sentiment analysis model for text sentiment analysis based on roberta. Applied Intelligence. 2021;**51**(6):3522-3533

[42] Kaur R, Kautish S. Multimodal sentiment analysis: A survey and comparison. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines. IGI Global. 2022. pp. 1846-1870

[43] Soleymani M, Garcia D, Jou B, Schuller B, Chang S-F, Pantic M. A survey of multimodal sentiment analysis. Image and Vision Computing. 2017;**65**: 3-14

[44] You Q, Luo J, Jin H, Yang J. Crossmodality consistent regression for joint visual-textual sentiment analysis of social multimedia. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. New York: ACM; 2016. pp. 13-22

[45] You Q, Cao L, Jin H, Luo J. Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks. In: Proceedings of the 24th ACM International Conference on Multimedia. New York: ACM; 2016. pp. 1008-1017


[46] Nan X. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). Beijing, China: IEEE; 2017. pp. 152-154

[47] Xu N, Mao W. Multisentinet: A deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. New York: ACM; 2017. pp. 2399-2402

[48] Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM; 2018. pp. 929-932

[49] Yang X, Feng S, Wang D, Zhang Y. Image-text multimodal emotion classification via multi-view attentional network. IEEE Transactions on Multimedia. 2020;**23**:4014-4026

[50] Cheema GS, Hakimov S, Müller-Budack E, Ewerth R. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods. In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. New York: ACM; 2021. pp. 37-45

[51] Masi I, Tran AT, Leksut JT, Hassner T, Medioni G. Do we really need to collect millions of faces for effective face recognition? In: Computer Vision. Cham: Springer; 2016. pp. 579-596

[52] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations. New York: ACM; 2014. pp. 46-57

[53] Campagna G, Foryciarz A, Moradshahi M, Lam MS. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking. 2020

[54] Chen J, Zhang R, Mao Y, Xu J. Parallel interactive networks for multidomain dialogue state generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Toronto: ACL; 2020. pp. 17-26

[55] Chen H, Liu X, Yin D, Tang J. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter. 2017;**19**(2): 25-35

[56] Gliwa B, Mochol I, Biesek M, Wawer A. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. In: Proceedings of the 2nd Workshop on New Frontiers in Summarization. New York: ACM; 2019. pp. 38-49

[57] Wen TH, Gasic M, Kim D, Mrksic N, Su PH, Vandyke D, et al. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Toronto: ACL; 2015. pp. 275-284

[58] Wen TH, Gasic M, Mrksic N, Rojas-Barahona LM, Su PH, Ultes S, et al. Conditional generation and snapshot learning in neural dialogue systems. 2016

[59] Wen TH, Vandyke D, Mrksic N, Gasic M, Rojas-Barahona LM, Su PH, et al. A network-based end-to-end trainable task-oriented dialogue system. 2016

[60] Williams J. Multi-domain learning and generalization in dialog state tracking. In: Proceedings of the SIGDIAL 2013 Conference. Toronto: ACL; 2013. pp. 433-441

[61] Williams JD, Asadi K, Zweig G. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Toronto: ACL; 2017. pp. 665-677

[62] Tamar A, Wu Y, Thomas G, Levine S, Abbeel P. Value iteration networks. In: Twenty-Sixth International Joint Conference on Artificial Intelligence. New York: ACM; 2017. pp. 246-257

[63] Loni B. A survey of state-of-the-art methods on question classification. In: Proceedings of the 7th Workshop on Ph.D Students. New York: ACM; 2011

[64] Tao C, Mou L, Zhao D, Rui Y. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. 2017

[65] Tao C, Wu W, Xu C, Hu W, Yan R. One time of interaction may not be enough: Go deep with an interactionover-interaction network for response selection in dialogues. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Toronto: ACL; 2019. pp. 189-197

[66] Tran VK, Nguyen LM. Semantic Refinement Gru-Based Neural Language Generation for Spoken Dialogue Systems. Singapore: Springer; 2017

[67] Tur G, Hakkani-Tur D, Heck L. What is left to be understood in atis? In: Spoken Language Technology Workshop (SLT), New York: IEEE; 2011. pp. 236-247

[68] Lu C, Xiang Z, Cheng C, Yang R, Kai Y. Agent-aware dropout dqn for safe and efficient on-line dialogue policy learning. In: The 2017 Conference on Empirical Methods on Natural Language Processing, Toronto: ACL; 2017. pp. 127-137

#### **Chapter 6**

## Deep Learning in Medical Imaging

*Narjes Benameur and Ramzi Mahmoudi*

#### **Abstract**

Medical image processing tools play an important role in clinical routine by helping doctors establish whether a patient has a certain disease. To validate diagnosis results, various clinical parameters must be defined. In this context, several algorithms and mathematical tools have been developed in the last two decades to extract accurate information from medical images or signals. Traditionally, feature extraction from medical data using image processing is time-consuming and requires human interaction and expert validation. The segmentation of medical images, the classification of medical images, and the significance of deep learning-based algorithms in disease detection are all topics covered in this chapter.

**Keywords:** deep learning, medical imaging, segmentation, classification, diagnosis

#### **1. Introduction**

Artificial intelligence (AI) has recently been considered a revolution across the medical field, and one of the main drivers of this AI revolution is deep learning (DL). The origins of DL and neural networks date back to the 1950s. Yet, with the introduction of the medically annotated big data necessary for training and the availability of high-performance computing, recent years seem to mark a turning point for DL in medical imaging.

Accordingly, this branch of AI has recently been applied to several healthcare problems such as computer-aided diagnosis, disease identification, and image segmentation and classification. Unlike classical tools, the key strength of DL derives from its ability to automatically learn complex features without the need for human interaction. Nevertheless, many challenges remain in medical health, including privacy and the heterogeneity of datasets. In this chapter, we survey the application of DL in clinical imaging and highlight the main challenges and future directions of this tool.

#### **2. Deep learning-based segmentation in medical imaging**

Deep learning algorithms have been used in many medical applications to solve problems in segmentation, image classification, and pathology diagnosis. Manual segmentation is time-consuming for radiologists because it is typically done slice by slice. Furthermore, segmentation results are susceptible to inter- and intra-observer variability. To address these limitations, several approaches based on active contour, level set, and statistical shape modeling [1–3] have been proposed to segment the extent of various pathologies or anatomical geometries. All of the methods mentioned above, however, are still semi-automated and require human interaction [4].

With the advent of DL, fully automated segmentation of serial medical images has become possible in a few seconds. Several studies in the literature reported that segmentation algorithms based on AI outperform classical models [5, 6]. Convolutional neural networks (CNNs) are the architecture most used to segment medical images: they reduce the spatial dimensionality of the original image data through a series of network layers performing convolution and pooling operations. Other DL architectures have also been proposed for this task, such as the deep neural network (DNN), artificial neural network (ANN), fully convolutional network (FCN), ResNet-50, and VGGNet-16 [7–10]. **Figure 1** describes the tasks involved in segmenting cardiac images for various imaging modalities.
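The pooling operations mentioned above are what reduce the spatial dimensionality: for instance, a 2 × 2 max-pooling step with stride 2 halves each spatial axis of a feature map, as this NumPy sketch shows on a toy 4 × 4 input.

```python
import numpy as np

# A 2x2 max-pooling step (stride 2), one of the operations CNNs use
# to reduce the spatial dimensions of a feature map.

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over an (H, W) array (H, W even)."""
    h, w = feature_map.shape
    # Group pixels into 2x2 blocks and take the maximum of each block.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)  # a toy 4x4 "image"
pooled = max_pool_2x2(image)                      # 2x2 output
```

Each pooling layer therefore quarters the number of spatial positions while keeping the strongest activation in each neighborhood.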

The success of DL-based medical image segmentation has inspired other studies to reevaluate traditional approaches to image segmentation and incorporate DL models into their work. Many factors have facilitated the increased use of DL, among them the availability of medical data and the improving performance of graphics processors.

Each year, large annotated datasets are published online. These data were collected during challenges such as the Medical Segmentation Decathlon and Medical Image Computing and Computer Assisted Intervention (MICCAI). **Table 1** summarizes the largest medical image datasets available online.

Segmentation based on DL has been applied in different fields of medical imaging [12–14]. In cardiac MRI, several DL models have been used to delineate the contours of the myocardium, a crucial step in computing clinical parameters for the evaluation of cardiac function [15]. DL has also been applied to the segmentation of different types and stages of cancer. For breast cancer, the data include mammography, ultrasound, and MRI images [16–18]. Other DL architectures have also been proposed in the literature to segment cervical cancer based on Magnetic Resonance Imaging (MRI), computed tomography (CT), and positron emission tomography (PET) scan

#### **Figure 1.**

*Overview of cardiac image segmentation tasks for different imaging modalities [11].*


#### **Table 1.**

*Medical images datasets available online.*

data [19]. Zhao et al. [20] proposed a new DL model that combined U-Net with progressive growing of U-Net+ (PGU-net) for automated segmentation of cervical nuclei; they reported a segmentation accuracy of 92.5%. Similarly, Liu et al. [21] applied a modified U-Net model to CT images for clinical target volume delineation in cervical cancer. In their proposed architecture, the encoder and decoder components were replaced with dual path network (DPN) components. The mean Dice similarity coefficient (DSC) and Hausdorff distance (HD) values of the model were 0.88 and 3.46 mm, respectively.
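The Dice similarity coefficient reported in these studies measures the overlap between a predicted mask A and the ground truth B as DSC = 2|A ∩ B| / (|A| + |B|). A NumPy sketch on toy binary masks:

```python
import numpy as np

# Dice similarity coefficient between two binary segmentation masks.
# The 4x4 masks below are toy examples, not medical data.

def dice(pred, truth):
    """DSC between two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * inter / denom if denom else 1.0  # empty masks: perfect match

truth = np.zeros((4, 4), dtype=int)
truth[1:3, 1:3] = 1                     # 4-pixel ground-truth lesion
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:4] = 1                      # prediction spills one extra column
score = dice(pred, truth)               # 2*4 / (6 + 4) = 0.8
```

A DSC of 1.0 indicates perfect overlap, 0.0 no overlap; the spilled column here costs the prediction 0.2.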

Although image segmentation based on DL facilitates the detection, characterization, and analysis of different lesions in medical images, it still suffers from several limitations. First, the problem of missing border regions in medical images must be considered [22]. Furthermore, the imbalanced data available online can significantly affect segmentation performance: in medical imaging, collecting balanced data is challenging since images of controls are far more widely available than those associated with different pathologies. As a result, some models have been proposed to mitigate this problem, including convolutional autoencoders [23] and generative adversarial networks (GANs) [24]. The concept is to extract information from the original images and generate a similar dataset of images through linear transformations, e.g., reflection, rotation, and translation.
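Such label-preserving transformations can be sketched directly with NumPy array operations; the 2 × 2 array below is a toy stand-in for a medical image.

```python
import numpy as np

# Simple label-preserving augmentations of the kind mentioned above:
# reflection, rotation, and translation of a 2-D image array.

def augment(image):
    """Return a list of transformed copies of a 2-D image array."""
    return [
        np.fliplr(image),           # horizontal reflection
        np.flipud(image),           # vertical reflection
        np.rot90(image),            # 90-degree counterclockwise rotation
        np.roll(image, 1, axis=1),  # 1-pixel translation (with wrap-around)
    ]

image = np.array([[1, 2],
                  [3, 4]])
augmented = augment(image)
```

Each transformed copy can be added to the minority (pathology) class to rebalance the training set without collecting new scans.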

#### **3. Deep learning-based classification in medical imaging**

DL has also demonstrated its superiority in the classification of medical images, most notably in distinguishing various disorders. Classification involves extracting key features to produce a model that can assign an image to one of several classes. Several classical classifiers that extract features based on color or texture have been proposed in the literature [25–27], among them support vector machines (SVMs), logistic regression, and nearest neighbors. These systems must, however, cope with challenging problems specific to medical imaging. First, the presence of artifacts in medical images can make them harder to categorize, which makes pre-processing crucial to improving image quality. The second problem is the complexity of the medical content captured by the many modalities: because each modality has distinct characteristics, the classification of medical images is extremely difficult.

Recently, several researchers have used DL for medical classification tasks, and the results demonstrate the accuracy of their models in comparison with traditional machine learning approaches [28]. Deep learning's key benefit is its ability to quickly distinguish between various structures in images without manual feature extraction. Recent DL architectures also have the capacity to combine features gathered from many modalities to produce an effective classifier.

Yadav and Jadhav [29] used a DL algorithm based on transfer learning of VGG16 to classify pneumonia from chest X-ray images. In their study, they showed that VGG16 outperformed a classical SVM-based method, with an accuracy of 0.923 for VGG16 vs. 0.776 for SVM. Similarly, Xu et al. [30] tested a deep CNN on histopathology images to extract new features for the classification of colon cancer. Lai et al. [31] proposed a new architecture that combines a coding network with multilayer perceptron (CNMP) with other features extracted from a deep CNN, reporting an accuracy of 90.2%.

Although DL has achieved high performance in the classification of medical images, it still suffers from numerous limitations. The major challenge is the small amount of annotated data available for the classification of medical images, since labeling data requires the intervention of experienced radiologists. A few solutions have been proposed to resolve this issue. Pujitha and Sivaswamy [32] proposed crowd-sourcing and synthetic image generation for training deep neural net-based lesion detection. Using color fundus retinal images, they showed that crowd-sourcing improves the area under the curve (AUC) by 25%. Generative adversarial networks (GANs) are another source of synthetic images with annotations: Aljohan and Alharbe [33] proposed a new GAN to generate synthetic medical images with the corresponding annotations from different medical modalities. The classification of medical images based on DL has shown good results; however, several issues in medical image processing still need to be addressed with the different DL architectures.

**Figure 2.** *Deep learning for the screening of breast cancer [37].*

#### **4. Disease diagnosis based on deep learning**

Early and precise diagnosis is crucial for the treatment of different diseases and for the estimation of a severity grade. The use of DL for the diagnosis of diseases is a dynamic research area that attracts researchers worldwide. DL architectures have been applied to specific pathologies such as cancer, heart disease, diabetes, and Alzheimer's disease [34, 35]. The increasing number of medical imaging datasets has led researchers to use deep learning models for the diagnosis of different diseases.

DL algorithms have proven their performance in the prediction and diagnosis of cancer. The availability of images derived from MRI, CT, mammography, and biopsy has helped several researchers to use these data for early cancer detection. The analysis of cancer images includes the detection of the tumor area, the classification of different cancer stages, and the extraction of different tumor characteristics [36].

Recently, Shen et al. [37] used a modified version of CNNs for the screening of breast cancer using mammography data. The outcomes of their study showed an AUC of 0.95 and a specificity of 96.1%. A CNN was also applied for the classification of different kinds of cancer and the detection of carcinoma. **Figure 2** depicts the entire image categorization process for breast cancer screening using a DL architecture.

Alanazi et al. [38] applied a transfer DL model to detect brain tumors at an early stage using various types of tumor data. Furthermore, another study used a 3D deep CNN to assess glioma grade (low- or high-grade glioma), reporting an accuracy of 96.49% [39]. Compared with classical algorithms, these studies demonstrate the efficiency of DL in the prediction and analysis of cancer. However, larger medical datasets available online are needed for more adequate validation.

#### **5. Conclusion**

As has been shown, using medical image processing techniques in clinical practice is crucial for determining whether a patient has a particular disease. The field of medical imaging has been transformed by AI and DL, which enable more precise and automatic feature extraction from medical data. DL has been used to address a variety of healthcare issues, including image segmentation and classification, disease detection, computer-aided diagnosis, and the learning of complex features without human interaction. Despite the advances made, many challenges still exist in medical health, including privacy and the heterogeneity of datasets.

### **Conflict of interest**

The authors declare no conflict of interest.

### **Author details**

Narjes Benameur1 and Ramzi Mahmoudi2,3\*

1 Laboratory of Biophysics and Medical Technology, Higher Institute of Medical Technologies of Tunis, University of Tunis el Manar, Tunis, Tunisia

2 Faculty of Medicine, Laboratory of Technology and Medical Imaging, University of Monastir, Monastir, Tunisia

3 Gaspard-Monge Computer Science Laboratory, A3SI, ESIEE Paris, Gustave Eiffel University, France

\*Address all correspondence to: ramzi.mahmoudi@esiee.fr

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Pohl KM, Fisher J, Kikinis R, Grimson WEL, Wells WM. Shape based segmentation of anatomical structures in magnetic resonance images. Computer Vision for Biomedical Image Applications. 2005;**3765**:489-498. DOI: 10.1007/11569541_49

[2] Chen X, Williams BM, Vallabhaneni SR, Czanner G, Williams R, Zheng Y. Learning active contour models for medical image segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA; 2019. pp. 11624-11632. DOI: 10.1109/CVPR.2019.01190

[3] Swierczynski P, Papież BW, Schnabel JA, Macdonald C. A level-set approach to joint image segmentation and registration with application to CT lung imaging. Computerized Medical Imaging and Graphics. 2018;**65**:58-68

[4] Gao Y, Tannenbaum A. Combining atlas and active contour for automatic 3D medical image segmentation. Proceedings of the IEEE International Symposium on Biomedical Imaging. 2011;**2011**:1401-1404

[5] Kim M, Yun J, Cho Y, Shin K, Jang R, Bae HJ, et al. Deep learning in medical imaging. Neurospine. 2019;**16**(4):657-668

[6] Vaidyanathan A, van der Lubbe MFJA, Leijenaar RTH, van Hoof M, Zerka F, Miraglio B, et al. Deep learning for the fully automated segmentation of the inner ear on MRI. Scientific Reports. 2021;**11**(1):2885

[7] Zadeh Shirazi A, McDonnell MD, Fornaciari E, Bagherian NS, Scheer KG, Samuel MS, Yaghoobi M, Ormsby RJ, Poonnoose S, Tumes DJ, Gomez GA. A deep convolutional neural network for segmentation of whole-slide pathology images identifies novel tumour cell-perivascular niche interactions that are associated with poor survival in glioblastoma.

[8] Cai L, Gao J, Zhao D. A review of the application of deep learning in medical image classification and segmentation. Annals of Translational Medicine. 2020;**8**(11):713. DOI: 10.21037/atm.2020.02.44

[9] Malhotra P, Gupta S, Koundal D, Zaguia A, Enbeyle W. Deep neural networks for medical image segmentation. Journal of Healthcare Engineering. 2022;**2022**:9580991. DOI: 10.1155/2022/9580991

[10] Alsubai S, Khan HU, Alqahtani A, Sha M, Abbas S, Mohammad UG. Ensemble deep learning for brain tumor detection. Frontiers in Computational Neuroscience. 2022;**16**:1005617. DOI: 10.3389/fncom.2022.1005617

[11] Chen C, Qin C, Qiu H, Tarroni G, Duan J, Bai W, et al. Deep learning for cardiac image segmentation: A review. Frontiers in Cardiovascular Medicine. 2020;**7**:25. DOI: 10.3389/fcvm.2020.00025

[12] Hesamian MH, Jia W, He X, Kennedy P. Deep learning techniques for medical image segmentation: Achievements and challenges. Journal of Digital Imaging. 2019;**32**(4):582-596. DOI: 10.1007/s10278-019-00227-x

[13] Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. A review of deep learning based methods for medical image multiorgan segmentation. Physica Medica. 2021;**85**:107-122

[14] Bangalore Yogananda CG, Shah BR, Vejdani-Jahromi M, Nalawade SS, Murugesan GK, Yu FF, et al. A fully automated deep learning network for brain tumor segmentation. Tomography. 2020;**6**(2):186-193

[15] Wang Y, Zhang Y, Wen Z, Tian B, Kao E, Liu X, et al. Deep learning based fully automatic segmentation of the left ventricular endocardium and epicardium from cardiac cine MRI. Quantitative Imaging in Medicine and Surgery. 2021;**11**(4):1600-1612

[16] Abdelrahman A, Viriri S. Kidney tumor semantic segmentation using deep learning: A survey of state-of-the-art. Journal of Imaging. 2022;**8**(3):55

[17] Yue W, Zhang H, Zhou J, Li G, Tang Z, Sun Z, et al. Deep learningbased automatic segmentation for size and volumetric measurement of breast cancer on magnetic resonance imaging. Frontiers in Oncology. 2022;**12**:984626

[18] Caballo M, Pangallo DR, Mann RM, Sechopoulos I. Deep learning-based segmentation of breast masses in dedicated breast CT imaging: Radiomic feature stability between radiologists and artificial intelligence. Computers in Biology and Medicine. 2020;**118**:103629

[19] Yang C, Qin LH, Xie YE, Liao JY. Deep learning in CT image segmentation of cervical cancer: A systematic review and meta-analysis. Radiation Oncology. 2022;**17**(1):175

[20] Zhao Y, Rhee DJ, Cardenas C, Court LE, Yang J. Training deep-learning segmentation models from severely limited data. Medical Physics. 2021;**48**(4):1697-1706

[21] Liu Z, Liu X, Guan H, Zhen H, Sun Y, Chen Q, et al. Development and validation of a deep learning algorithm for auto-delineation of clinical target volume and organs at risk in cervical cancer radiotherapy. Radiotherapy and Oncology. 2020;**153**:172-179

[22] Zambrano-Vizuete M, Botto-Tobar M, Huerta-Suárez C, Paredes-Parada W, Patiño Pérez D, Ahanger TA, et al. Segmentation of medical image using novel dilated ghost deep learning model. Computational Intelligence and Neuroscience. 2022;**2022**:6872045

[23] Gondara L. Medical image denoising using convolutional denoising autoencoders. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). Barcelona, Spain; 2016. pp. 241-246. DOI: 10.1109/ICDMW.2016.0041

[24] Gulakala R, Markert B, Stoffel M. Generative adversarial network based data augmentation for CNN based detection of Covid-19. Scientific Reports. 2022;**12**:19186

[25] Shukla P, Verma A, Verma S, Kumar M. Interpreting SVM for medical images using Quadtree. Multimedia Tools and Applications. 2020;**79**(39-40):29353-29373

[26] Tchito Tchapga C, Mih TA, Tchagna Kouanou A, Fozin Fonzin T, Kuetche Fogang P, Mezatio BA, et al. Biomedical image classification in a big data architecture using machine learning algorithms. Journal of Healthcare Engineering. 2021;**2021**:9998819

[27] Rashed BM, Popescu N. Critical analysis of the current medical image-based processing techniques for automatic disease evaluation: Systematic literature review. Sensors (Basel). 2022;**22**(18):7065

[28] Puttagunta M, Ravi S. Medical image analysis based on deep learning approach. Multimedia Tools and Applications. 2021;**80**(16):24365-24398

[29] Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big Data. 2019;**6**:113

[30] Xu Y, Jia Z, Wang LB, Ai Y, Zhang F, Lai M, et al. Large scale tissue histopathology image classification, segmentation, and visualization via deep convolutional activation features. BMC Bioinformatics. 2017;**18**(1):281

[31] Lai Z, Deng H. Medical image classification based on deep features extracted by deep model and statistic feature fusion with multilayer perceptron. Computational Intelligence and Neuroscience. 2018;**2018**:2061516

[32] Pujitha AK, Sivaswamy J. Solution to overcome the sparsity issue of annotated data in medical domain. CAAI Transactions on Intelligence Technology. 2018;**3**:153-160

[33] Aljohani A, Alharbe N. Generating synthetic images for healthcare with novel deep Pix2Pix GAN. Electronics. 2022;**11**(21):3470. DOI: 10.3390/electronics11213470

[34] Kumar Y, Koul A, Singla R, Ijaz MF. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. Journal of Ambient Intelligence and Humanized Computing. 2022;**2022**:1-28

[35] Ibrahim A, Mohamed HK, Maher A, Zhang B. A survey on human cancer categorization based on deep learning. Frontiers in Artificial Intelligence. 2022;**5**:884749. DOI: 10.3389/frai.2022.884749

[36] Tran KA, Kondrashova O, Bradley A, et al. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Medicine. 2021;**13**:152. DOI: 10.1186/s13073-021-00968-x

[37] Shen L, Margolies LR, Rothstein JH, et al. Deep learning to improve breast cancer detection on screening mammography. Scientific Reports. 2019;**9**:12495. DOI: 10.1038/s41598-019-48995-4

[38] Alanazi MF, Ali MU, Hussain SJ, Zafar A, Mohatram M, Irfan M, et al. Brain tumor/mass classification framework using magnetic-resonance-imaging-based isolated and developed transfer deep-learning model. Sensors (Basel). 2022;**22**(1):372. DOI: 10.3390/s22010372

[39] Mzoughi H, Njeh I, Wali A, Slima MB, BenHamida A, Mhiri C, et al. Deep multi-scale 3D Convolutional Neural Network (CNN) for MRI gliomas brain tumor classification. Journal of Digital Imaging. 2020;**33**(4):903-915. DOI: 10.1007/s10278-020-00347-9

### *Edited by Jucheng Yang, Yarui Chen, Tingting Zhao, Yuan Wang and Xuran Pan*

Deep learning and reinforcement learning are some of the most important and exciting research fields today. With the emergence of new network structures and algorithms such as convolutional neural networks, recurrent neural networks, and self-attention models, these technologies have gained widespread attention and application in fields such as natural language processing, medical image analysis, and Internet of Things (IoT) device recognition. This book, *Deep Learning and Reinforcement Learning*, examines the latest research achievements in these technologies and provides a reference for researchers, engineers, students, and other interested readers. It helps readers understand the opportunities and challenges faced by deep learning and reinforcement learning and how to address them, thus improving the research and application capabilities of these technologies in related fields.

### *Andries Engelbrecht, Artificial Intelligence Series Editor*

Published in London, UK © 2023 IntechOpen

IntechOpen Series: Artificial Intelligence, Volume 18
