Section 2 Methodologies

#### **Chapter 6**

## Improving Face Recognition Using Artistic Interpretations of Prominent Features: Leveraging Caricatures in Modern Surveillance Systems

*Sara R. Davis and Emily M. Hand*

#### **Abstract**

Advances in computer vision have been primarily motivated by a better understanding of how humans perceive and codify faces. Broadly speaking, progress made in the fields of face recognition and identification has been strongly influenced by the biological mechanisms identified by research in the field of cognitive psychology. Research in cognitive psychology has long acknowledged that human face recognition and identification rely heavily on prominent features and that caricatures are capable of modeling prominent features in a multitude of ways. Computer science, by contrast, has done little research on applying prominent features to recognition systems. This chapter discusses existing caricature research in cognitive psychology and computer vision, current issues with the practical application of caricatures to face recognition in computer vision, and how caricatures can be used to improve existing surveillance systems.

**Keywords:** face recognition, caricatures, datasets

#### **1. Introduction**

The word "caricature" comes from Italian for "to exaggerate" [1, 2]. As such, caricatures are artistic renderings of a human face that exaggerate prominent features while still maintaining their resemblance to the original, veridical face. Veridical is defined as the ground truth face [3]. An example can be seen in **Figure 1**. Since the 1590s, caricatures have been considered a humorous art form, meant to either entertain or humiliate, depending on context. In the United States, caricatures rose in popularity following the American Civil War, giving rise to our modern-day interpretation of the art form. At first, these images were used to mock political leaders in an effort to humorously instill political ideology [2].

Today, many people consider caricatures to be "fun" drawings. However, the fields of psychology and neuroscience have recognized the potential application of

**Figure 1.**

*A veridical image (photo) and caricature of David Tennant. (a) veridical image of David Tennant, (b) caricature image of David Tennant.*

caricatures for improving automated face verification and identification systems in recent years. Not only can caricatures be identified more often and faster than veridical images by humans [4], but they can improve the accuracy of low-resolution face verification in elderly populations [3]. Studies have also shown that introducing faces using caricatures, rather than veridical images, results in higher recognition and verification overall [5–8]. These works also show that facial exaggeration in a caricature past a certain point can actually decrease recognition performance over veridical images [9]. Each of these factors makes caricatures the ideal model for exploring how humans perceive and codify faces under nonideal conditions. In this chapter, we discuss how machine learning can leverage this biologically inspired recognition mechanism to improve surveillance systems. We discuss data collection methods and possible system architectures and show that caricatures can be used to train more robust face recognition systems.

#### **2. Caricatures in cognitive psychology**

Past advances in automated face recognition and verification have been driven by advances in face perception research in the field of cognitive psychology [1]. Human facial recognition is not negatively impacted by variation in pose, lighting, or resolution, unlike automatic systems [10–12]. Additionally, research has shown that human facial recognition of familiar faces is consistently better than automated systems [4]. Thus, we propose using caricature research from the field of cognitive psychology to construct surveillance systems that are robust to changes in angle, lighting, and accidental exaggeration.

The study of facial recognition in cognitive psychology has two schools of thought: holistic and nonholistic. Each has supporting research, though we argue that the holistic approach has the greatest probability of being applied to automated facial recognition systems, a position supported by past research [14]. Holistic face recognition research contends that faces are stored in human memory using the relationships among all features in a face, while nonholistic research argues that a single prominent facial attribute is enough to perform face recognition [15]. The difference between the two can be thought of as the holistic view valuing the sum of the parts of the face to create an overall model, while the nonholistic view values single prominent features. The feature relationships that holistic face recognition relies on can be divided into two categories: featural and configural. Featural information describes the general structure of an individual facial feature, while configural information describes the relative placement of and distance between features. An example of the difference can be seen in **Figure 2**, which is taken from ref. [13].
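
To make the featural/configural distinction concrete, the sketch below computes both kinds of descriptors from a set of facial landmarks. It is a minimal illustration, not a method from the cited studies: the 68-point index ranges follow the common dlib-style layout and are assumptions that should be adjusted to whatever landmarker is used.

```python
import numpy as np

# Hypothetical 68-point landmark layout (dlib-style indices); adjust to your landmarker.
FEATURES = {
    "left_eye": range(36, 42),
    "right_eye": range(42, 48),
    "nose": range(27, 36),
    "mouth": range(48, 68),
}

def featural_descriptors(landmarks):
    """Shape of each feature in isolation: width/height of its bounding box."""
    out = {}
    for name, idx in FEATURES.items():
        pts = landmarks[list(idx)]
        out[name] = pts.max(axis=0) - pts.min(axis=0)  # (width, height)
    return out

def configural_descriptors(landmarks):
    """Relative placement: distances between feature centroids, scaled by inter-ocular distance."""
    centroids = {n: landmarks[list(i)].mean(axis=0) for n, i in FEATURES.items()}
    iod = np.linalg.norm(centroids["left_eye"] - centroids["right_eye"])
    names = list(centroids)
    return {
        (a, b): np.linalg.norm(centroids[a] - centroids[b]) / iod
        for i, a in enumerate(names) for b in names[i + 1:]
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_landmarks = rng.uniform(0, 200, size=(68, 2))  # stand-in for a real detection
    print(featural_descriptors(fake_landmarks))
    print(configural_descriptors(fake_landmarks))
```

In this view, featural descriptors change when an individual feature is reshaped, while configural descriptors change when features move relative to one another.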

A literature survey [16] found that most facial recognition performed by humans appears to use a holistic approach, considering each attribute in relation to the other available facial attributes. Other work has looked at face recognition in holistic and nonholistic settings to compare the possible ways humans represent faces in memory [15]. They found that participants could more accurately identify unique facial attributes when the attribute was presented with the whole face as context, rather than as an isolated facial attribute. For example, when participants viewed a nose on its own, they were less likely to identify that facial feature as prominent. However, if the participant viewed that same nose on the face that it belonged to, they had an easier

**Figure 2.** *An illustration of the difference between configural (top) and featural (bottom) feature relationships taken from ref. [13].*

time identifying the nose as prominent. This supports the theory that humans represent faces holistically, and that they need to compare facial features to each other in order to determine which features are most prominent for that identity. Additionally, they found that if faces were inverted, recognition accuracy drastically decreased whether the part was presented with the face as context or on its own. This implies that representations in memory of human faces are strongly correlated with orientation and configural properties of facial attributes.

In order to test the importance of configural properties, another study [14] manipulated the distance between eyes, and participants were tasked with performing face recognition using the distance-altered eyes. Each participant was presented with the altered eyes in (1) the original face, (2) a slight alteration of the original face, and (3) isolation. The authors found that face recognition was best using the original face, followed by the slightly altered original face, and the worst performance occurred when the distance-altered eyes were presented in isolation. Related to prominent feature recognition, the authors also found that the configuration of the original face with the altered eye distance resulted in lower accuracy rates in recognition of the nose and mouth features, even though the nose and mouth features had not been altered. This further supports the holistic school of thought, that is, humans learn faces holistically, and understanding the face depends on both featural and configural information.

Research has shown that participants are able to more quickly identify faces using simple line drawings of caricatured faces as compared to veridical faces [7]. This same study also suggested that caricatures can be used to better understand how humans represent faces using prominent facial features. Specifically, they found that faces and caricatures are stored in memory using the deviation of a prominent feature from the normal presentation of that feature. The authors call this norm-based coding. Another study found that the improvement in face verification rates is not specific to caricatures but is most likely caused by memory retention of facial features that deviate from the average [3]. This implies that human face encoding is closely affiliated with prominent facial features. In another study, Rhodes performed a series of experiments that tested the possible relationship between configural-based coding and norm-based coding using caricatures [8]. They found that the configural-based coding that is necessary for veridical face recognition is not necessary for caricature recognition. This implies that (1) caricature/veridical face pair recognition relies on a memory mechanism that is independent of the face/facial feature pair recognition, (2) caricature/veridical face pair recognition relies on norm-based coding, and (3) the approach to performing veridical/veridical, caricature/caricature, and caricature/veridical image recognition should be different due to the difference in coding methods.
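
The following sketch illustrates norm-based coding in the simplest possible terms: a face is encoded by how far its feature measurements deviate from a population norm, and a caricature pushes those measurements further from the norm. The feature vectors and exaggeration strength are placeholders, not values from the studies above.

```python
import numpy as np

def norm_based_code(face, population):
    """Encode a face as its deviation from the population norm (mean face), in std units."""
    mean = population.mean(axis=0)
    std = population.std(axis=0) + 1e-8
    return (face - mean) / std

def caricature(face, population, strength=1.5):
    """Exaggerate a face by pushing its measurements further from the norm."""
    mean = population.mean(axis=0)
    return mean + strength * (face - mean)

rng = np.random.default_rng(1)
population = rng.normal(size=(500, 10))       # 500 faces x 10 feature measurements (toy data)
face = population[0]
z = norm_based_code(face, population)
prominent = np.argsort(-np.abs(z))[:3]        # indices of the most deviant (prominent) features
print("most prominent feature indices:", prominent)
print("caricatured face:", caricature(face, population))
```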

Research has found that caricatures are accurately identified more quickly and more often than veridical images, with caricatures of familiar faces being recognized with the best accuracy [6]. Additional studies have found that caricatures of unfamiliar faces also improved verification rates by approximately 30%. Furthermore, above a certain rate of exaggeration, caricature verification is actually hindered; in other words, caricatures need to have a reasonable resemblance to the original face [9]. Past work also found that using caricatures led to better recognition of unfamiliar faces across the entire human lifespan, that it improved low-resolution face verification in older adults, and that face verification of other races also improved [3]. Each of these studies indicates that there is a link between human facial recognition, prominent features, and the general configuration of facial features.

Cognitive psychology defines facial features as either internal, such as the eyes, nose, or mouth, or external, such as hair or chin [17]. An example can be seen in **Figure 3**. Past work has shown that familiar faces are more accurately identified if internal features are used, rather than external [17]. Feature type did not have an effect on identifying unfamiliar faces. The authors argued that the manner in which faces are modeled and stored in memory is different for familiar and unfamiliar faces, and thus, their treatment in facial recognition should be different. However, another study [18] found that internal and external features both activate similar face-selective regions of the brain, though internal features result in a greater response. Both of these works [17, 18] found that internal features were more important for familiar faces. Additionally, the study found that altering just the external features resulted in a decrease in identification accuracy, regardless of whether the face was familiar or unfamiliar. This indicates that the internal and external features interact with each other to create a holistic representation in memory and that internal and external features are likely of similar importance in machine learning applications for understanding faces. This is of particular importance to the application of caricatures to surveillance system construction because most face datasets are constructed of famous individuals; however, fame is not consistent across countries and cultures. For example, Fan Bingbing is a famous actress in China, but she is not nearly as well known in the United States. Since face recognition datasets are often composed of a variety of celebrities from around the world, approaches to automated face recognition should not assume familiarity with the subjects in the dataset in order to make the approach relatable to human face recognition processes. Put simply, since the typical participant in a cognitive psychology study is unlikely to be familiar with every individual in a face recognition dataset, any automated system built for facial recognition should not assume familiarity with the subject.

Experiments surrounding how faces are learned over time have also been conducted. Research has shown that after a single view of a face, recognition from a different viewpoint was better using internal features rather than external features [19]. Additionally, the study found that after repeated exposure to a face, removing external features that change with high frequency, such as hair, resulted in better identification when a face was viewed from a different viewpoint. These results suggest that providing too much inconstant information to an automated face recognition system can result in a reduction of recognition capability. This means that the

#### **Figure 3.**

*Examples of the difference between internal and external features. The original veridical image is shown in (a), internal features in (b), and external features in (c). (a) cropped full face, (b) internal facial features, (c) external facial features.*

method in which images of the face are cropped, image background, lighting, and even changes in hairstyle and makeup all likely have a significant effect on automated recognition. Other work [20] found that repeated exposure to the same face within different contexts—variations in pose and lighting—resulted in better facial recognition than if subjects simply viewed the same image of the face over and over again. This indicates that exposure to unique images has an effect on how faces are learned and retained in memory, which means that in order for an accurate facial representation to be built, automated systems need to utilize a variety of images for each identity and repeated exposure.

#### **3. Caricatures in computer science**

Today's automated surveillance systems rely on identity matching in some latent space [21]. We argue that the use of caricatures would better allow these systems to describe faces and prominent features, thus allowing for greater variability in pose, lighting, etc. Research shows that current automated face recognition and verification systems perform better than human recognition and verification. However, that same research suggests that automated systems only perform better on carefully curated datasets [4, 21, 22]. In other words, automated systems cannot handle images that are not taken under ideal lighting, pose, and resolution conditions. Humans, on the other hand, are capable of recognizing faces under nonideal conditions. Many surveillance systems require face alignment in order to achieve state-of-the-art results [23, 24]. This means that they work best on frontal facing images [25–27], but humans do not need a frontal facing image to perform recognition. In fact, many caricatures exaggerate face angles, and humans are still able to perform recognition with them. To improve existing automated systems, we discuss using caricatures to construct surveillance systems that are robust to changes in angle, lighting, and accidental exaggeration, and the existing research in computer science that has already leveraged these images.

Past work in automated face recognition in computer science belongs to one of two system types: traditional machine learning or deep learning. Traditional machine learning techniques for face recognition include deep belief networks [28], metric learning [29, 30], and dimensionality reduction via principal component analysis (PCA) and/or linear discriminant analysis (LDA) [31]. With the rise of better hardware and GPU computing, deep learning has become the standard approach. Typical approaches use convolutional neural networks (CNNs) [32] or autoencoders [33] and may be combined with more traditional methods to increase performance [34]. Most methods try to increase interclass separation while decreasing intraclass variation, so that distinct class clusters are created in high-dimensional feature space [35]. The recognition task is complicated by pose variance, lighting changes, and changes in an individual's appearance [36, 37], as mentioned before. Nguyen and Bai [38] proposed a representation learning method to overcome the issues caused by recognition under nonideal conditions, finding that the cosine similarity between image representations can be used to improve face recognition under such conditions. Past research has also shown that soft biometrics, such as the use of prominent facial features or hairstyle, can be used to improve facial recognition technology [31, 39].
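
As a concrete illustration of the cosine-similarity idea, the sketch below compares two face embeddings and makes a verification decision against a threshold. The embeddings and the threshold value are stand-ins; in practice they would come from a trained face encoder and a validation set.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def same_identity(emb1, emb2, threshold=0.5):
    """Verification decision: embeddings at least this similar are treated as one identity."""
    return cosine_similarity(emb1, emb2) >= threshold

rng = np.random.default_rng(2)
probe = rng.normal(size=128)                  # stand-in for an embedding from any face encoder
gallery = probe + 0.1 * rng.normal(size=128)  # a slightly perturbed view of the same face
print(cosine_similarity(probe, gallery), same_identity(probe, gallery))
```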

Research in the area of feature learning and architecture found that facial recognition methods can be improved by utilizing ResNet CNN architectures, rather than VGG [40]. The same study discusses methods of face detection, facial alignment, and how to determine which ResNet structure is best for a selected dataset, with significant performance improvements on standardized datasets with wide variance. Unfortunately, the vast majority of work in the area of face recognition is dataset-dependent, and using proposed methods on other datasets results in unexpected behavior [4, 21, 22], which we discuss in Section 4.

Though cognitive psychology has shown that the use of caricatures improves human recognition, work in computer science using caricatures for face recognition is rather limited. As deep learning representations for face recognition have become more accurate, face generation systems have been proposed, typically using a generative adversarial network (GAN) [42–44]. GANs are a class of deep generative models [10]. To backpropagate loss through a GAN, every operation between the generator and the loss must be differentiable [45]. While using a GAN can be quite successful, it can also lead to mode collapse [46] and vanishing gradient behavior [47]. Additionally, while initial GAN results appear promising at first glance, the authors typically only report their best results and neglect to show that the vast majority of generated images are nonsensical [48]. An example of an image that is not representative of its target identity is shown in **Figure 4**. One approach to enforcing differentiability, so that better images are generated, is to use a kernel-based moment-matching scheme over a reproducing kernel Hilbert space (RKHS) [49]. This forces the real and generated images to have matched moments in the latent-feature space, which helps combat mode collapse while encouraging images that are descriptive and varied [49].
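
A minimal sketch of kernel moment matching is shown below: the squared maximum mean discrepancy (MMD) between batches of real and generated latent features under an RBF kernel, which is small only when the two feature distributions have matching moments in the RKHS. The feature dimensions, batch sizes, and kernel bandwidth are illustrative assumptions rather than values from ref. [49].

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs of rows in X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(real_feats, fake_feats, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two feature batches."""
    kxx = rbf_kernel(real_feats, real_feats, gamma).mean()
    kyy = rbf_kernel(fake_feats, fake_feats, gamma).mean()
    kxy = rbf_kernel(real_feats, fake_feats, gamma).mean()
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(3)
real = rng.normal(0.0, 1.0, size=(64, 256))   # latent features of real images (stand-in)
fake = rng.normal(0.5, 1.0, size=(64, 256))   # latent features of generated images (stand-in)
print("MMD^2:", mmd2(real, fake, gamma=1.0 / 256))
```

Minimizing this quantity with respect to the generator pulls the generated feature distribution toward the real one across many moments at once, which is what discourages mode collapse.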

Despite these limitations, recent work generates a caricature from veridical images using GANs [41, 50–52], but does not try to understand or utilize caricatures to improve verification or recognition. Some work attempts to exploit caricatures to improve verification; however, that work uses a small dataset and does not use modern deep learning methods [53]. Work in verification and recognition improvement using caricatures, rather than caricature generation, is relatively new and not well explored. Past work in the field [54] introduced a method to extract facial attribute features from photos but required manual labeling of facial attribute features on caricatures, which is time-consuming. Furthermore, the study computed feature importance

#### **Figure 4.**

*An example of a poorly GAN-generated caricature produced by WarpGan [41], one of the current state-of-the-art caricature generation systems. Note that this caricature (right) is not identifiable as Helena Bonham Carter (left).*

using genetic algorithms, which are extremely slow compared to deep learning. In the field of cognitive psychology, [55] showed that facial recognition improves when PCA is applied to all of an identity's images and the resulting representations are averaged. This indicates that (1) human memory holds the average of a person's face after multiple exposures; and (2) PCA is one method that might be applied when creating an automated face recognition system. The most comprehensive published work in automated caricature verification is WebCaricature [56], which provides an end-to-end framework for face verification and identification using caricatures, though we discuss in Section 4 the use of flawed data in their study.
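
A minimal sketch of the PCA-then-average idea is given below, assuming a stack of flattened images for a single identity; in practice the PCA basis would be fit on many identities and images before averaging each identity's projections into a template.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Stand-in data: 20 images of one identity, each flattened to a 1,024-dimensional vector.
identity_images = rng.normal(size=(20, 1024))

# Project all of the identity's images into a low-dimensional PCA space...
pca = PCA(n_components=10)
projected = pca.fit_transform(identity_images)

# ...then average the projections to obtain a single, stable template for that identity.
identity_template = projected.mean(axis=0)
print(identity_template.shape)  # (10,)
```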

Though caricatures have not been widely used in automated face recognition systems, facial attribute recognition is a well-researched task in the field [57–62]. Recent research has focused on performing attribute recognition, and introducing new datasets and deep learning frameworks [63–66]. The current state-of-the-art facial attribute prediction methods include "Walk and Learn," which pretrains a network on face verification data and then fine-tunes it on attribute recognition [65], as opposed to pretraining on object data [67]. Work has also shown that dataset imbalance, which we discuss in Section 4, can be ameliorated by using a multi-task network with the mixed objective loss [66]. Attribute relationships have also been used within deep neural networks to improve prediction [64]. Unfortunately, current work is focused on facial attribute identification and prediction. To date, there has not been any work in using *prominent* facial features to perform recognition, despite the fact that research in cognitive psychology has shown that human recognition relies on prominent facial features (Section 2). Thus, we argue that existing surveillance systems could be improved by creating systems capable of using prominent facial features so that models are better trained to focus on the same features that humans use to identify faces.
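
One simple way to counter attribute imbalance, sketched below, is to weight each attribute's positive class by its rarity using PyTorch's built-in `pos_weight` mechanism. This is not the exact mixed objective loss of ref. [66]; it is only a common balancing scheme, and the attribute frequencies used here are made up.

```python
import torch
import torch.nn as nn

# Stand-in attribute labels for a training set: 1,000 faces x 5 binary attributes,
# with very different base rates per attribute.
labels = (torch.rand(1000, 5) < torch.tensor([0.9, 0.5, 0.2, 0.05, 0.7])).float()

# Weight each attribute's positive class inversely to its frequency so common
# attributes do not dominate the loss.
pos_freq = labels.mean(dim=0)
pos_weight = (1.0 - pos_freq) / pos_freq.clamp(min=1e-6)

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(32, 5, requires_grad=True)   # outputs of a hypothetical attribute network
batch_labels = labels[:32]
loss = criterion(logits, batch_labels)
loss.backward()
print(float(loss))
```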

#### **4. Data collection methods**

From the perspective of this chapter, we care about the application of caricature data to improve surveillance systems. Generally speaking, surveillance systems have at least one frame with an un-exaggerated snapshot of an identity, similar to a photo. Therefore, caricature research typically constructs datasets by collating a caricature set and a matching set of real photos for each identity. Past research has focused on curating datasets with as many images as possible using web scraping [49, 50, 56]. For real images, there are existing methods to remove duplicate images and images of low quality. Unfortunately, the same is not true for caricatures. Since the field is relatively new, systems have not yet been built to recognize image duplicates, and even if one were constructed, it would not handle the issue of under-exaggeration or representation fidelity. Of the previously cited works, none ensure that images are of acceptable quality, that images in the caricature group are actually caricatures and not some other form of art, or that images are actually of the target identity [49, 50, 56]. In some cases, datasets inaccurately incorporate character representations of an identity, rather than the actual identity; for example, the WebCaricature dataset [56] labels general images of Harry Potter, a cultural icon, as Harry Potter rather than gathering images of Daniel Radcliffe. This introduces a high degree of variability, as the character "Harry Potter" is not always depicted as Daniel Radcliffe, just as Daniel Radcliffe is not always seen portraying Harry Potter (**Figure 5**). Each of these conditions is critical to creating a dataset that provides an accurate caricature


#### **Figure 5.**

*An instance of the character representation of a fictional character (Harry Potter) not matching the affiliated actor (Daniel Radcliffe). (a) veridical image of Daniel Radcliffe. (b) pop culture representation of Harry Potter.*

#### **Figure 6.**

*Examples of variation in caricature representation of the same person (Patrick Stewart) taken from [56]. The leftmost image is the photo (veridical face) and subsequent images are caricatures. Note that while there is wide variation in the representation of the veridical image, the identity of each of the caricatures is still obvious.*

representation of the supplied identities. In a deep learning system, data quality directly affects training and test performance [67, 68]. This means that ensuring that data is representative of each identity is exceedingly important; otherwise, the recognition task becomes unnecessarily more difficult and possibly more biased.
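
For the photo half of a dataset, near-duplicate removal can be approximated with a perceptual difference hash, as sketched below; caricatures, as noted above, would need something more robust to artistic variation. The file paths and Hamming-distance threshold are placeholders.

```python
import numpy as np
from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: resize to (hash_size+1) x hash_size grayscale, compare adjacent pixels."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    px = np.asarray(img, dtype=np.int16)
    bits = (px[:, 1:] > px[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def near_duplicates(paths, max_hamming=5):
    """Flag pairs of images whose hashes differ in at most max_hamming bits."""
    items = [(p, dhash(p)) for p in paths]
    pairs = []
    for i, (p1, h1) in enumerate(items):
        for p2, h2 in items[i + 1:]:
            if bin(h1 ^ h2).count("1") <= max_hamming:
                pairs.append((p1, p2))
    return pairs

# Usage (paths are placeholders for real photo files):
# print(near_duplicates(["photos/a.jpg", "photos/b.jpg", "photos/c.jpg"]))
```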

The caricature recognition task is additionally complicated by the fact that caricatures are artistic renderings. This means that artists may choose to exaggerate some facial features over others (**Figure 6**). Furthermore, as an artistic medium, classifying an image as a "caricature" as opposed to some other art form, like painting, can be difficult, as shown in **Figure 7**. These two issues highlight a core problem in data collection for caricature-trained systems: (1) caricatures should be caricatures and not some other art form, and (2) the caricature should actually resemble the target identity. Since we propose using caricatures to construct surveillance systems that are robust to changes in angle, lighting, and accidental exaggeration, caricatures should still be fairly exaggerated. Using caricatures with variation in style and degree of exaggeration will improve a surveillance system's ability to accommodate the large exaggerations that typically hurt performance in existing state-of-the-art systems. Because caricature drawing is an artistic medium, the same person can be portrayed with a wide array of variations in facial features that are over- or under-exaggerated, as seen in **Figure 6**.

Unfortunately, the construction of a good caricature dataset is slow and labor intensive. Each caricature needs to be assessed for quality, and currently, the methods

#### **Figure 7.**

*Images from WebCaricature [56] that are not of acceptable quality to be included in a computer vision dataset. The first row contains images where the identity is not immediately obvious without knowing who the person is. The second row contains images where the image is not a caricature, but rather a painting, drawing, or cartoon.*

to do that are manual. That means checking every caricature for resemblance to the target identity and for art style. Additionally, many state-of-the-art surveillance systems are reliant on facial landmarks, which can already be inaccurate in normal photos [69, 70]; this inaccuracy is exacerbated in caricatures, particularly caricatures with a high degree of exaggeration across internal facial features.

It is also critical that datasets are constructed in a way that limits bias as much as possible, as any dataset bias will be trained into a surveillance system. For example, in 2015, Google released an image labeler that had been poorly trained, so that it mislabeled human faces as gorillas [71]. Company representatives later acknowledged that this example of racism was caused by data the system was trained on.

Since many caricatures reflect cultural norms by trying to exaggerate consistent prominent features, racist interpretations are more likely. For example, because Bruce Lee is an Asian-American man, many racist caricatures overly exaggerate the degree of eye closure and inferred mouth pout. An exaggeration is considered racist when it does nothing to improve the machine representation of the target identity while enforcing stereotypes that exist in popular culture (**Figure 8**). Additionally, since most surveillance systems see a variety of genders, races, and angles, it is important that the dataset used to train the surveillance system is as representative as possible. The ACLU has pointed out that existing surveillance systems are more prone to misidentifying women and people of color [72]. Additionally, many police departments use mugshots to create their databases, which perpetuates the issue of racism, since people of color are up to four times as likely as Caucasian suspects to be arrested for the same crime [72]. This means that most police surveillance systems use datasets that are overwhelmingly composed of citizens of color, making it easier to identify them than Caucasian citizens [72]. Additional work by Buolamwini and Gebru in 2018 found that datasets curated by sources other than law enforcement are composed of overwhelmingly white male subjects. This data imbalance leads to high accuracy in identifying white male subjects, but high rates of misidentification of women and people of color, and especially women of color [73]. The US Department of Commerce later reported findings consistent with Buolamwini and Gebru [74]. In a surveillance system, particularly surveillance systems used by law enforcement,


#### **Figure 8.**

*An instance of caricaturists exaggerating racist components of Bruce Lee's features that do nothing to enhance the identifiability of the image. (a) veridical image of Bruce Lee. (b) racist representation of Bruce Lee.*

#### **Figure 9.**

*In some cases, veridical image and caricature image identity may be difficult to distinguish, as shown by Katy Perry (left) and Zooey Deschanel (right) in this figure. Thus, it is imperative to introduce a significant amount of quality data to allow a surveillance system to differentiate between similar faces.*

misidentification can have life-long impacts on a suspect's quality of life and likelihood to be reincarcerated.

Previous works have focused primarily on gathering as much data as possible [49, 50, 56], and while there is certainly a data problem in the machine learning field, the issue of bad data is far greater when constructing a system meant to surveil. In order for systems to learn accurate representations, those systems must be trained on accurate representations. That is, if we want a system that recognizes that two similar-looking people, such as Katy Perry and Zooey Deschanel, are different (see **Figure 9**), we need to supply a significant amount of representative data, and that applies to every race, gender, and age. However, datasets gathered using a web scraper en masse tend to use celebrities, and celebrities in Western culture are typically young, Caucasian, and attractive. If data were gathered with complete disregard for dataset balance, the face representation constructed by that system would likely perform well at identifying young, attractive, Caucasian people and struggle with images of anyone who does not fit that description [67], a conclusion supported by past research [75]. While techniques like data balancing exist [64, 76], those techniques are not typically capable of fully handling the bias present in a dataset. Thus, datasets should be constructed with as much balance in gender, age, ethnicity, and image type (caricature vs. veridical) as possible to ensure that the trained system is as fair as possible, *especially in systems meant to have any applicability to law enforcement*. This can be difficult to do, depending on the dataset content and availability of applicable data on the internet.
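
One simple balancing technique is to weight each sample by the inverse frequency of its demographic and image-type cell, as sketched below with made-up metadata; as noted above, such reweighting reduces but does not remove dataset bias.

```python
import pandas as pd

# Hypothetical metadata for a small caricature/photo dataset (placeholder values).
meta = pd.DataFrame({
    "identity": ["a", "a", "b", "b", "c", "c", "c", "d"],
    "gender":   ["f", "f", "m", "m", "f", "f", "f", "m"],
    "group":    ["white", "white", "white", "asian", "black", "black", "white", "white"],
    "type":     ["caricature", "photo", "photo", "photo", "caricature", "photo", "photo", "photo"],
})

# Weight each sample by the inverse frequency of its (gender, group, type) cell,
# so over-represented cells do not dominate training.
cell_counts = meta.groupby(["gender", "group", "type"])["identity"].transform("count")
meta["sample_weight"] = 1.0 / cell_counts
print(meta)
```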

Since caricatures are a unique data source, gathering relevant, representative data is made even more difficult. Currently, the largest publicly available dataset for caricature verification and recognition is WebCaricature [56]. We have already outlined why the mindset of "quantity over quality" is detrimental to creating a fair recognition system. The WebCaricature dataset [56] illustrates this point well. WebCaricature consists of 6,042 caricatures and 5,974 veridical images over 252 identities. At a cursory glance, this dataset seems like a great resource due to its sheer size. However, we find that there are many quality issues with the dataset itself, examples of which can be seen in **Figure 7**. First, the dataset does not check that images fairly represent the target identity; in other words, the target identity of a caricature is not always immediately clear. Because these caricatures are not representative, they should not be included in the dataset. Second, both the caricature portion and the photo portion of the dataset contain images that are not of their respective type. For example, there are multiple images labeled as caricatures that are actually drawings, cartoons, or veridical images. Third, there are many instances where the dataset contains duplicate images or images that are not of the target identity. Fourth, the authors did not collect a dataset that was balanced in terms of gender, ethnicity, or age, making it (and any system trained on it) inherently biased. Fifth, and finally, there are many included identities that have dozens of veridical images and only a single caricature image. This introduces a bias toward photo representations into the dataset and any system that uses it. After careful analysis, it becomes clear that the WebCaricature dataset's focus on quantity has led to a marked decrease in quality that would unduly bias any surveillance system that uses it.

Thus, we propose that the following list of questions be used to construct future caricature datasets:

1. Is the image of acceptable quality, and is the target identity immediately recognizable from it?
2. Is the image actually a caricature (or, for the photo set, a veridical photograph), rather than a painting, drawing, or cartoon?
3. Does the image depict the target identity, rather than a fictional character or another person?
4. Is the image a duplicate, or near duplicate, of an image already in the dataset?
5. Is the dataset balanced in terms of gender, ethnicity, and age?
6. Does each identity have a reasonable balance of caricature and veridical images?

We concede that most publicly sourced datasets from Western cultures will have an easier time collecting images of white individuals. That means that maintaining racial balance will restrict the number of images of Caucasian subjects that can be collected, since a roughly equal balance is necessary to ensure fairness. In terms of dataset size, this means sacrificing quantity for quality in the interest of creating a fair surveillance system.

#### **5. Using caricatures for prominent feature recognition in surveillance systems**

As discussed in Section 3, existing research does not address prominent facial feature recognition, despite the fact that cognitive psychology has been trying to better characterize prominent features for decades. The most common approach to designing a face recognition system architecture is to first detect any faces in the image and then landmark the detected faces. The landmarks are then given as input to some machine learning algorithm, such as a deep belief network [28], convolutional neural network [56], or genetic algorithm [53]. We believe that this same generalized process can continue to be used, so long as the field addresses the gap between prominent feature research in cognitive psychology and computer vision. This can be accomplished by using caricatures to better model prominent feature exaggeration.
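
A sketch of this detect-landmark-classify pipeline is shown below using dlib's face detector, its publicly available 68-point landmark model, and a scikit-learn SVM. This is one plausible instantiation, not the specific systems cited above; the model file must be downloaded separately, and the training lines are commented out because the image paths and labels are placeholders.

```python
import numpy as np
import dlib
from sklearn.svm import SVC

detector = dlib.get_frontal_face_detector()
# Pretrained 68-point landmark model from dlib; must be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_vector(image_path):
    """Detect the largest face in an image and return a normalized, flattened 68x2 landmark vector."""
    img = dlib.load_rgb_image(image_path)
    faces = detector(img, 1)
    if not faces:
        return None
    face = max(faces, key=lambda r: r.area())
    shape = predictor(img, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(shape.num_parts)], dtype=float)
    pts -= pts.mean(axis=0)               # remove translation
    pts /= np.linalg.norm(pts) + 1e-8     # remove scale
    return pts.flatten()

# Hypothetical training step (image_paths and identity_labels are placeholders):
# X = np.stack([landmark_vector(p) for p in image_paths])
# clf = SVC(kernel="rbf").fit(X, identity_labels)
# print(clf.predict(X[:1]))
```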

Future work using caricatures should seek to address this critical gap in facial recognition by doing the following:

1. Use landmarks on caricature and veridical images to measure how far each facial feature deviates from the average, controlling for gender and ethnicity, and use those deviations to identify and label prominent features.
2. Use the resulting prominent feature labels, and prominent feature recognition as an additional task, to train more robust, less biased surveillance systems.
3. Use the identified prominent features to improve GAN-based caricature generation, and use the generated images to ameliorate the data quantity problem.

We address each of these points below, with suggestions for courses of future research.

Future research should use landmarking on caricature and veridical images to measure feature deviation from the average. Controlling for well-known conditions that affect feature size, such as gender and ethnicity, measuring the configural and size properties of each facial feature can provide insight into what an image's prominent features are. For example, Helena Bonham Carter's eyes are large when compared to other celebrities', and they are also large in comparison to the rest of her facial features. Landmarks can be used to quantitatively analyze the relative size of features and their deviation from average in order to identify prominent features. It is worth noting that research that quantitatively analyzes feature size and shape *must* control for gender and ethnicity in order to create better models; past medical research has shown that nose shape, for example, is highly correlated with race [77]. By controlling for variables such as race and gender, systems trained to recognize prominent features can be better attuned to small differences between features in different subjects. We warn that systems that do not implement this control in their experiments are likely to miss fine-grained feature differences and may perpetuate bias, which is an obvious downside to the use of prominent facial features if they are not used carefully. Providing landmark data to a simple machine learning model, such as a support vector machine (SVM), or to a deep convolutional neural network should provide a baseline method for prominent facial feature recognition. This baseline should be simple to implement and is a first step in improving prominent feature recognition and labeling. Preliminary results using simple models may look fairly underwhelming because it is unlikely that they will optimally handle a large amount of incoming landmark data; however, results that are better than chance will indicate that prominent feature usage is worth pursuing.
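
The sketch below illustrates the group-controlled deviation analysis described above: each landmark-derived measurement is z-scored within its gender-ethnicity group, and the feature with the largest absolute deviation is flagged as prominent. The measurements and group labels are toy placeholders.

```python
import pandas as pd

# Hypothetical per-image feature measurements derived from landmarks
# (e.g., eye width, nose length), plus demographic covariates to control for.
df = pd.DataFrame({
    "gender":      ["f", "f", "m", "m", "f", "m"],
    "ethnicity":   ["a", "b", "a", "a", "b", "b"],
    "eye_width":   [1.10, 0.95, 1.00, 1.30, 0.90, 1.05],
    "nose_length": [0.80, 0.85, 1.00, 0.95, 0.78, 1.20],
})

features = ["eye_width", "nose_length"]
groups = df.groupby(["gender", "ethnicity"])[features]

# z-score each measurement against its own demographic group's mean and spread,
# so "prominent" means deviant within the group rather than across all people.
z = (df[features] - groups.transform("mean")) / (groups.transform("std").fillna(1.0) + 1e-8)

# The most prominent feature of each face is the one with the largest absolute deviation.
df["prominent_feature"] = z.abs().idxmax(axis=1)
print(df[["gender", "ethnicity", "prominent_feature"]])
```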

The developed prominent feature recognition method can be used to train surveillance systems that are capable of leveraging prominent facial features. We argue that by using prominent facial feature labels and this prominent feature methodology, deep learning models used in surveillance can improve their performance. Additionally, prominent feature recognition methods can be used as an additional task in existing surveillance systems, which should ultimately make a more robust, less overfit model [31, 78, 79]. This second step is critical to better mimicking human face perception and should improve most recognition and surveillance systems. Again, if prominent features are used to train a surveillance system, race and gender need to be controlled as part of the proposed multitask network so that the system is not unintentionally biased.
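
A toy sketch of such a multitask setup is given below: a shared trunk over landmark vectors with one head for identity and one head for prominent-feature prediction, trained with a summed loss. A real surveillance system would use a CNN over images and many more identities; the dimensions and labels here are placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Shared trunk with two heads: identity recognition and prominent-feature prediction."""
    def __init__(self, in_dim=136, n_identities=100, n_features=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.identity_head = nn.Linear(128, n_identities)
        self.prominence_head = nn.Linear(128, n_features)   # one logit per candidate feature

    def forward(self, x):
        h = self.trunk(x)
        return self.identity_head(h), self.prominence_head(h)

model = MultiTaskFaceNet()
x = torch.randn(8, 136)                          # e.g., flattened 68x2 landmark vectors
id_labels = torch.randint(0, 100, (8,))
prominence_labels = (torch.rand(8, 4) > 0.7).float()

id_logits, prom_logits = model(x)
loss = nn.CrossEntropyLoss()(id_logits, id_labels) + \
       nn.BCEWithLogitsLoss()(prom_logits, prominence_labels)
loss.backward()
print(float(loss))
```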

After a model is in place to identify prominent features, the features identified from a veridical image can be used to better train existing GAN models used for caricature generation. This will create caricatures with higher fidelity to the veridical image. These generated images can, in turn, be used to ameliorate the data quantity problem that most surveillance systems have.

We also note that the use of caricatures to improve existing system architectures may prove difficult at first, especially while collated datasets remain relatively small; there may simply not be enough data. Therefore, it is imperative that large, well-constructed datasets be created prior to any landmarking or architecture improvement. Additionally, initial research in caricature usage will likely prove to be slow, since all data will need to be manually landmarked until a proper landmarking model is devised for caricatures. Aside from data scarcity and the time necessary to manually create the dataset and landmarking systems, it is also likely that initial systems, no matter how well controlled, may end up slightly biased and better at identifying identities of specific races, genders, and ages. This is simply because it is easiest to find existing caricature data through web scraping, and web scraping will inherently lead to more images of celebrities, which will be culturally skewed in favor of one race over another. In order to control this, data augmentation methods appropriate to caricatures should be explored. Finally, we note that the general subjectivity of what counts as an acceptable caricature, as discussed in Section 4, may lead to a level of variation across curated datasets and unintentional bias.

Given that most computer vision advancements have been made by a better understanding of human perception [1], we argue that utilizing cognitive psychology's findings about caricatures to our advantage will result in more robust computer vision systems, and in turn, more robust surveillance systems. By developing a method to identify prominent features, surveillance systems can better leverage the same mechanisms that human face perception uses to improve recognition.

#### **6. Conclusion**

Human face recognition relies on the use of prominent facial features. We outline past research in cognitive psychology that should be leveraged to improve surveillance systems. In particular, we argue that the use of prominent facial features is critical to better modeling human face perception. In addition, Section 2 discusses the importance of using a holistic face model and internal facial features to construct robust recognition systems. In Section 3, we discuss past research in computer science that leverages the use of caricatures. We note that research in this area is hindered by the lack of study of prominent facial features, and that, in most cases, existing research in computer science that uses caricatures is rather limited and of low quality. Next, we discuss the importance of collecting a dataset that is not only large but also balanced (Section 4). We argue that dataset balance in terms of gender, race, age, and image type is critical to limiting bias within surveillance systems trained on these datasets and discuss the past research that supports our stance. Additionally, we provide a series of guidelines for caricature dataset generation, so that future caricature datasets are of acceptable quality for use in surveillance system training. Finally, we outline the ways in which caricatures can be used to improve facial recognition systems. In particular, we argue that improved prominent feature labeling and recognition is critical, so that these features can be used to better train multitask surveillance systems.

#### **Acknowledgements**

This material is based upon work supported by the National Science Foundation under Grant IIS-1909707.

#### **Author details**

Sara R. Davis and Emily M. Hand\* University of Nevada, Reno, Reno, NV, USA

\*Address all correspondence to: emhand@unr.edu

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Scheirer WJ, Anthony SE, Nakayama K, Cox DD. Perceptual annotation: Measuring human vision to improve computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014;**36**(8): 1679-1686

[2] Wright T. A History of Caricature and Grotesque in Literature and Art. Virtue Brothers; 1865

[3] Dawel A, Wong TY, McMorrow J, Ivanovici C, He X, Barnes N, et al. Caricaturing as a general method to improve poor face recognition: Evidence from low-resolution images, other-race faces, and older adults. Journal of Experimental Psychology Applied. 2019; **25**(2):256-279

[4] Sun YK, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 1891-1898

[5] Lewis MB. Are caricatures special? Evidence of peak shift in face recognition. European Journal of Cognitive Psychology. 1999;**11**(1):105-117

[6] Mauro R, Kubovy M. Caricature and face recognition. Memory & Cognition. 1992;**20**(4):433-440

[7] Rhodes G, Brennan S, Carey S. Identification and ratings of caricatures: Implications for mental representations of faces. Cognitive Psychology. 1987; **19**(4):473-497

[8] Rhodes G, Tremewan T. Understanding face recognition: Caricature effects, inversion, and the homogeneity problem. Visual Cognition. 1994;**1**(2–3):275-311

[9] Alex H, Hancock PJB, Kittler J, Langton SRH. Improving discrimination and face matching with caricature. Applied Cognitive Psychology. 2013; **27**(6):725-734

[10] Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative Adversarial Networks, 2014

[11] Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, 2015

[12] Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 2015

[13] Maurer D, Le Grand R, Mondloch C. The many faces of configural processing. Trends in Cognitive Sciences. 2002;**6**: 255-260

[14] Tanaka JW, Sengco JA. Features and their configuration in face recognition. Memory & Cognition. 1997;**25**:583-592

[15] Tanaka J, Farah M. Parts and wholes in face recognition. The Quarterly journal of experimental psychology. A, Human experimental psychology. 1993; **46**:225-245

[16] Tanaka JW, Simonyi D. The "parts and wholes" of face recognition: A review of the literature. Quarterly Journal of Experimental Psychology. 2016;**69**(10):1876-1889

[17] Ellis H, Shepherd J, Davies G. Identification of familiar and unfamiliar faces from internal and external features: Some implications for theories of face recognition. Perception. 1979;**8**:431-439

[18] Andrews T, Davies-Thompson J, Kingstone A, Young A. Internal and external features of the face are represented holistically in face-selective regions of visual cortex. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience. 2010;**30**:3544-3552

[19] Longmore CA, Liu CH, Young AW. The importance of internal facial features in learning new faces. Quarterly Journal of Experimental Psychology. 2015;**68**(2):249-260

[20] Murphy J, Ipser A, Gaigg S, Cook R. Exemplar variance supports robust learning of facial identity. Journal of Experimental Psychology. Human Perception and Performance. 2015;**41**:4

[21] Novak R, Bahri Y, Abolafia DA, Pennington J, Sohl-Dickstein J. Sensitivity and Generalization in Neural Networks: An Empirical Study 2018

[22] Wang M, Deng W. Deep face recognition: A survey. Neurocomputing, 2021;**429**:215-244

[23] Zhao J, Zhou Y, Li Z, Wang W, Chang K-W. Learning gender-neutral word embeddings. CoRR, abs/ 1809.01496. 2018

[24] Abate AF, Nappi M, Riccio D, Sabatino G. 2d and 3d face recognition: A survey. Pattern Recognition Letters. 2007;**28**:1885-1906

[25] Jourabloo A, Liu X. Large-pose face alignment via cnn-based dense 3d model fitting. In: IEEE Conference on Computer Vision and Pattern Recognition. 2016

[26] Zhu X, Lei Z, Liu X, Shi H, Li SZ. Face alignment across large poses: A 3d solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 146-155

[27] Bowyer KW, Chang K, Flynn P. A survey of approaches and challenges in 3d and multi-modal 3d+ 2d face recognition. Computer Vision and Image Understanding. 2006;**101**:1-15

[28] Huang GB, Lee H, Learned-Miller EG. Learning hierarchical representations for face verification with convolutional deep belief networks. CVPR; 2012. pp. 2518-2525

[29] Cai X, Wang C, Xiao B, Xue C, Zhou J. Deep nonlinear metric learning with independent subspace analysis for face verification. In: Proceedings of the 20th ACM International Conference on Multimedia. New York, NY, USA: Association for Computing Machinery; 2012. pp. 749-752

[30] Guillaumin M, Verbeek J, Schmid C. Is that you? metric learning approaches for face identification. In: 2009 IEEE 12th International Conference on Computer Vision. 2009. pp. 498-505

[31] Zhang H, Beveridge JR, Draper BA, Phillips PJ. On the effectiveness of soft biometrics for increasing face verification rates. Computer Vision and Image Understanding. 2015;**137**:50-62

[32] Taylor GW, Fergus R, LeCun Y, Bregler C. Convolutional learning of spatio-temporal features. In: Daniilidis K, Maragos P, Paragios N, editors. Computer Vision – ECCV 2010. Berlin, Heidelberg: Springer; 2010. pp. 140-153

[33] Vincent P, Larochelle H, Bengio Y, Manzagol P-A. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, ICML '08. New York, NY, USA: Association for Computing Machinery; 2008. pp. 1096-1103

[34] Dong Y, Lei Z, Stan ZL. Towards pose robust face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2013

[35] Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. New York: Wiley; 2001

[36] Cao Z, Yin Q, Tang X, Sun J. Face recognition with learning-based descriptor. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2010. pp. 2707-2714

[37] Hu J, Lu J, Tan Y-P. Discriminative deep metric learning for face verification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 1875-1882

[38] Nguyen HV, Bai L. Cosine similarity metric learning for face verification. In: Kimmel R, Klette R, Sugimoto A, editors. Computer Vision – ACCV 2010. Berlin, Heidelberg: Springer; 2011. pp. 709-720

[39] Thom N, Hand EM. Facial Attribute Recognition: A Survey. 2020

[40] Hsiao S-H, Jang J-SR. Improving resnet-based feature extractor for face recognition via re-ranking and approximate nearest neighbor. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). 2019. pp. 1-8

[41] Shi Y, Deb D, Jain AK. Warpgan: Automatic caricature generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 10762-10771

[42] Gauthier J. Conditional generative adversarial nets for convolutional face generation. In: Convolutional Neural Networks for Visual Recognition. 2014. p. 2

[43] Li M, Zuo W, Zhang D. Convolutional network for attributedriven and identity-preserving human face generation. arXiv preprint arXiv: 1608.06434, 2016

[44] Lu Y, Tai Y-W, Tang C-K. Attributeguided face generation using conditional cyclegan. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. pp. 282-297

[45] Wang K, Wan X. Sentigan: Generating sentimental texts via mixture adversarial networks. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. 2018. pp. 4446-4452

[46] Metz L, Poole B, Pfau D, Sohl-Dickstein J. Unrolled generative adversarial networks. In: 5th International Conference on Learning Representations. Toulon, France: ICLR, 2017

[47] Arjovsky M, Bottou L. Towards Principled Methods for Training Generative Adversarial Networks 2017

[48] Taphorn A. Gan and Their Chances and Risks in Face Generation and Manipulation. 2020

[49] Zhang Y, Gan X, Fan K, Chen X, Henao R, Shen D, Carin L. Adversarial Feature Matching for Text Generation. 2017

[50] Jang W, Ju G, Jung Y, Yang J, Tong X, Lee S. Stylecarigan: Caricature generation via stylegan feature map modulation. arXiv preprint arXiv:2107.04331. 2021

[51] Chiang P-Y, Liao W-H, Li T-Y. Automatic caricature generation by analyzing facial features. In: Proceeding of 2004 Asia Conference on Computer Vision (ACCV2004). Korea; 2004

[52] Ye Z, Yi R, Yu M, Zhang J, Lai Y-K, Liu Y-J. 3d-carigan: An end-to-end solution to 3d caricature generation from face photos. IEEE Transactions on Visualization and Computer Graphics, abs/2003.06841. 2021

[53] Klare BF, Bucak SS, Jain AK, Akgul T. Towards automated caricature recognition. In: 2012 5th IAPR International Conference on Biometrics (ICB). 2012. pp. 139-146

[54] Abacı B, Akgül T. Matching caricatures to photographs. Signal Image and Video Processing. 2015;**9**:1-9

[55] Burton AM, Jenkins R, Hancock PJB, White D. Robust representations for face recognition: The power of averages. Cognitive Psychology. 2005;**51**:256-284

[56] Huo J, Li W, Shi Y, Yang G, Yin H. Webcaricature: A benchmark for caricature recognition. In: British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK: BMVA Press; 2018. p. 223

[57] Berg T, Belhumeur PN. Poof: Partbased one-vs.-one features for finegrained categorization, face verification, and attribute estimation. Computer Vision and Pattern Recognition. 2013: 955-962

[58] Berg T, Belhumeur PN. Poof: Part-based one-vs.-one features for finegrained categorization, face verification, and attribute estimation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE Computer Society; 2013. pp. 955-962

[59] Kumar N, Belhumeur PN, Nayar SK. Facetracer: A search engine for large collections of images with faces. In: Forsyth DA, Torr PHS, Zisserman A, editors. Computer Vision - ECCV 2008, 10th European Conference on Computer Vision, Marseille, France, Proceedings, Part IV, volume 5305 of Lecture Notes in Computer Science. Springer; 2008. pp. 340-353

[60] Kumar N, Berg AC, Belhumeur PN, Nayar SK. Attribute and simile classifiers for face verification. In: IEEE 12th International Conference on Computer Vision, ICCV 2009. Kyoto, Japan: IEEE Computer Society; 2009. pp. 365-372

[61] Kumar N, Berg AC, Belhumeur PN, Nayar SK. Describable visual attributes for face verification and image search. In: PAMI. 2011

[62] Layne R, Hospedales TM, Gong S, Mary Q. Person re-identification by attributes. In Bowden R, Collomosse JP, Mikolajczyk K, editors. British Machine Vision Conference, BMVC 2012, Surrey, UK: BMVA Press; 2012. pp. 1-11

[63] Dhar S, Ordonez V, Berg TL. High level describable attributes for predicting aesthetics and interestingness. In: The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011. Colorado Springs, CO, USA: IEEE Computer Society; 2011. pp. 1657-1664

[64] Hand EM, Chellappa R. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In: Singh S, Markovitch S, editors. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. San Francisco, California, USA: AAAI Press; 2017. pp. 4068-4074

[65] Liu Z, Luo P, Wang X, Tang X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile: IEEE Computer Society; 2015. pp. 3730-3738

[66] Rudd EM, Gunther M, Boult TE. Moon: A mixed objective optimization network for the recognition of facial attributes. In: Leibe B, Matas J, Sebe N, Welling M, editors. Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V, volume 9909 of Lecture Notes in Computer Science, Amsterdam, The Netherlands: Springer; 2016. pp. 19-35

[67] Cortes C, Jackel LD, Chiang W-P. Limits on learning machine accuracy imposed by data quality. In: Advances in Neural Information Processing Systems. 1994. p. 7

[68] Jain B, Patel H, Nagalapatti L, Gupta N, Mehta S, Guttula S, Mujumdar N, et al. Overview and importance of data quality for machine learning tasks. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. pp. 3561-3562

[69] Cummaudo M, Guerzoni M, Marasciuolo L, Gibelli D, Cigada A, Obertovà Z, et al. Pitfalls at the root of facial assessment on photographs: A quantitative study of accuracy in positioning facial landmarks. International Journal of Legal Medicine. 2013;**127**(3):699-706

[70] Lin J, Xiao L, Wu T. Face recognition for video surveillance with aligned facial landmarks learning. Technology and Health Care. 2018;**26**(S1):169-178

[71] Google apologises for photos app's racist blunder, July 2015

[72] Crockford K. How is Face Recognition Surveillance Technology Racist?: News & Commentary, Jun 2020

[73] Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In: PMLR. 2018

[74] Grother P, Ngan M, Hanaoka K. Face Recognition Vendor Test (FRVT). NIST

[75] Lingenfelter B, Hand EM. Improving evaluation of facial attribute prediction models. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). Jodhpur, India: IEEE; 2021. pp. 1-7

[76] Gustavo EAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. 2004;**6**:20-29

[77] Suhk JH, Park JS, Nguyen AH. Nasal analysis and anatomy: Anthropometric proportional assessment in asiansaesthetic balance from forehead to chin, part i 2015

[78] Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning. Advances in Neural Information Processing Systems. 2007;**2007**:41-48

[79] Ranjan R, Patel VM, Chellappa R. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR. 2016;abs/1603.01249

#### **Chapter 7**

## Spatial Change Recognition Model Using Image Processing and Fuzzy Inference System to Remote Sensing

*Majid Mirbod*

#### **Abstract**

Since the advent of satellites that image the surface of the Earth, a huge database of Earth-surface imagery has been available to researchers in many sciences, and remote sensing has gradually attracted the attention of researchers across disciplines. Geography, environmental science, civil engineering, and other fields each analyze visual data of the Earth's surface from the perspective of their own domain. In this research, the problem of recognizing spatial changes, locating them, and calculating the percentage of change at ground level is considered, and the model presented is based on machine vision, image processing, and a fuzzy inference system for revealing features. This work falls into the category of applied research, and an application is presented that could extend software such as Google Earth as an optional feature. Another advantage of the model is its ease of use compared with specialized software such as ArcGIS, and this is the novelty of this research.

**Keywords:** fuzzy inference system, spatial change recognition, remote sensing, image processing, remote sensing application

#### **1. Introduction**

This chapter presents a spatial change recognition model that uses satellite images, image processing, and a fuzzy inference system as a remote sensing application; it belongs to the applied research category. Recognizing change in natural phenomena is very important for managing and preserving the environment. Earth-observation satellites have produced large volumes of images of the Earth's surface and made them available to researchers, for example through Google Earth. However, an image taken by a satellite shows the state of the Earth only at the moment of capture; to learn whether an area has changed, the image must be compared precisely with earlier images of the same area. Change recognition in a study area can therefore give an idea of how to manage and control the

environment in that area. Change recognition in remotely sensed images is an active research area [1]. Remote sensing of the Earth's surface and change recognition here mean detecting changes on the Earth's surface by processing images of the same geographical area acquired at different times. Applications include forest or vegetation change, forest mortality, defoliation and damage assessment, wetland change, urban expansion, crop monitoring, changes in glacier mass balance, environmental change, deforestation, regeneration, and selective logging [2].

We briefly review some past research in this area. "Land cover change detection using GIS and remote sensing techniques: A spatio-temporal study on Tanguar Haor, Sunamganj, Bangladesh" [3] uses a classification area model. "Automated unsupervised change detection technique from RGB color image" computes correlation coefficients between the color signatures of each pair of associated pixels from two satellite images of the same area [4]. A change detection model has to be insensitive to changes in illumination brightness [5]; for this reason, the proposed model uses high-resolution grayscale images to limit the effect of changes in light and brightness in satellite images when calculating spatial changes. "Change detection in the city of Hilla between 2007 and 2015 using remote sensing techniques" relies on ArcGIS 10.4 software, with processing that includes geometric correction, spectral enhancement, image classification, and cartographic output [6]; as mentioned, one advantage of our model is its ease of use compared with such specialized software, and this is the novelty of this research. Other work has studied patterns of vegetation change at the Angkor world heritage site by combining remote sensing results with local knowledge, based on extracting spectral plots of pixel values in the region of interest [7], and automatic change detection of buildings in an urban environment from very-high-spatial-resolution images using an existing geo-database and prior knowledge, based on an image segmentation model, where the ratio of detected area was 86-90% [8]. A survey of change detection methods based on remote sensing images for multi-source and multi-objective scenarios notes that spatial change recognition, as a remote sensing application, is significant in numerous change detection applications, as shown in the chart below (**Figure 1**) [9].

Another study, "Change detection of soil formation rate in space and time based on multi-source data and geospatial analysis techniques," estimated the dissolution rate and soil formation rate in karst areas of China and analyzed their spatial diversity [10]. In another study, change detection techniques based on multispectral images for investigating land cover dynamics were examined using image processing and mining [11].

#### **Figure 1.**

*Published literature statistics of urban change detection according to the keywords remote sensing and urban change detection in Web of Science (total of 1283 publications) [9].*


Another study, "Change detection techniques for remote sensing applications," surveys the distribution of change detection methods [12].

Another study used an image enhancement method to improve the accuracy of change detection in SAR<sup>1</sup> images [14]. In another study, change detection was performed based on a similarity measurement between heterogeneous images [15]. Another applied the maximum entropy principle to obtain the final change detection map and compared it with wavelet-based textural features, plain texture difference, image difference, and log-ratio methods [16].

#### **2. Materials and methods**

#### **2.1 Materials**

In this research, we use publicly available data, requiring no special software or specialist knowledge to collect the data, and test the proposed model with it. Google Earth is one of the most popular and widely available applications. Since the model is general and image quality is enhanced using image processing techniques, it is enough to give the model two images of exactly the same location taken at two different times; the model then displays and calculates their differences. As examples, we capture several locations around the world using the historical imagery (time change) feature in Google Earth and test them with the proposed model. A very important point is that the location specifications must be identical when capturing the images, which Google Earth supports: without changing the selected location, it is enough to move the timeline and capture two images of the same place at two different times. Examples of images acquired from Google Earth at a fixed location at two different times are shown below (**Figures 2**–**5**).

**Figure 2.** *A: Greenland 1930, B: Greenland 2021, source and specifications of images: NOAA, US Navy, NGA, GEBCO, Landsat, 74°22'50.50"N 45°02'01.17"W, elev 2811 m, height: 3601.47 km.*

<sup>1</sup> Synthetic-aperture radar (SAR): synthetic-aperture radar is a form of radar that is used to create twodimensional images or three-dimensional reconstructions of objects, such as landscapes [13].

#### **Figure 3.**

*A: Jumeirah Palm beach 2000, B: Jumeirah Palm beach 2000, source and specifications of images: NOAA, US Navy, NGA, GEBCO, Landsat, 25°06'42.80"N 55°03'43.93"E, elev 10 m, height: 38.26 km.*

#### **Figure 4.**

*A: Oroomiye Lake 1984, B: Oroomiye Lake 2017, source and specifications of images: NOAA, US Navy, NGA, GEBCO, Landsat, 37°08'58.85"N 45°12'31.05"E, elev 2318 m, height: 157.76 km.*

#### **Figure 5.**

*A: South Pole 1957, B: South Pole 2022, source and specifications of images: NOAA, US Navy, NGA, GEBCO, Landsat, 83°32'58.95"S 62°22'15.95"E, elev 3178 m, height: 7734.93 km.*

Similarly, any other location can be imaged and compared at two points in time. The proposed model has no limitations, provided the exact location specifications are kept and the camera viewpoint does not change in geographic coordinates or height.


#### **2.2 Methods**

The basic model we use for change recognition in images was previously used to detect changes in industrial parts, where imaging was performed locally by a camera using a type of micrography [17]. The input of the model is the images prepared as described in the previous section. The spatial change recognition model for remote sensing is shown in **Figure 6**.

#### *2.2.1 Description of model components*

#### *2.2.1.1 Acquisition of the first and second spatial image*

The source for acquiring satellite spatial images with historical coverage is the Google Earth application.

#### *2.2.1.2 Prepare spatial images data*

We used the MATLAB Image Processing Toolbox for the preprocessing and data preparation stage, which includes converting RGB images to grayscale and converting the intensity images to double precision, and we implemented the other parts of the model in MATLAB as well. The reason for converting images from RGB to grayscale is to reduce the data from three dimensions to two, simplifying the problem. MATLAB provides the function rgb2gray to convert RGB images to grayscale, which we used [17].
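As a minimal sketch of this preparation stage (the file names img1.png and img2.png are placeholders for the two Google Earth captures, not names used by the authors), the conversion can be done as follows:

```matlab
% Preparation stage: read two co-registered Google Earth captures,
% convert RGB to grayscale, and cast the intensities to double.
img1 = imread('img1.png');    % spatial image at time 1 (placeholder name)
img2 = imread('img2.png');    % spatial image at time 2 (placeholder name)

gray1 = rgb2gray(img1);       % 3-D RGB -> 2-D intensity image
gray2 = rgb2gray(img2);

I1 = im2double(gray1);        % intensity values rescaled to [0, 1]
I2 = im2double(gray2);
```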

#### *2.2.1.3 Edge detection from spatial images with different techniques*

In this part, three different methods are used for edge detection. Each has its own strengths and weaknesses and therefore produces different results; together they strengthen the model's edge recognition.

**Figure 6.** *Spatial change recognition model to remote sensing.*

#### *2.2.1.3.1 Fuzzy inference systems for edge detection*

Briefly, fuzzy conditions test the relative values of neighboring pixels to decide whether a pixel lies on an edge. An image is said to have an edge where the intensity variation between adjacent pixels is large. The mask used for scanning the image is shown in **Figure 7** [17].

$$G_x = \begin{bmatrix} -1 & 1 \end{bmatrix}, \quad G_y = G_x^{T}$$

The mask is slid over an area of the spatial image, changes that pixel's value, and then shifts one pixel to the right, continuing until it reaches the end of the row. It then starts at the beginning of the next row, and the process continues until the whole image has been scanned. As the mask slides over the image, the output is generated by the FIS based on the rules and the pixel values [17]. In summary, the steps for using a fuzzy inference system are as follows: a) the crisp spatial images are fuzzified into fuzzy sets with conventional crisp membership functions, i.e., black and white; b) the firing strength is calculated using fuzzy t-norm operators; c) the fuzzy rules are fired for each crisp spatial image; d) the aggregate output fuzzy set over all fired rules is obtained using the max operator (s-norm); e) defuzzification is performed using the centroid method; f) the crisp output is the pixel value of the output image, i.e., the one containing the edges and the black and white regions; g) the first derivative is applied to the FIS output image after a noise removal algorithm; h) further refinement is performed with the second derivative and noise removal [17].
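Steps (a) through (f) can be realized with MATLAB's Fuzzy Logic Toolbox Mamdani workflow. The sketch below follows that pattern on one prepared image; the membership-function parameters and file name are illustrative assumptions rather than the authors' tuned values, and the derivative refinement and noise removal of steps (g) and (h) are omitted.

```matlab
% Illustrative Mamdani FIS edge detector (steps a-f above).
I  = im2double(rgb2gray(imread('img1.png')));   % placeholder file name
Gx = [-1 1];  Gy = Gx';                         % scanning masks from above
Ix = conv2(I, Gx, 'same');                      % horizontal intensity change
Iy = conv2(I, Gy, 'same');                      % vertical intensity change

edgeFIS = mamfis('Name', 'edgeDetection');
edgeFIS = addInput(edgeFIS, [-1 1], 'Name', 'Ix');
edgeFIS = addInput(edgeFIS, [-1 1], 'Name', 'Iy');
edgeFIS = addMF(edgeFIS, 'Ix', 'gaussmf', [0.1 0], 'Name', 'zero');
edgeFIS = addMF(edgeFIS, 'Iy', 'gaussmf', [0.1 0], 'Name', 'zero');
edgeFIS = addOutput(edgeFIS, [0 1], 'Name', 'Iout');
edgeFIS = addMF(edgeFIS, 'Iout', 'trimf', [0.1 1 1], 'Name', 'white');  % uniform region
edgeFIS = addMF(edgeFIS, 'Iout', 'trimf', [0 0 0.7], 'Name', 'black');  % edge
edgeFIS = addRule(edgeFIS, [ ...
    "If Ix is zero and Iy is zero then Iout is white"; ...
    "If Ix is not zero or Iy is not zero then Iout is black"]);

Iedge = zeros(size(I));
for row = 1:size(I, 1)                          % evaluate the FIS row by row
    Iedge(row, :) = evalfis(edgeFIS, [Ix(row, :); Iy(row, :)]');
end
```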

#### *2.2.1.3.2 Sobel's operator to edge detection*

The Sobel operator, sometimes called the Sobel-Feldman operator or Sobel filter, is used in image processing and computer vision, particularly within edge detection algorithms, where it creates an image emphasizing edges [18]. Technically, it is a discrete differentiation operator that computes an approximation of the gradient of the image intensity function. At each point in the image, the result of the Sobel-Feldman operator is either the corresponding gradient vector or the norm of this vector. The operator is based on convolving the image with a small, separable, integer-valued filter in the horizontal and vertical directions. The arrangement of pixels around the pixel [i, j] is shown in **Table 1**. The Sobel operator computes the magnitude of the gradient as:

**Figure 7.** *Define FIS for edge detection (the first and second spatial image). (fuzzy inference system).*



#### **Table 1.**

*Masks used by Sobel's operator [18].*

$$M = \sqrt{S_X^{2} + S_Y^{2}}$$

$$S_X = (a_2 + c\,a_3 + a_4) - (a_0 + c\,a_1 + a_6)$$

with the constant c = 2.

Like the other gradient operators, Sx and Sy can be implemented using convolution masks:

#### *2.2.1.3.3 Prewitt's operator to edge detection*

Prewitt's operator uses the same equations as Sobel's operator, with constant c = 1 (**Table 2**) [19].
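As a sketch of this step (again with placeholder file names), both operators are available through the Image Processing Toolbox edge function, and the gradient magnitude can also be formed explicitly:

```matlab
% Edge maps for the two prepared images.
I1 = im2double(rgb2gray(imread('img1.png')));   % placeholder file names
I2 = im2double(rgb2gray(imread('img2.png')));

edgeSobel1   = edge(I1, 'sobel');               % Sobel operator (c = 2)
edgeSobel2   = edge(I2, 'sobel');
edgePrewitt1 = edge(I1, 'prewitt');             % Prewitt operator (c = 1)
edgePrewitt2 = edge(I2, 'prewitt');

% Explicit gradient magnitude M = sqrt(Sx^2 + Sy^2) with the Sobel masks.
sobelX = [-1 0 1; -2 0 2; -1 0 1];
Sx = conv2(I1, sobelX,  'same');
Sy = conv2(I1, sobelX', 'same');
M  = sqrt(Sx.^2 + Sy.^2);
```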

#### *2.2.1.4 Images comparison with different techniques and error reduction*

#### *2.2.1.4.1 The structural similarity index measure*

The SSIM<sup>2</sup> formula is based on three comparison measurements between the samples, namely the luminance term, the contrast term, and the structural term. The overall index is a multiplicative combination of the three terms [20].

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = [l(\mathbf{x}, \mathbf{y})]^{\alpha} \cdot [c(\mathbf{x}, \mathbf{y})]^{\beta} \cdot [s(\mathbf{x}, \mathbf{y})]^{\gamma}$$

$$l(\mathbf{x}, \mathbf{y}) = \frac{2\mu_x \mu_y + C_1}{\mu_x^{2} + \mu_y^{2} + C_1}, \quad c(\mathbf{x}, \mathbf{y}) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^{2} + \sigma_y^{2} + C_2}, \quad s(\mathbf{x}, \mathbf{y}) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$


**Table 2.**

*Masks used by Prewitt gradient operator.*

<sup>2</sup> Structural Similarity Index measure.

where μx, μy, *σ*x, *σ*y, and *σ*xy are the local means, standard deviations, and cross-covariance of images x and y. If α = β = γ = 1 (the default exponents) and C3 = C2/2 (the default choice of C3), the index simplifies to:

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)}$$

The structural dissimilarity is then 1 − SSIM(x, y).

SSIM measures the perceptual difference between two similar images. It cannot judge which of the two is better: that must be inferred from knowing which is the original and which has been subjected to additional processing such as data compression.
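A sketch of this comparison step, assuming the two prepared double grayscale images and treating the time-1 image as the reference:

```matlab
% Structural similarity and dissimilarity between the two prepared images.
I1 = im2double(rgb2gray(imread('img1.png')));   % placeholder file names
I2 = im2double(rgb2gray(imread('img2.png')));

[ssimVal, ssimMap] = ssim(I2, I1);   % default exponents alpha = beta = gamma = 1
dissim = 1 - ssimVal;                % structural dissimilarity, as defined above
```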

#### *2.2.1.4.2 Spatial image subtracts*

Each grayscale image is a matrix of intensity values (0–255), so we can subtract the two matrices to compare the difference between the two images. Here we used the MATLAB Image Processing Toolbox command Z = imsubtract(img1, img2).

#### *2.2.1.4.3 Absolute difference between the two spatial images*

Another way to compare the differences between two spatial images is defined as the sum of the absolute difference at each pixel. The difference value is defined as

$$D(t) = \sum_{i=0}^{M} \left| I_{t-T}(i) - I_{t}(i) \right|^{2}$$

Where "M" is the resolution or number of pixels in the image. This method for image difference is noisy and extremely sensitive to camera motion and image degradation. When applied to sub-regions of the image, D (t) is less noisy and may be used as a more reliable parameter for image difference.

$$D_s(t) = \sum_{j=s}^{H/n} \sum_{i=s}^{W/n} \left| I_{t-T}(i, j) - I_{t}(i, j) \right|^{2}$$

Ds(t) is the sum of the absolute differences in a sub-region of the image, where s is the starting position of a particular region and n is the number of sub-regions [21]. The MATLAB Image Processing Toolbox provides a function to compare two images:

Z = imabsdiff(img1, img2)
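A sketch combining the subtraction and absolute-difference comparisons on the prepared images (placeholder file names as before):

```matlab
% Pixel-wise differencing of the two prepared images.
I1 = im2double(rgb2gray(imread('img1.png')));   % placeholder file names
I2 = im2double(rgb2gray(imread('img2.png')));

Zsub = imsubtract(I1, I2);       % signed difference matrix (Section 2.2.1.4.2)
Zabs = imabsdiff(I1, I2);        % absolute difference at each pixel
D    = sum(Zabs(:).^2);          % global difference D(t), as in the formula above
```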

#### *2.2.1.4.4 Histogram comparison*

In this part of the model, the histograms of the two spatial images are drawn and compared in a plot that shows the distribution of gray levels between 0 (black) and 255 (white) in the spatial images.
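A sketch of the histogram comparison, using the uint8 grayscale images so that the 256 bins run from black (0) to white (255); file names are placeholders:

```matlab
% Histogram comparison of the two grayscale images.
gray1 = rgb2gray(imread('img1.png'));   % placeholder file names, uint8 images
gray2 = rgb2gray(imread('img2.png'));

h1 = imhist(gray1);                     % 256-bin histogram, time 1
h2 = imhist(gray2);                     % 256-bin histogram, time 2

figure;
plot(0:255, h1, 'b', 0:255, h2, 'r');
legend('time 1', 'time 2');
xlabel('Gray level'); ylabel('Pixel count');
```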

#### *2.2.1.4.5 Error calculation and reduction*

To check the error rate of the model, two images taken at exactly the same place and time in Google Earth are compared with the model.


#### **Figure 8.**

*Calculate the error rate with an accuracy of 4 decimal places, image description according to Table 3.*

Therefore, ideally, the model should show no difference. In other words, since the two captured images are exactly the same, the difference calculated by the model must be exactly zero. The error is calculated with an accuracy of four decimal places. A comparison of two identical images is shown below (**Figure 8**):

#### *2.2.1.5 Information integration*

This part acts as a single system in which information must be shared across all functional areas, and the collected information is integrated [22].

#### *2.2.1.6 Spatial changes recognition*

We use the following method to calculate the difference between the two spatial images.

$$\text{Percentage difference} = (1 - \mathrm{SSIM}) \times 100 - \text{Error}$$

We also provide a complete map showing the spatial changes together with the calculated percentage of change.
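A minimal sketch of this final step, where ssimVal is the structural similarity computed above and errRate is the residual error measured on two identical captures (ideally zero); file names remain placeholders:

```matlab
% Percentage of spatial change, following the formula above.
I1 = im2double(rgb2gray(imread('img1.png')));   % placeholder file names
I2 = im2double(rgb2gray(imread('img2.png')));

ssimVal = ssim(I2, I1);
errRate = 0;                                    % residual error (Section 2.2.1.4.5)
pctChange = (1 - ssimVal) * 100 - errRate;
fprintf('Spatial change: %.4f %%\n', pctChange);
```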

#### **3. Result**

In this section, after introducing the materials and methods, we run the proposed model to obtain the results. The comparative general map used in the results is described in **Table 3**.


**Table 3.**

*Description of the comparative general map in results.*

**Experiment 1:** See **Figures 9**–**13**. **Experiment 2:** See **Figures 14**–**18**. **Experiment 3:** See **Figures 19**–**23**. **Experiment 4:** See **Figures 24**–**28**.

### **4. Discussion**

Spatial analysis in GIS is carried out by experts in the field using specialized software such as ArcGIS, and the recognition of environmental

**Figure 9.** *A: Greenland 1930, B: Greenland 2021, input spatial images to the model.*


**Figure 10.**

*Comparative general map, the description of the components is as shown in Table 3.*

**Figure 11.**

*Comparison of histograms of two temporal spatial images.*

**Figure 12.** *a: Bar histogram of spatial image in time 1, b: Bar histogram of spatial image in time 2, c: Histogram comparison.*

**Figure 13.**

*Histogram comparison, above axis: the spatial image in time 2 and below: the spatial image in time 1.*

**Figure 14.** *a: Jumeirah Palm beach 2000, b: Jumeirah Palm beach 2000.*


**Figure 15.**

*Comparative general map, the description of the components is as shown in Table 3.*

**Figure 16.**

*Comparison of histograms of two temporal spatial images.*

**Figure 17.** *a: Bar histogram of spatial image in time 1, b: Bar histogram of spatial image in time 2, c: Histogram comparison.*

changes is of special importance to them. In preparing the image data according to the rules of data mining, the only change between the captured images should be time; the other variables of the chosen location, including its geographic characteristics and the camera height above the ground, must remain constant, otherwise a calculation error will occur. When this important point is strictly observed, the system error will be zero; in other words, we will have 100% accuracy. The test results are also identical regardless of how many repetitions are performed, which supports the validity of the model.

Comparison of the segmentation approach with the proposed model, advantages and disadvantages:

Segmentation of spatial images is an important task for detecting changes. Segmentation must not allow regions of the image to overlap. Thresholding is one of the oldest methods used for image segmentation. It is based on the gray-level intensity

**Figure 18.**

*Histogram comparison, above axis: the spatial image in time 2 and below: the spatial image in time 1.*


**Figure 19.** *a: Oroomiye Lake1984, b: Oroomiye Lake 2017.*

**Figure 20.** *Comparative general map, the description of the components is as shown in Table 3.*

**Figure 21.** *Comparison of histograms of two temporal spatial images.*

value of pixels, as summarized by the histogram of the image. Atlas-based methods are conceptually similar to classifiers, except that they are implemented in the spatial domain of an image rather than in a feature space; they treat segmentation as a registration process. Some researchers have used atlases not only to impose spatial constraints but also to provide probabilistic information about the tissue model. The advantage is that such methods can segment an image with no well-defined relation between regions and pixels. K-means is a clustering method that partitions n points into k clusters, with each pixel belonging to one cluster, by minimizing an objective function so that the within-cluster sum of squares is minimized. It starts with k clusters, and each pixel is

**Figure 22.** *a: Bar histogram of spatial image in time 1, b: Bar histogram of spatial image in time 2, c: Histogram comparison.*


**Figure 23.** *Histogram comparison, above axis: the spatial image in time 2 and below: the spatial image in time 1.*

assigned to one cluster. The limitation of the K-means algorithm is that its computational time increases when applied to large amounts of data; our proposed model is independent of clustering and is therefore faster than the K-means algorithm.

#### **5. Implication**

In the proposed model, using machine vision, image data processing, fuzzy mathematical techniques, known masks, and the historical imagery available in Google Earth, we can detect changes between images and measure them. This software could be added to Google Earth as an extension, allowing users to easily view spatial changes over time.

**Figure 25.** *Comparative general map, the description of the components is as shown in Table 3.*

**Figure 26.**

*Comparison of histograms of two temporal spatial images.*


**Figure 27.** *a: Bar histogram of spatial image in time 1, b: Bar histogram of spatial image in time 2, c: Histogram comparison.*

**Figure 28.** *Histogram comparison, above axis: the spatial image in time 2 and below: the spatial image in time 1.*

#### **6. Conclusion**

The change recognition model presented earlier by the author [17] was also used here for spatial change recognition. The most important difference between the previous model, which recognized changes in industrial parts, and the current model lies in how the images are captured and prepared. Here the images are taken from Google Earth and no filter is used to prepare them, because the images are of sufficient quality and adding any filter would introduce a computational error into the model (this was tested many times by the authors); in the base model, images were captured with a local camera, which introduced error, and a macrographic imaging technique was used [17].

#### **Author details**

Majid Mirbod Department of Industrial Management, Tehran North Branch, Islamic Azad University, Tehran, Iran

\*Address all correspondence to: mjmirbod@yahoo.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Khurana M, Saxena V. Soft computing techniques for change detection in remotely sensed images: A review. International Journal of Computer Science Issues. 2015;**12**(2)

[2] Bruzzone L, Bovolo F. A novel framework for the design of change-detection systems for very-high-resolution remote sensing images. IEEE. 2013;**101**(3)

[3] Inzamul Haque M, Basak R. Land cover change detection using GIS and remote sensing techniques: A spatiotemporal study on Tanguar Haor, Sunamganj, Bangladesh. The Egyptian Journal of Remote Sensing and Space Sciences. 2017;**20**(2):251-263. DOI: 10.1016/j.ejrs.2016.12.003

[4] Gomaa M, Hamza E, Elhifnawy H. Automated unsupervised change detection technique from RGB color image. Materials Science and Engineering. 2019;**610**:012046. DOI: 10.1088/1757-899X/610/1/012046

[5] Fisher R. Change detection in color images. In: Proceedings of 7th IEEE Conference on Computer Vision and Pattern. Mathematics. Citeseer. 1999

[6] Kadhum ZM, Jasim BS, Obaid MK. Change detection in city of Hilla during period of 2007-2015 using remote sensing techniques. Materials Science and Engineering. 2020;**737**:012228. DOI: 10.1088/1757-899X/737/1/012228

[7] Wales N, Murphy RJ, Bruce E. Understanding patterns of vegetation change at the Angkor world heritage site by combining remote sensing results with local knowledge. International Journal of Remote Sensing. 2021;**42**(2). DOI: 10.1080/01431161.2020.1809739

[8] Bouziani M, Goïta K, He D-C. Automatic change detection of buildings in urban environment from very high spatial resolution images using existing geodatabase and prior knowledge. ISPRS Journal of Photogrammetry and Remote Sensing. 2010;**65**:143-153. DOI: 10.1016/j.isprsjprs.2009.10.002

[9] You Y, Cao J, Zhou W. A survey of change detection methods based on remote sensing images for multi-source and multi-objective scenarios. Remote Sensing. 2020;**12**(15):2460. DOI: 10.3390/rs12152460

[10] Li Q, Wang S, Bai X, Luo G, Song X, Tian Y, et al. Change detection of soil formation rate in space and time based on multi-source data and geospatial analysis techniques. Remote Sensing. 2020;**12**:121. DOI: 10.3390/rs12010121

[11] Panuju DR, Paull DJ, Gri AL. Change detection techniques based on multispectral images for investigating land cover dynamics. Remote Sensing. 2020;**12**:1781. DOI: 10.3390/rs12111781

[12] Asokan A, Anitha J. Change detection techniques for remote sensing applications: A survey. Earth Science Informatics. 2019;**12**:143-160. DOI: 10.1007/s12145-019-00380-5

[13] Kirscht M, Rinke C. 3D reconstruction of buildings and vegetation from synthetic aperture radar (SAR) images. MVA. 1998

[14] Lia Z, Jia Z, Liu L, Yang J, Kasabovc N. A method to improve the accuracy of SAR image change detection by using an image enhancement method. ISPRS Journal of Photogrammetry and Remote Sensing. 2020;**163**:137-151. ISSN: 0924-2716. DOI: 10.1016/j.isprsjprs.2020.03.002

[15] Sun Y, Lei L, Li X, Sun H, Kuang G. Nonlocal patch similarity based heterogeneous remote sensing change detection. DOI: 10.1016/j.patcog.2020.107598

[16] Ansari RA, Buddhiraju KM, Malhotra R. Urban change detection analysis utilizing multiresolution texture features from polarimetric SAR images. Remote Sensing Applications: Society and Environment. DOI: 10.1016/j.rsase.2020.100418

[17] Mirbod M, Ghatari AR, Saati S, Shoar M. Industrial parts change recognition model using machine vision, image processing in the framework of industrial information integration. Journal of Industrial Information Integration. 2022;**26**:100277. DOI: 10.1016/j.jii.2021.100277. ISSN: 2452-414X

[18] Kanopoulos N et al. Design of an Image Edge Detection Filter using the Sobel operator. Journal of Solid-State Circuits, IEEE. 1988;**23**(2):358-367

[19] Seif A et al. A hardware architecture of Prewitt edge detection. In: Sustainable Utilization and Development in Engineering and Technology (STUDENT), 2010 IEEE Conference. Computer Science. Malaysia; 2010. pp. 99-101

[20] Zhou W, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;**13**(4):600-612

[21] Xiong Z, Huang TS. The Essential Guide to Video Processing. Texas, USA: Department of Electrical and Computer Engineering, The University of Texas at Austin; 2009

[22] Xu L. Enterprise Integration and Information Architecture: A Systems Perspective on Industrial Information Integration. Auerbach Publications; 2014. p. 446. ISBN: 9781439850244

#### **Chapter 8**

## Methods for Real-time Emotional Gait Data Collection Induced by Smart Glasses in a Non-straight Walking Path

*Nitchan Jianwattanapaisarn, Kaoru Sumi and Akira Utsumi*

#### **Abstract**

Emotion recognition is an attractive research field because of its usefulness. Most methods for detecting and analyzing emotions depend on facial features, so close-up facial information is required. Unfortunately, high-resolution facial information is difficult to capture with a standard security camera. Unlike facial features, gaits and postures can be obtained noninvasively from a distance. We propose a method to collect emotional gait data with real-time emotion induction. Two gait datasets comprising a total of 72 participants were collected. Each participant walked in a circular pattern while watching emotion-induction videos shown on Microsoft HoloLens 2 smart glasses. An OptiTrack motion capture system was used to capture the participants' gaits and postures. The effectiveness of emotion induction was evaluated using a self-reported emotion questionnaire. In our second dataset, additional information about each subject, such as dominant hand, dominant foot, and dominant brain side, was also collected. These data can be used for further analyses. To the best of our knowledge, an emotion induction method that shows videos to subjects while they walk has never been used in other studies. Our proposed method and datasets have the potential to advance research on emotion recognition and analysis, which can be used in real-world applications.

**Keywords:** emotion induction, emotion recognition, gait analysis, motion capturing, smart glasses, non-straight walking behavior, emotional movies, watching video while walking

#### **1. Introduction**

Intelligent video surveillance research attracts a great deal of public interest. In this study, an example application was developed to show the potential of monitoring human behavior from movement. The authors have conducted several studies analyzing the characteristics of individuals. To support research on the recognition of human emotions, the authors propose a research environment and method in which human emotions can be changed in real time using

video stimuli, and experiments on emotion recognition can be performed using our proposed environment and method. Recognizing human emotions is useful in many circumstances, for example, improving human-robot interaction, detecting suspicious behavior for crime and altercation prevention, evaluating customer satisfaction, and assessing student engagement. These are some examples of applications that can improve people's quality of life.

Affective computing [1] is a research field that emerged from the popularity of emotion analysis research; it attempts to enable computers to understand and generate human-like affect. Many studies related to affective computing have been proposed in recent years. A good example is an online exercise program for students to practice their programming skills. An affective computing technique was applied to the program by analyzing students' emotions as well as their performance in each task. An animated agent then interacts with the students during the exercise. This method can improve students' experience and performance at the same time [2].

Another good example concerns surveillance and security, which is related to the intelligent video surveillance topic of this book. A survey conducted by [3] shows that gait analysis is very useful for crime prevention applications. CCTV cameras are now standard equipment installed in almost every public place. Human gaits can be analyzed in a very short time thanks to advances in computer vision and machine learning together with on-board computation devices, so suspicious behaviors can be detected promptly. According to [4], smart video surveillance benefits many applications of gait analysis, such as human identification, human re-identification, and forensic analysis, because human gaits can be obtained from far away without the subjects' awareness or cooperation.

Regarding emotion recognition and prediction, in the past these tasks were performed by human observers [5]. Unfortunately, using human observers to judge the emotions of other people is time consuming, and human judges are not consistent enough to be relied on in practice. As a result, automatic emotion recognition methods have been developed. Most publicly available methods today rely on facial expressions. Facial features perform very well in some situations, but they still have limitations. When facial images or videos must be captured in a crowded and noisy environment, it is difficult to obtain high-quality facial features because a standard security camera cannot perform well enough, particularly when the subject is not facing the camera. Moreover, some subjects wear eyeglasses or sunglasses, or have beards or mustaches, which also prevents emotion analysis from facial features from being performed effectively. Therefore, when facial features cannot be clearly captured, other features should be used instead to make emotion recognition and analysis more practical for real-world use.

Gaits and postures are the ways the human body moves and poses while walking. This kind of expression can be observed from a distance without the subject's awareness, and it can be captured without the need for high-resolution images or videos. Thus, gaits and postures are very good human expressions to use for emotion recognition and analysis. Several applications can be


performed effectively and accurately using human gaits and postures, such as human identification [6, 7], human re-identification [8], human age estimation, and gender recognition [9, 10]. The many studies already proposed show that human gait and posture are very appropriate features for predicting and recognizing human emotions [5, 11–20].

The objective of this study is to propose a method for emotional gait data collection with a novel way of inducing subjects' emotions while they walk along a non-straight path, an unconventional walking pattern not found in other studies. We propose a method and environment for collecting gait data under different emotions. Microsoft HoloLens 2 smart glasses were used to display the emotion-induction videos to participants while they walked, and an OptiTrack motion capture system was used to record their walking data. Although we used OptiTrack, a marker-based device, to record body movements during walking, marker-less motion capture devices such as Microsoft Kinect or Intel RealSense could be used instead. Likewise, thanks to advances in pose-estimation software such as OpenPose or wrnch AI, any video camera can be used to capture human gaits for gait analysis.

#### **2. Related work**

Many studies on emotion recognition have been proposed in recent years because of their usefulness. Most were performed using facial features, which are sufficiently accurate in some situations; however, facial features still have the limitations discussed in the previous section. Although there are many studies that use human gaits and postures as features for emotion recognition, they remain fewer than those using facial features.

A survey conducted by [21] investigated several studies on gait analysis, not only for emotion recognition but also for human identification. They found that the characteristics of human walking differ under different emotions, information that can be used to develop automatic emotion recognition. Compared with other biometrics, such as speech, facial, and physiological features, gaits have many advantages: they can be observed from afar without the subject's awareness, they are very difficult to imitate, and the subject's cooperation is not required. Hence, gaits are very powerful expressions for automatic emotion recognition. Here we mention only the equipment that can be used to collect gait data and the results showing the effectiveness of emotion prediction from gaits. According to this survey, several devices can be used to capture gait data: a force plate can record velocity and pressure data [11]; velocity data can be recorded well with an infrared light barrier system [11, 22]; motion capture devices such as Vicon are a good tool for recording the coordinates of body parts using markers attached to the subject's body [12–15, 23–25]; wearable devices such as smart watches equipped with accelerometers, as well as smart phones, can record body movement data for gait analysis [18–20]; and Microsoft Kinect is an effective tool for recording the human skeleton without markers attached to the subject's body [6–9, 16, 17, 26]. Some findings useful for future studies are as follows. When subjects feel happy, they step faster [5] and their strides are longer [27]; their joint angle amplitudes [27] and arm movements [12] also increase.

When subjects feel sad, their arm swing decreases [5], their limb and torso shapes are contracted [24], and their joint amplitude also decreases [12].

Nowadays, there are many studies on gait analysis. Examples include emotion prediction [11, 21], mental illness prediction [22, 23], human identification or re-identification [6–8], and gender prediction [9, 10]. Several tools can be used for gait data collection, as already mentioned, for example a light barrier, force plate, video camera, accelerometer, or motion capture system. Among these, we focus on equipment that captures the coordinates of body parts or silhouette images of the human body, since these gait features are sensitive to walking direction. A straight walking direction usually yields high-quality gait data [11, 12, 14–17, 19, 22, 23, 26, 28–30], so most studies use it. Fewer works use a free-style walking pattern, in which subjects can choose any walking path they want [6–9]. Results obtained with free-style walking data are often lower than with straight walking data, but free-style walking increases the opportunity to deploy the proposed methods in reality, since people walking in public spaces are unaware of being observed and their walking pattern cannot be constrained to a straight path. In other words, collecting straight-walking gait data in a real-world environment is more difficult than collecting random-direction walking data.

In this study, we decided to use the latest smart glasses technology, Microsoft HoloLens 2, to display emotional videos to subjects while they walk. We were therefore concerned about issues such as the interference of smart glasses with human gait during walking and negative effects such as trips and slips. Some studies have addressed this topic and are useful for our study. For example, [31] investigated gait performance while subjects used a head-worn display during walking. An experiment with 12 subjects checked whether the subjects could walk normally under different conditions, assessing several factors: walking speed and obstacle-crossing speed, required coefficient of friction, minimum foot clearance, and foot placement location around an obstacle. They found that using a head-worn display to perform tasks while walking had no effect on level walking performance compared with using a paper list or with baseline walking using nothing. For the obstacle-crossing experiment, they found that subjects chose a more cautious and conservative strategy to cross the obstacle when using the head-worn display, and obstacle-crossing speed decreased by 3% compared with baseline walking. Using a head-worn display did not affect foot placement location around the obstacle.

Other useful studies that investigated the negative effects of head-worn displays on human gait are [32, 33]. They performed experiments to find the adverse effects of using a head-worn display while walking. Twenty subjects (10 men and 10 women) walked on a treadmill under four conditions: one single-task walk (walking while doing nothing) and three dual-task walks (walking while performing attention-demanding tasks). The dual-task walks used different display types, paper-based, smart phone, and smart glasses, to present information to the subjects while they walked. The attention-demanding tasks included a Stroop test, categorizing, and arithmetic. The subjects adopted a head-down posture while performing tasks on the paper-based display and on the smart phone; in single-task walking and in dual-task walking with smart glasses, they used a head-up posture. A Vicon motion capture system with seven cameras was used in their experiments. The results of their studies reveal that walking while


using smart glasses to perform attention-demanding tasks has more impact on gait performance, for example gait stability, than walking while performing attention-demanding tasks on other display types. The important finding from their studies is that subjects are more unstable when they use a smart phone or a paper-based display to perform tasks while walking than when they use smart glasses. This means that head-up and head-down postures affect human gait.

This review of related work confirmed that the Microsoft HoloLens 2 can be used for displaying videos, since subjects can keep a head-up posture while walking and can still see the room environment while watching videos, because the HoloLens 2 display is transparent. Although there could be some negative impacts, such as on walking stability or obstacle-crossing strategy, we cope with these issues by asking our subjects to take rehearsal walks to become familiar with using the HoloLens while walking and with the walking area before performing the actual recorded walks. As for obstacles, our walking space is completely clear, so there should be no problem with using the HoloLens 2 while walking.

#### **3. Data collection**

The gait data collection method described in this study was proposed by us in [34] and is as follows. Most studies on emotion recognition and analysis using gaits and postures ask subjects to walk in a straight line on a pathway or treadmill; walking in a straight line yields cleaner gait data, but it is more difficult to implement in reality. For emotion induction, several techniques are widely used. First, subjects are asked to walk while recalling personal experiences matching an assigned emotion. Second, the subjects are not ordinary people but professional actors. Third, subjects watch an emotional video on a conventional screen, such as a television or computer display, before they start walking.

With these settings, several problems can occur. In the first method, subjects may not be able to recall their memories well enough to express the desired emotions in their gaits and body movements. In the second method, using professional actors instead of ordinary people can make the gaits exaggerated and unnatural. In the last method, the induced emotion may not last until the end of the walk, because the video stimulus ends before the subject starts walking. These issues can make the collected gait data reflect human emotions incorrectly, so the relationship between the collected gaits and emotions will be inaccurate.

To solve this problem, our experiments were designed so that subjects watch an emotion-induction video and walk at the same time, recording the subject's real-time emotion. Since the latest smart glasses technology, Microsoft HoloLens 2, is available for consumer use, we decided to use the HoloLens 2 to display the emotional videos to subjects while they walk. With this method, subjects watch the stimuli as they walk, so their emotions are constantly and consistently induced. To the best of our knowledge, no other researcher has used this method before. Because of the transparent display of the HoloLens 2, subjects can see the walking space and the room environment while they walk. We also expected that showing videos during walking is closer to real life, where people walk and see situations that change their emotions in real time; this emotion induction method is intended to simulate the subject's real-time emotion. Also, because the videos are shown during walking, we can ensure that the induced emotions are more stable, more consistent, and last until the end of the walk.

For the walking direction, because our subjects have to watch the emotion-induction videos and walk at the same time, allowing them to walk completely freely without any path guidance could be too difficult. Subjects need to concentrate on the content of the videos; if they also had to choose their walking path, they might not be able to focus on the videos well enough, and the emotion induction would be less effective. Consequently, we asked the subjects to walk in a circular pattern without a guidance line on the floor. Subjects could walk in a loose circular path, clockwise or counter-clockwise, depending on their own preference, and the path could be oval or rounded-rectangle shaped if they wished. With this walking direction, we can collect both straight and non-straight walking in a single walking trial.

#### **3.1 Equipment for data collection**

Motion capture devices can be categorized into two main types. The first is the marker-less type, which is easy to set up and requires nothing to be attached to the subject's body. The second is the marker-based type, which requires several markers to be attached to the subject's body. The difference is that the marker-less type uses image processing and machine learning to predict the positions of body parts from the depth and color images captured by built-in cameras, while the marker-based type requires several cameras to be installed and computes the actual position of each marker from the infrared light reflections captured by all cameras; that is, each marker's three-dimensional coordinate is reconstructed from the data of all cameras. This makes marker-based motion capture more accurate but also more difficult to set up, whereas a marker-less device such as Microsoft Kinect is much easier to set up and use in any situation.

In this study, we decided to use OptiTrack, a well-known marker-based motion capture system, to capture human gaits. Fourteen OptiTrack Flex 3 cameras were installed around the recording space, and the OptiTrack Entertainment Baseline markerset, consisting of 37 markers, was used in our experiments. **Table 1** lists all marker names, and **Figure 1** shows the position of each marker on the human body.

#### **3.2 Recording environment**

Black tape was used to mark a rectangular walking area on the floor, as shown in **Figure 2**. Inside the rectangle is the area that OptiTrack can capture. The size of this walking space is 2.9 by 3.64 meters. Fourteen OptiTrack Flex 3 cameras were installed on seven camera stands, and each stand was placed around the walking space as illustrated in **Figure 3**. In other words, each camera stand holds two OptiTrack Flex 3 cameras installed at different heights, one higher and one lower, as shown in **Figure 4**.

#### **3.3 Materials for data collection**

Three videos were selected as stimuli to induce the subjects' emotions. The HoloLens 2 was used to display these videos to each subject while he or she walked in a circular pattern in the recording area.



**Table 1.**

*List of OptiTrack baseline markers.*

#### **Figure 1.**

*Position of each marker on the body (Human Figure Source: https://sketchfab.com/3d-models/man-5ae6bd 9271ac4ee4905b96e5458f435d).*

#### **Figure 2.**

*Rectangle walking area marked with black tape on the floor.*


The neutral video was selected from nature landscape videos on YouTube and should not induce any emotion. The positive video (for inducing a happy emotion) and

**Figure 3.** *Position of each camera and dimension of the walking area.*

**Figure 4.** *Two OptiTrack Flex 3 cameras installed on each camera stand at different height levels.*

the negative video (for inducing a sad emotion) were selected from a publicly annotated movie database named *LIRIS-ACCEDE* (https://liris-accede.ec-lyon.fr/). This database was published by [35]. It consists of several movies and their emotion annotations in the valence-arousal dimension. All movies in this database are published under Creative Commons licenses. In our study, we selected two movies from the *Continuous LIRIS-ACCEDE collection*. We found that most movies contain both positive and negative valence; since we would like an entire walking trial to contain only one emotion, we selected one movie with only positive valence and another with only negative valence annotations. As each subject walks from when the movie starts until it ends, the selected movies must not be too long; in our opinion, less than 15 minutes is acceptable. The lengths of the neutral video, negative movie, and positive movie are 5:04, 13:10, and 12:14 minutes, respectively. Sample plots of valence scores for annotated movies are shown in **Figure 5**, and plots for the negative and positive movies we used are shown in **Figure 6**. The neutral video has no sound at all, to ensure that it does not induce any emotion. The positive and negative videos contain music, sound effects, and conversation in English. Subjects hear the audio from the stereo speakers built into the HoloLens 2.

#### **3.4 Methods for data collection**

Before participating in our experiments, participants were asked to answer a health questionnaire and sign a consent form. The questions in the health questionnaire are as follows.


**Figure 5.** *Valence plots of sample movies.*


**Figure 6.** *Valence plots of negative movie (parafundit) and positive movie (tears of steel) we selected.*


6. If you have any problem with your health condition, please describe it.

According to this questionnaire, any subject who had a health issue could be excluded from participation. However, in this study, all subjects confirmed that they were healthy.

For the first dataset we proposed in [1], only the health questionnaire listed above was used. For the second dataset, proposed in this study, additional questions were added to determine each subject's dominant hand, dominant foot, and dominant brain side.

The dominant hand of each subject was determined using a modified version of the *Flinders Handedness survey questions* published by the *Left Handers Association of Japan*, available online at https://lefthandedlife.net/faq003.html. All questions were translated into Japanese and adapted to make them more appropriate for Japanese culture. The dominant-hand questions and their English translations are as follows.

1.文字を書くとき、どちらの手でペン(筆記具)を持ちますか?

When writing, which hand do you hold a pen (writing instrument)?

2.食事をするとき、どちらの手でスプーンを持ちますか?

When you eat, which hand do you hold the spoon?

3.歯を磨くとき、どちらの手で歯ブラシを持ちますか?

When brushing your teeth, which hand do you hold your toothbrush?

4.マッチを擦るとき、どちらの手でマッチ棒を持ちますか?

When you rub a match, which hand do you hold the matchstick with?

5.消しゴムで文字や図画を消すとき、どちらの手で消しゴムを持ちますか? When erasing letters and drawings with an eraser, which hand do you hold the eraser?

6.お裁縫をするとき、どちらの手で縫い針を持ちますか?

When sewing, which hand do you hold the sewing needle?

7.食卓でパンにバターを塗るとき、どちらの手でナイフを持ちますか?

When you put butter on bread at the table, which hand do you hold the knife?

8.釘を打つとき、どちらの手で金づち(ハンマー)を持ちますか?

When hammering a nail, in which hand do you hold the hammer?

9.ジャガイモやりんごの皮をむくとき、どちらの手でピーラー (皮むき器) を持ちますか?

When peeling potatoes or apples, in which hand do you hold the peeler?

10.絵を描くとき、どちらの手で絵筆やペンを持ちますか?

When drawing, in which hand do you hold the paintbrush or pen?

In each question, subjects can choose left hand, right hand, or both hands. The score for each question is −1, +1, and 0 for left hand, right hand, and both hands, respectively. The total score over all questions was calculated for each subject to determine the dominant hand. If the total score is −10 to −5, the subject is classified as left-handed; if the total score is −4 to +4, the subject is classified as both-handed; and if the total score is +5 to +10, the subject is right-handed.
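For clarity, a minimal sketch of this scoring rule is given below. The function name and answer encoding are illustrative; the thresholds simply restate those described above.

```python
def classify_handedness(answers):
    """Classify handedness from the 10-item questionnaire.

    `answers` is a list of 10 strings: "left", "right", or "both".
    Scoring follows the text: left = -1, right = +1, both = 0.
    """
    score_map = {"left": -1, "right": +1, "both": 0}
    total = sum(score_map[a] for a in answers)
    if -10 <= total <= -5:
        return total, "left-handed"
    if -4 <= total <= 4:
        return total, "both-handed"
    return total, "right-handed"   # +5 to +10

# Example: 7 right, 2 left, 1 both -> total +5 -> right-handed
print(classify_handedness(["right"] * 7 + ["left"] * 2 + ["both"]))
```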

Additionally, another questionnaire was used to check the dominant foot of each subject. The dominant foot was determined using Chapman et al.'s foot dominance test questions, translated into Japanese. The questions are available in Japanese at https://blog.goo.ne.jp/lefty-yasuo/e/37149f8d3105e9b43aa58c5925024915. The questions in Japanese and their English translations are as follows.

1.サッカーボールを蹴る

Which foot do you use to kick a soccer ball?

2.缶を踏みつける

Which foot do you use to stomp on a can?


3.ゴルフボールを迷路に沿って転がす

Which foot do you use to roll a golf ball along a maze?

4.砂に足で文字を書く

Which foot do you use to write letters in the sand?

5.砂地をならす

Which foot do you use to smooth the sand?

6.小石を足で並べる

Which foot do you use to arrange the pebbles?

7.足先に棒を立てる

Which foot do you use to balance a stick on your toes?

8.ゴルフボールを円に沿って転がす

Which foot do you use to roll the golf ball along the circle?

9.片足跳びをできるだけ速くする

Which foot do you use to hop on one leg as fast as possible?

10.できるだけ高く足を蹴上げる

Which foot do you use to kick as high as you can?

11.足先でこつこつリズムをとる

Which foot do you use to tap a rhythm with your toes?

The dominant foot was judged from the total score. An answer of left foot scores 3 points, right foot scores 1 point, and both feet scores 2 points. If the total score is 28 points or more, the subject was judged as left-footed; if the total score is less than 28 points, the subject was classified as right-footed.
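A corresponding sketch for the foot dominance scoring follows; again, the answer encoding is illustrative and the point values and threshold are those stated above.

```python
def classify_footedness(answers):
    """Classify foot dominance from the 11-item questionnaire.

    `answers` is a list of 11 strings: "left", "right", or "both".
    Scoring follows the text: left = 3, right = 1, both = 2 points;
    a total of 28 or more is judged left-footed, otherwise right-footed.
    """
    points = {"left": 3, "right": 1, "both": 2}
    total = sum(points[a] for a in answers)
    return total, ("left-footed" if total >= 28 else "right-footed")

# Example: all 11 answers "both" -> 22 points -> right-footed
print(classify_footedness(["both"] * 11))
```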

Another questionnaire was used to check the dominant brain side. There are many dominant-brain test questions available; in this study, we selected the arm-folding and hand-folding questions for testing the dominant brain side. There are two questions in this questionnaire, and subjects selected the picture that matched them for each question. The questions and pictures are from https://www.lettuceclub.net/news/article/194896/. Both questions are shown in Japanese and English as follows.

1.自然に腕を組んでください。どのようになりましたか?

Please fold your arms naturally. Which picture matches what you did?

2.自然に手を組んでください。どのようになりましたか?

Please fold your hands naturally. Which picture matches what you did?

Subjects were asked to select the pictures of arm folding and hand folding that matched them. The hand-folding test was used for testing the input brain, and the arm-folding test was used for testing the output brain. For hand folding, if the thumb of the right hand is below, the input brain is the right side; if the thumb of the left hand is below, the input brain is the left side. For arm folding, if the right arm is below, the output brain is the right side; if the left arm is below, the output brain is the left side. The pictures for the subjects to select from in the questionnaire are shown in **Figure 7**.

After finishing the health questionnaire, informed consent, dominant hand questionnaire, dominant foot questionnaire, and dominant brain side questionnaire, each subject was instructed to walk in a circular pattern inside the walking area marked by the black tape on the floor. Subjects were free to choose their walking direction, either clockwise or counterclockwise, and could switch direction at any time during each walking trial. The walking trials each subject was asked to perform are as follows.


The intention of the first rehearsal walk is to familiarize the subjects with the room environment and the walking space. The second rehearsal walk, with HoloLens 2 displaying nothing, was added because, as reported in [31–33], if the subjects have never

**Figure 7.** *Arm and hand folding test questions (Source: https://www.lettuceclub.net/news/article/194896/).*


used smart glasses before, gait performance can be unstable. Therefore, we asked each subject to take another rehearsal walk to become familiar with walking while wearing HoloLens 2. After the two rehearsal walks, we showed the neutral video on HoloLens 2 and asked each subject to start walking when the video started and to stop walking when it ended. Then, we showed the first emotional video on HoloLens 2 and asked the subjects to walk following the same procedure as for the first video. After this emotional video ended, the subjects took a 10-minute break to let their emotion return to a normal condition. Finally, we showed the second emotional video on HoloLens 2 and asked the subjects to walk while watching this last video. The order of the two emotional videos was swapped between subjects: positive then negative, or negative then positive. The overall data collection process for the first dataset is shown in **Figure 8**. For the second dataset, the questionnaires for dominant hand, dominant foot, and dominant brain side were administered after the health questionnaire and before the first rehearsal walk.

Furthermore, subjects were asked to report their perceived emotion after finishing the neutral walk, the positive walk, and the negative walk. The questions are as follows.


In the first dataset, only the self-reported emotion questionnaire was used after the neutral, negative, and positive walks. In the second dataset, we added another question after the last self-reported questionnaire, that is, after the final walking trial. Because we were unsure whether the subjects could walk naturally while watching videos on HoloLens 2, we added a question asking them whether they could walk naturally while using HoloLens 2 and asked them to explain the reason.

Sample screenshots of a subject walking in a circular pattern while watching a video on HoloLens 2 are shown in **Figure 9**. A sample image of a subject wearing HoloLens 2 and an OptiTrack motion capture suit with markers is shown in **Figure 10**.

**Figure 9.** *Samples of walking in the recording area.*

**Figure 10.** *A subject wearing HoloLens 2 and an OptiTrack motion capture suit with 37 markers.*

#### **4. Result and discussion**

Two emotional gait datasets were collected. The first dataset, proposed in [34], contains 49 subjects: 41 men and 8 women. The average age is 19.69 years with a standard deviation of 1.40 years. The average height is 168.49 centimeters with a standard deviation of 6.34 centimeters. The average weight is 58.88 kilograms with a standard deviation of 10.84 kilograms. In total, there are 147 walking trials in this dataset. As the order of emotional videos shown to each subject was swapped, 24 subjects watched the negative movie before the positive movie (neutral → negative → positive), and 25 subjects watched the positive movie before the negative movie (neutral → positive → negative). For the emotion perceived by the subjects from the self-reported emotion


questionnaire, there are 44 sad walking trials, 44 happy walking trials, and 59 neither walking trials. The comparison between the expected emotion, which is the annotated emotion of the videos (negative, positive, neutral), and the reported emotion, which is the emotion reported by the subjects after walking (happy, sad, neither), is shown in **Table 2** and **Figure 11**.
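To make this kind of comparison easy to reproduce, the following is a minimal sketch of how such a tally could be computed from per-trial records. The record format is an assumption; the counts themselves come from the questionnaires, not from this code.

```python
from collections import Counter

def emotion_confusion(trials):
    """Tally reported emotion per expected (stimulus) emotion.

    `trials` is an iterable of (expected, reported) pairs, e.g.
    ("positive", "sad"); this record format is an assumption.
    """
    counts = Counter(trials)
    expected_labels = sorted({e for e, _ in counts})
    reported_labels = sorted({r for _, r in counts})
    return {e: {r: counts[(e, r)] for r in reported_labels} for e in expected_labels}

# Example with a few hypothetical trial records
sample = [("positive", "happy"), ("positive", "sad"),
          ("negative", "sad"), ("neutral", "neither")]
print(emotion_confusion(sample))
```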

In addition to the first dataset, we performed another data collection; the resulting second dataset contains 23 subjects: 10 men and 13 women. The average age is 19.91 years, with a standard deviation of 3.04 years. The average height is 164.93 centimeters, with a standard deviation of 9.58 centimeters. The average weight is 57.32 kilograms, with a standard deviation of 11.32 kilograms. In total, this dataset consists of 69 walking trials. The order of emotional videos shown to each subject was also swapped, as in the first dataset: 12 subjects watched the negative movie before the positive movie (neutral → negative → positive), and 11 subjects watched the positive movie before the negative movie (neutral → positive → negative). The reported emotion perceived by the subjects, compared with the expected emotion, is listed in **Table 3** and **Figure 12**. In this dataset, we also collected the dominant hand, dominant foot, and dominant brain side. The results of these questionnaires are as follows.

• **Dominant Hand:** 10 left-handed subjects, 8 right-handed subjects, and 5 both-handed subjects


#### **Table 2.**

*Comparison of expected emotion and reported emotion (first dataset).*

**Figure 11.** *Plots of expected emotion and reported emotion (first dataset).*


**Table 3.**

*Comparison of expected emotion and reported emotion (second dataset).*

**Figure 12.** *Plots of expected emotion and reported emotion (second dataset).*


Dominant hand, dominant foot, and dominant brain side data will be useful in the future when this dataset is used for emotion recognition and analysis of body movements is performed.

According to **Table 2** and **Figure 11**, which show the comparison of expected and reported emotion for the first dataset, not all subjects felt the emotions we intended them to feel. That is, the positive video could not make everyone feel happy, and the negative video could not make everyone feel sad. For the positive video, the number of subjects who felt sad was almost twice the number who felt happy: 12 subjects felt happy while 23 subjects felt sad. For the negative video, more subjects felt sad than happy: 19 subjects felt sad while 13 subjects felt happy. For the neutral video, the results are fairly random, since we intended it not to induce any emotion; accordingly, most subjects reported feeling neither. The other reported emotions for the neutral video, happy and sad, can occur for other reasons. For example, if the subjects had never used HoloLens 2 before, walking while watching a video on HoloLens 2 could make them feel happy; conversely, if the subjects


felt uncomfortable while walking and watching a video on HoloLens 2 at the same time, they could feel sad after watching the neutral video.

For the second dataset, **Table 3** and **Figure 12** show that the reported emotions for the positive movie are 10 subjects feeling happy and 7 feeling sad, which are not very different. These results reveal that emotion induction with the positive movie was not very effective. In the first dataset, many more subjects felt sad after watching the positive movie, which is the opposite of the emotion we intended. In the second dataset, more subjects felt happy than sad after watching the positive movie, but the numbers are close; this still means that emotion induction with the positive video was not effective enough, even though, unlike in the first dataset, happy subjects outnumbered sad ones. Next, considering the negative movie, in the first dataset 13 subjects felt happy and 19 felt sad, whereas in the second dataset no one felt happy and 19 subjects felt sad. These results show that emotion induction with the negative video was more effective in the second dataset than in the first, although the stimulus was the same movie. One possible reason is that the subjects in the second dataset were more sensitive to the negative movie than those in the first dataset. Moreover, the neutral video resulted in random reported emotions in both datasets. This is a good outcome, showing that the neutral video did not induce any emotion, as we expected of it.

From all comparisons between expected and reported emotions, we can see that the reported emotions do not match the expected emotions, which are the annotated emotions of the video stimuli. There are several possible causes. For instance, some subjects might be more inclined to feel sad when they see certain stories: some subjects can feel very sad, whereas others feel only a little sad, feel neither sad nor happy (neutral), or even feel happy when they see the negative movie. This phenomenon is normal, since different people perceive emotions differently, and the same explanation holds for the positive movie: although its annotated emotion is positive, some subjects can feel happy while others feel sad. Another possible reason is that the subjects sometimes cannot fully understand the content of the movies because they watch the movies and walk at the same time. Subjects need to concentrate on walking in addition to watching, so some subjects cannot completely follow the story, and their reported emotion ends up opposite to the emotion we intended. A related explanation is that some components or scenes in the positive movie can make some subjects feel sad; for example, some music or scenes might be very intense, and some subjects are sensitive to such content. Individual preferences are also an important issue that should be considered. For example, if a subject does not like the sci-fi movie we used as the positive stimulus, that movie can make them feel sad simply because it is a kind of movie they dislike. The music soundtrack of the movies can also make the perceived emotions differ from the emotions we expected: if the subjects like the music, they can feel happy even for the negative movie, and if they do not like the music, they can feel sad even for the positive movie. Lastly, if the subjects did not feel well while watching the movie during walking, for example because of motion sickness or boredom, the perceived emotions will be inaccurate and different from the emotions we expected. For this reason, we asked the subjects whether they could walk naturally after they finished walking.

In the first dataset, we did not have this information. For the second dataset, seven subjects answered that they could walk naturally, eight subjects answered that they could not walk naturally, and eight subjects answered that they were unsure. If we

consider their explanations, the subjects who answered that they could walk naturally gave positive feedback. The following are some examples.

• 映像に集中していたから

Because I was concentrating on the video

• 歩きにくさを感じなかったから

Because I did not find it difficult to walk

• 飽きなかったから!

I did not get tired of it!


Unfortunately, the subjects who answered that they could not walk naturally gave negative feedback about walking while watching videos on HoloLens 2. These are some examples.

• 音や映像に気を取られたから

Because I was distracted by the sound and image

• 途中でフラついたり、まっすぐ歩けなかったりしたからです。

Because I was wobbling along the way and could not walk straight

• 映像に集中していて、たまに枠線を超えそうになったから。

Because I was concentrating on the video and sometimes I almost crossed the border.

• 歩く範囲が小さいため

Because the walking range is small

Even among the subjects who answered that they were unsure, some feedback was negative.

• 時々眠たかったから

Because I sometimes felt sleepy

• 映像を見ながら歩くのが少し難しかった

It was a little difficult to walk while watching the video

• 枠外には出なかったが、円状を単調に歩いていたので、目がくらんで、不自然 に歩いていたかもしれないから。

I did not go out of the frame, but I was walking monotonously in a circle, so I might have become dizzy and walked unnaturally.

There is also some positive feedback from the unsure subjects. These answers show that they walked unconsciously, so we could collect very natural walking styles.

• 自分がどう歩いていたか気にしていなかった

I did not care how I was walking

• 何も考えていなかったため、自然だったかはわからないです

I did not think about anything, so I do not know if it was natural

From these answers, we can see that some subjects had difficulty watching videos on HoloLens 2 while walking at the same time. These are very reasonable explanations of why emotion induction was not effective and why the reported emotions differ considerably from the expected emotions. In the first dataset, we did not collect these data, so we cannot know whether the subjects could walk naturally or how they felt after watching videos on HoloLens 2 while walking; for the second dataset, however, we have these data, and they are very useful.

In using Microsoft HoloLens 2 to display emotional videos, we found several things worth considering.

	- The subjects need to pay a great deal of attention to the movie, so some subjects cannot walk naturally, some experience motion sickness, and some feel bored because the movies are too long.
	- If subjects give priority to walking rather than to watching the movie, they may not fully understand the movie content.
	- Showing short clips instead of a full movie, or using animations as mixed reality agents (VR/AR), may be better ways to induce real-time emotion in the subjects.

#### **5. Conclusion**

To summarize, this study extends our previously proposed emotion induction and data collection method [34]. In conventional emotion induction, emotional videos are shown to the subjects before walking, using a computer screen or television. In our method, emotional videos are shown on Microsoft HoloLens 2, a recent smart glasses device. We found that displaying emotional videos on HoloLens 2 while walking can make the subjects express emotions in their gaits unconsciously, and subjects can see the room environment and the stimulus content on HoloLens 2 at the same time. Some subjects think it is easy to walk while watching videos on HoloLens 2, while others say that it is difficult to focus on walking and pay attention to the video content at the same time. Our goal in this study is to induce real-time emotion while walking; however, using full-length movies may not be a good idea, because some participants gave negative feedback. Regarding the walking path, using a non-straight walking path makes data collection more realistic, since human gaits captured in the real world rarely consist of only straight, clean walking. Therefore, if an emotion recognition system is developed and tested on non-straight walking gait data, the opportunities to deploy it in real-world scenarios increase. Additionally, the expected emotions, which are the annotated emotions of the stimuli, should not be used to label a walking trial, since the stimulus emotion can be the opposite of the emotion that the subject perceived. Asking the subjects to report their actual feelings after walking is the best approach available for now. In this study, an OptiTrack motion capture system was used to capture gait data, but marker-based systems such as OptiTrack or Vicon are not mandatory. Marker-less motion capture devices, for example, Microsoft Kinect, can also be used, and even a standard video camera or mobile phone camera combined with pose-estimation software such as OpenPose can capture body movement data for emotion recognition by gait analysis. In summary, this study investigates the possibility of performing emotion recognition and analysis by using smart glasses to induce emotions in subjects. The results show that emotion recognition from human gaits can be performed and is useful in many circumstances. Since emotion recognition is an example of a tangible application of intelligent video surveillance, methods for inducing human emotions should also be considered. To develop an effective emotion recognition system as part of intelligent video surveillance, obtaining a high-quality dataset is an important factor to focus on.

#### **6. Acknowledgment**

The authors would like to thank all participants who joined both of our experiments. Additionally, we appreciate the help and support of all members of the Kaoru Sumi Laboratory at Future University Hakodate, who supported and assisted in both experiments, including experiment venue setup, equipment setup, experimental design, translation of all documents, and interpretation between Japanese and English throughout the entire experimental process.

#### **Funding statement**

This work was supported by JST Moonshot R&D Grant Number JPMJMS2011.


### **Author details**

Nitchan Jianwattanapaisarn<sup>1,2</sup>, Kaoru Sumi<sup>1</sup>\* and Akira Utsumi<sup>2</sup>

1 School of Systems Information Science, Future University Hakodate, Hokkaido, Japan

2 Interaction Science Laboratories, Advanced Telecommunications Research Institute International, Kyoto, Japan

\*Address all correspondence to: kaoru.sumi@acm.org

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Picard RW. Affective Computing. MIT press; 2000

[2] Tiam-Lee TJ and Sumi K. Analysis and prediction of student emotions while doing programming exercises. In: International Conference on Intelligent Tutoring Systems. Springer. 2019. pp. 24–33

[3] Bouchrika I. A survey of using biometrics for smart visual surveillance: Gait recognition. In Surveillance in Action. Cham: Springer; 2018. pp. 3-23. DOI: 10.1007/978-3-319-68533-5\_1

[4] Anderez DO, Kanjo E, Amnwar A, Johnson S, Lucy D. The rise of technology in crime prevention: Opportunities, challenges and practitioners perspectives. 2021. arXiv preprint arXiv:2102.04204

[5] Montepare JM, Goldstein SB, Clausen A. The identification of emotions from gait information. Journal of Nonverbal Behavior. 1987;**11**(1):33-42

[6] Khamsemanan N, Nattee C, Jianwattanapaisarn N. Human identification from freestyle walks using posture-based gait feature. IEEE Transactions on Information Forensics and Security. 2017;**13**(1):119-128

[7] Limcharoen P, Khamsemanan N, Nattee C. View-independent gait recognition using joint replacement coordinates (jrcs) and convolutional neural network. IEEE Transactions on Information Forensics and Security. 2020;**15**:3430-3442

[8] Limcharoen P, Khamsemanan N, Nattee C. Gait recognition and reidentification based on regional lstm for 2-second walks. IEEE Access. 2021;**9**: 112057-112068

[9] Kitchat K, Khamsemanan N, Nattee C. Gender classification from gait

silhouette using observation angle-based geis. In: 2019 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM). IEEE. 2019. pp. 485–490

[10] Isaac ER, Elias S, Rajagopalan S, Easwarakumar K. Multiview gait-based gender classification through pose-based voting. Pattern Recognition Letters. 2019;**126**:41-50

[11] Janssen D, Schöllhorn WI, Lubienetzki J, Fölling K, Kokenge H, Davids K. Recognition of emotions in gait patterns by means of artificial neural nets. Journal of Nonverbal Behavior. 2008;**32**(2):79-92

[12] Roether CL, Omlor L, Christensen A, Giese MA. Critical features for the perception of emotion from gait. Journal of Vision. 2009;**9**(6):15-15

[13] Karg M, Kühnlenz K, Buss M. Recognition of affect based on gait patterns. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2010;**40**(4):1050-1061

[14] Barliya A, Omlor L, Giese MA, Berthoz A, Flash T. Expression of emotion in the kinematics of locomotion. Experimental Brain Research. 2013;**225** (2):159-176

[15] Venture G, Kadone H, Zhang T, Grèzes J, Berthoz A, Hicheur H. Recognizing emotions conveyed by human gait. International Journal of Social Robotics. 2014;**6**(4):621-632

[16] Li B, Zhu C, Li S, Zhu T. Identifying emotions from non-contact gaits information based on microsoft kinects.


IEEE Transactions on Affective Computing. 2016;**9**(4):585-591

[17] Li S, Cui L, Zhu C, Li B, Zhao N, Zhu T. Emotion recognition using kinect motion capture data of human gaits. PeerJ. 2016;**4**:e2364

[18] Zhang Z, Song Y, Cui L, Liu X, Zhu T. Emotion recognition based on customized smart bracelet with built-in accelerometer. PeerJ. 2016;**4**:e2258

[19] Chiu M, Shu J, Hui P. Emotion recognition through gait on mobile devices. In: 2018 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE. 2018. pp. 800–805

[20] Quiroz JC, Geangu E, Yong MH. Emotion recognition using smart watch sensor data: Mixed-design study. JMIR Mental Health. 2018;**5**(3):e10153

[21] Xu S, Fang J, Hu X, Ngai E, Guo Y, Leung V, et al. Emotion recognition from gait analyses: Current research and future directions. arXiv preprint arXiv: 2003.11461. 2020.

[22] Lemke MR, Wendorff T, Mieth B, Buhl K, Linnemann M. Spatiotemporal gait patterns during over ground locomotion in major depression compared with healthy controls. Journal of Psychiatric Research. 2000;**34**(4–5):277-283

[23] Michalak J, Troje NF, Fischer J, Vollmar P, Heidenreich T, Schulte D. Embodiment of sadness and depression gait patterns associated with dysphoric mood. Psychosomatic Medicine. 2009; **71**(5):580-587

[24] Gross MM, Crane EA, Fredrickson BL. Effort-shape and kinematic assessment of bodily expression of emotion during gait. Human Movement Science. 2012;**31**(1):202-221

[25] Destephe M, Maruyama T, Zecca M, Hashimoto K, Takanishi A. The influences of emotional intensity for happiness and sadness on walking. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE. 2013. pp. 7452-7455

[26] Sun B, Zhang Z, Liu X, Hu B, Zhu T. Self-esteem recognition based on gait pattern using kinect. Gait & Posture. 2017;**58**:428-432

[27] Halovic S, Kroos C. Not all is noticed: Kinematic cues of emotion-specific gait. Human Movement Science. 2018;**57**: 478-488

[28] Sadeghi H, Allard P, Duhaime M. Functional gait asymmetry in ablebodied subjects. Human Movement Science. 1997;**16**(2-3):243-258

[29] Kang GE, Gross MM. Emotional influences on sit-to-walk in healthy young adults. Human Movement Science. 2015;**40**:341-351

[30] Kang GE, Gross MM. The effect of emotion on movement smoothness during gait in healthy young adults. Journal of Biomechanics. 2016;**49**(16): 4022-4027

[31] Kim S, Nussbaum MA, Ulman S. Impacts of using a head-worn display on gait performance during level walking and obstacle crossing. Journal of Electromyography and Kinesiology. 2018;**39**:142-148

[32] Sedighi A, Ulman SM, Nussbaum MA. Information presentation through a head-worn display ("smart glasses") has a smaller influence on the temporal structure of gait variability during dualtask gait compared to handheld displays (paper-based system and smartphone). PLoS One. 2018;**13**(4):e0195106

[33] Sedighi A, Rashedi E, Nussbaum MA. A head-worn display ("smart glasses") has adverse impacts on the dynamics of lateral position control during gait. Gait & Posture. 2020;**81**: 126-130

[34] Jianwattanapaisarn N, Sumi K. Investigation of real-time emotional data collection of human gaits using smart glasses. Journal of Robotics, Networking and Artificial Life. 2022;**9**(2):159-170. DOI: 10.57417/jrnal.9.2\_159

[35] Baveye Y, Dellandréa E, Chamaret C, Chen L. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In: 2015 International Conference on Affective Computing and Intelligent Interaction (acii). IEEE. 2015. pp. 77–83

#### **Chapter 9**

## Combining Supervisory Control and Data Acquisition (SCADA) with Artificial Intelligence (AI) as a Video Management System

*Muhammad H. El-Saba*

#### **Abstract**

The latest video management system (VMS) software relies on CCTV surveillance systems that can monitor a larger number of cameras and sites more efficiently. In this chapter, we study the utilization of SCADA to control a network of surveillance IP cameras: the video data are acquired from IP cameras, stored and processed, and then transmitted and remotely controlled via SCADA. Such a SCADA application will be very useful in VMS in general and in large integrated security networks in particular. In fact, modern VMS are progressively augmented with artificial intelligence (AI) and machine learning (ML) algorithms to improve their performance and detectability in a wide range of control and security applications. In this chapter, we discuss the utilization of existing SCADA cores to implement highly efficient VMS with minimum development time. We show that such SCADA-based VMS programs can easily incorporate AI and deep ML algorithms. We also show that the harmonious utilization of neural network algorithms (NNA) in the software core will lead to improved performance in terms of motion detection speed and other smart analytics, as well as system availability.

**Keywords:** SCADA, distributed control systems (DCS), security, artificial intelligence (AI), face recognition, machine learning, video surveillance, video analytics

#### **1. Introduction**

Supervisory Control and Data Acquisition (SCADA) is a software overlay application used on top of intelligent control networks. The control nodes are traditionally smart sensors and programmable logic controllers (PLCs) [1–3]. In fact, the SCADA industry started from the need for a user-friendly front-end to control systems containing smart devices and PLCs. SCADA systems have evolved rapidly, however, and are now integral to the reliable operation of the modern infrastructures [2, 3] of smart cities. SCADA systems have made substantial progress over recent years to increase their functionality, scalability, performance, and openness. As shown in **Figure 1**, the main components of a SCADA system are as follows:


Indeed, it is possible to purchase a SCADA system from a single supplier or to tailor a SCADA system from the components of different manufacturers, such as Siemens and Allen-Bradley PLCs. This chapter presents a method based on neural networks (NN) for monitoring and operating video management systems (VMS), like those in traffic control networks and electronic plaza sites. The method suggests that the thresholds used for generating alarms can be adapted to each surveillance device (e.g., an IP camera). The intelligent SCADA method has been utilized in other application fields, for example, in electrical power control and renewable energy systems [3].

In this chapter, we show how to exploit such existing SCADA programs to implement a wide-area video management system (VMS) that incorporates state-of-the-art AI technologies, such as access control, intrusion detection, face recognition, license plate recognition, crowd detection, and city surveillance. These technologies have been implemented in our emerging VMS, Xanado [4], which is expected to have unique value in identifying criminals and terrorists, patrolling highways, and aiding forensics.

Such SCADA solutions are multi-tasking and are based upon a real-time database that is located on dedicated servers of the system. Such SCADA servers are responsible for data acquisition and handling (e.g., data polling, alarm checking, logging verification, and data archiving) on the basis of a set of chosen parameters.

#### **Figure 1.**

*Conventional architecture of a SCADA system. The master station refers to the servers and software responsible for communicating with remote terminal units (RTUs), such as PLCs.*


#### **2. SCADA-based video surveillance solutions**

One of the main advantages of SCADA systems is that they allow operators to visualize, in real time, what is happening in any particular industrial process, react to alarms, control processes, change configurations, and track information. However, SCADA systems differ from the distributed control systems (DCSs) generally found in industrial plant sites: while a DCS covers a plant site, a SCADA system covers much larger areas. Similarly, wide-area video monitoring requires large-scale monitoring systems, like those of SCADA networks. For operations that span several sites, it is important to have a central monitoring station that acts as eyes and ears across all sites. Central monitoring stations use diverse types of cameras and sets of technology to monitor and protect people and property, especially when personnel cannot be on site, and it is important that these technologies work together to create a holistic monitoring system. Fortunately, SCADA architecture supports TCP/IP, UDP, and other IP-based communications protocols, which makes it well suited to video surveillance control with a network of IP cameras. In fact, SCADA systems have traditionally used combinations of direct serial buses, Ethernet, or Wi-Fi connections to meet communication requirements, as well as IP over SONET (Synchronous Optical Network) at large sites. **Figure 2** depicts the architecture of SCADA-based VMS programs employing neural network algorithms (NNA). The NNA aims to improve the control of the system through an iterative supervised process. The objective is to determine and optimize the SCADA-VMS control parameters for specific sites with specific surveillance devices. The chosen parameters, such as the preferred angles of PTZ cameras, the detection speed (of motion anomalies), and the causes of false alarms, will help to increase surveillance performance and system availability. In addition, the optimized system will minimize false alarms in a continuous adaptive manner, according to each site's specific equipment. The NNA is based on finding differences in the behavior of the surveillance system over time, and the iterative process starts from the stored SCADA-VMS database, as shown in **Figure 2**.
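As a simplified illustration of this per-device adaptation, the sketch below keeps running statistics of a camera's motion score and nudges its alarm threshold toward them; a deployed system could replace this statistical update with the trained NNA described above. All class, parameter, and variable names are hypothetical.

```python
class AdaptiveAlarmThreshold:
    """Per-camera alarm threshold adapted from historical motion scores."""

    def __init__(self, initial=0.5, rate=0.05, margin=3.0):
        self.threshold = initial
        self.rate = rate        # how fast the threshold adapts
        self.margin = margin    # number of standard deviations above the running mean
        self.mean = 0.0
        self.var = 1.0
        self.n = 0

    def update(self, motion_score):
        """Update running statistics, adapt the threshold, and report whether to alarm."""
        self.n += 1
        delta = motion_score - self.mean
        self.mean += delta / self.n
        self.var += (delta * (motion_score - self.mean) - self.var) / self.n
        target = self.mean + self.margin * self.var ** 0.5
        self.threshold += self.rate * (target - self.threshold)
        return motion_score > self.threshold   # True -> raise an alarm

# One instance per IP camera, fed by the SCADA polling loop.
cam_12 = AdaptiveAlarmThreshold()
for score in [0.10, 0.12, 0.09, 0.95]:
    print(cam_12.update(score))   # only the last, anomalous score triggers an alarm
```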

The SCADA-based VMS programs, with NNA, can help in this context and can easily incorporate the following features and intelligent analytics:


#### **2.1 Object detection and tracking**

The process of identifying objects in an image and finding their positions is known as object detection. **Figure 3** depicts the object detection and identification tasks. This task has benefited greatly from the field of computer vision assisted by AI. As shown, a model trained using deep learning must be evaluated for its performance on held-out data called the test dataset [5–9].
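As a minimal, hedged sketch of what such a detection stage can look like, the snippet below runs a pretrained detector from torchvision (one possible open-source choice, not the system described in this chapter) on a single frame; the image file name and the 0.8 confidence cut-off are assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Load a detector pretrained on COCO and switch it to inference mode.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# "frame.jpg" is a hypothetical surveillance frame.
frame = convert_image_dtype(read_image("frame.jpg"), torch.float)
with torch.no_grad():
    detections = model([frame])[0]          # dict with boxes, labels, scores

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:                          # keep confident detections only
        print(int(label), [round(float(c), 1) for c in box], float(score))
```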

#### **2.2 Face recognition**

Face identification offers advantages in access control, safety, security, retail stores, and traffic control. Face recognition is an analytic program that identifies persons from their facial features in an image or in surveillance video.

Face recognition programs usually use AI to quickly identify and interpret complex facial patterns. When comparing a face image to a database of previously stored images of known faces, the AI algorithm determines the best match. In fact, face recognition and analysis algorithms have enabled security systems to capture many wanted criminals and stop many crimes.

As facial recognition increases in efficacy, the number of other applications will also increase (e.g., in banking, retail stores, and transportation). Thanks to deep learning-based AI algorithms, face analysis can not only identify people with high accuracy but also provide extraordinary analytic capabilities; for instance, it can now detect criminal behavior from a person's mood and gestures. With 3D/4D digital signal processing (DSP) and AI, face recognition technology will expand toward identifying the actions of terrorists and criminals by incorporating motion detection algorithms.
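A minimal sketch of the matching step is shown below, using the open-source face_recognition library (one possible choice, not necessarily the one used in any particular product); the enrollment image files are hypothetical, and 0.6 is that library's customary distance threshold.

```python
import face_recognition  # open-source library; an illustrative choice only

# Known identities: name -> 128-d face encoding, built once from enrollment photos.
known = {}
for name, path in [("alice", "alice.jpg"), ("bob", "bob.jpg")]:   # hypothetical files
    image = face_recognition.load_image_file(path)
    known[name] = face_recognition.face_encodings(image)[0]       # assumes one face per photo

# Identify the best match for each face found in a surveillance frame.
frame = face_recognition.load_image_file("frame.jpg")
for candidate in face_recognition.face_encodings(frame):
    names = list(known)
    distances = face_recognition.face_distance([known[n] for n in names], candidate)
    best = min(range(len(names)), key=lambda i: distances[i])
    print(names[best] if distances[best] < 0.6 else "unknown", float(distances[best]))
```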

#### **2.3 Automatic number plate recognition (ANPR)**

ANPR is analytic software that reads vehicle plates and automatically matches them against registered vehicle license plates, without the need for human

**Figure 3.**

*Block diagram of object detection and identification by artificial intelligence.*


intervention. Therefore, ANPR offers accurate identification and safety for vehicle access and traffic control. ANPR is usually implemented using optical character recognition (OCR) and convolutional neural networks (CNN). A CNN is a widely used neural network architecture for computer vision tasks; it automatically extracts important features from images. More details about CNNs are provided in Section 4.
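As a rough, hedged sketch of an OCR-based pipeline (a contour heuristic standing in for the CNN-based localization a production ANPR system would use), the snippet below finds a rectangular plate candidate with OpenCV and reads it with the Tesseract OCR engine; the image file and thresholds are assumptions.

```python
import cv2
import pytesseract  # Python wrapper around the Tesseract OCR engine

def read_plate(image_path):
    """Very simplified ANPR: locate a rectangular plate candidate, then OCR it."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.bilateralFilter(gray, 11, 17, 17), 30, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    for c in sorted(contours, key=cv2.contourArea, reverse=True)[:10]:
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4:                       # roughly rectangular region
            x, y, w, h = cv2.boundingRect(approx)
            plate = gray[y:y + h, x:x + w]
            return pytesseract.image_to_string(plate, config="--psm 7").strip()
    return None

print(read_plate("car.jpg"))  # hypothetical image file
```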

#### **3. Artificial intelligence in video surveillance and video analytics**

Artificial intelligence (AI) is nowadays revolutionizing video management systems (VMS) and the ways of securing smart premises and smart cities through video surveillance and control. AI-based technology can improve security and surveillance equipment by enhancing object detection and motion interpretation, as well as providing analytics with increasingly reliable data. The reduction of false alarms in security systems is one of the major benefits of AI-based tools and algorithms. For instance, AI-CCTV cameras are networked IP cameras that deliver advanced analytical functions, such as face recognition, vehicle classification, car counting, license plate recognition (LPR), and other traffic analytics. Advanced video analytics software is built into the camera and recorder, which then enables artificial intelligence functions. Some AI algorithms are rule-based and others are self-learning. Like typical CCTV cameras, AI-CCTV stores information so that any incident can be reviewed; however, AI-CCTV can also detect and send alerts in real time. In particular, SCADA legacy systems can help greatly with large-site and wide-area VMS.

Artificial intelligence (AI) tools depend heavily on neural networks (NN) and computer vision [5]. As shown in **Figure 4**, a neural network (NN) is a system of software or hardware that mimics the operation of human brain neurons. An NN is simply a group of interconnected layers of perceptrons. Note that an NN has multiple hidden layers, and each layer has multiple nodes. The neural network takes the training data in the input layer and forwards it through the hidden layers, on the basis of specific weights at each node [10], and then returns an output value at the output layer. The inputs to nodes in a single layer have adaptable weights that affect the final output prediction.

Many different kinds of neural networks are used in machine learning projects, including recurrent neural networks, feed-forward neural networks, and convolutional neural networks (CNNs). It can take some time to tune a neural network properly to obtain consistent, reliable results. Training and testing the NN is important before deciding which parameters (e.g., the input features of a face image) are important in a recognition model; a minimal sketch of such a network is given after **Figure 4**.

**Figure 4.** *Schematic diagram of a neural network (NN) and how it works.*
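For concreteness, the following is a minimal sketch of such a feed-forward network using PyTorch (one possible framework, not named in this chapter); the layer sizes, the 128-dimensional input (e.g., a face feature vector), and the 10-class output are illustrative assumptions.

```python
import torch
from torch import nn

# Minimal feed-forward network matching the description above: an input layer,
# hidden layers with adaptable weights, and an output layer.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # hidden layer 1
    nn.Linear(64, 32), nn.ReLU(),    # hidden layer 2
    nn.Linear(32, 10),               # output layer (class scores)
)

x = torch.randn(4, 128)              # a batch of 4 input feature vectors
scores = model(x)                    # forward pass through the hidden layers
print(scores.shape)                  # torch.Size([4, 10])
```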

**Figure 5.**

*Schematic diagram of the layers of a convolutional neural network (CNN) showing its classification sequence of a handwritten character (in ASCII).*

#### **4. Deep learning-based video surveillance solutions**

In deep learning, a convolutional neural network (CNN) is a kind of NN that is particularly useful in video surveillance projects. The advantage of deep learning (DL)-based algorithms with respect to legacy computer vision algorithms is that DL systems can be continuously trained and improved with updated datasets.

In DL, a CNN is commonly used in image recognition and processing, with emphasis on machine vision of images and video. As shown in **Figure 5**, the layers of a CNN consist of an input layer, an output layer, and hidden layers that include multiple convolutional layers, pooling layers, fully connected layers, and normalization layers.
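The following is a minimal sketch of such a CNN in PyTorch (an assumed framework); the 28 x 28 single-channel input, mirroring the handwritten character of **Figure 5**, and the 10 output classes are illustrative assumptions.

```python
import torch
from torch import nn

# Minimal CNN mirroring the layer types listed above:
# convolution, normalization, pooling, and a fully connected output layer.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # fully connected output layer
)

print(cnn(torch.randn(1, 1, 28, 28)).shape)       # torch.Size([1, 10])
```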

Deep learning systems have shown a remarkable ability to detect undefined or unexpected events. This feature has the true potential of significantly reducing false alarm events that happen in many security video analytics systems. Many applications have shown that deep learning systems can "learn" to achieve 99.9% accuracy for certain tasks, in contrast to rigid computer algorithms where it is very difficult to improve a system past 95% accuracy [4].

#### **5. Xanado program**

Xanado is video management system (VMS) software designed for large-scale and high-security installations. It is built as a client-server DCS to ensure end-to-end protection of video integrity and to boost the overall performance of existing hardware. In addition to central management of all data servers, IP cameras, and users in a multi-site setup, Xanado includes an integrated video wall for operators requiring overall awareness of any event. The software supports failover recording servers, making it well suited to mission-critical installations that require continued access to video recording in case of a server failure. Xanado is ideal for installations with 24/7 operation requirements and can run on high-speed recording engines (NVR), making it suitable for monitoring airports, banks, traffic control, and smart city surveillance.

The general system architecture is shown in **Figure 6**. As shown in the figure, the management data server lies at the center of the VMS. It holds the main application and handles the system configuration. Note that the recording server is responsible

**Figure 6.** *Xanado general system architecture.*

for all communication, recording, and event handling related to devices, such as cameras and I/O modules. The system stores the video in a customized database. The management server, event server, and log server use an SQL server to store configuration, alarms, and log events. As the VMS is designed for a large-scale operation, the Management Client may run locally or remotely, for centralized administration.

The smart client nodes (SCN) work as follows. An SCN connects to the management server and attempts to log in. The management server authenticates the user, and the user-specific configuration is retrieved from the SQL database. The login is then granted and the configuration is sent to the SCN. Live video streams are then retrieved from the cameras by the recording server, which sends a multicast stream to the multicast-enabled network. This requires that all switches handling the data traffic between the SCN and the recording server be configured for multicast.

We adopt the ONVIF standard [11] for full video interoperability in multi-vendor installations to ensure information exchange by a common protocol. The ONVIF protocol profiles are collections of specifications for interoperability between ONVIF compliant devices, such as cameras and NVR.

Xanado VMS has many add-on modules and can be tailored to several specific applications. In the following section, we describe an application of our VMS to electronic toll plaza control, which is utilized nationwide in highway traffic control.

#### **6. Case study: toll plaza control**

Toll plazas are utilized on traffic highways to collect fees from passing vehicles. A toll plaza consists of six zones: an approach zone, a queue area, toll lanes, the toll island, a departure zone, and a bailout lane, plus, in some cases, a terminal supervisor. So-called Electronic Toll Collection (ETC) enables toll collection without delaying or fully stopping vehicles. This section deals with the equipment and software to be

**Figure 7.** *Schematic of a booth in an electronic toll office, which communicates with a passing vehicle.*

installed on the roadside of toll plaza networks. We briefly describe the basic concepts for installing an electronic toll collection system. **Figure 7** is a schematic of a single booth and a lane of an electronic toll plaza.

#### **6.1 Main specifications of a smart toll-plaza control system**

A smart toll collection system has the following four components. The first three components are usually installed at the toll booths. The last component, the backend, is installed in the control room and is connected with, and manages, the complete toll collection process [12].


#### **6.2 Toll plaza components equipment**

The toll system comprises a Lane System and a Plaza System, integrated into an architecture that facilitates easy and accurate toll collection. **Figures 8** and **9** depict the lane and plaza equipment. The OHLS (overhead lane signal) is often required. The so-called AVCC (auto vehicle classification system) is needed to determine the different fares for different vehicle types if the lanes have no signage indicating the type of passing vehicle. AVCC systems may be treadle systems that use a combination of vehicle magnetic loops, height sensors, and piezoelectric sensors. Alternatively, the AVCC may be IR-based.


**Figure 8.** *Schematic of toll-plaza lane and booth equipment.*

#### **Figure 9.**

*Schematic of plaza equipment.*

Also, the PBX telephone, which may be required inside each booth, is not shown. The WIM (weigh-in-motion) platform, which senses and records the vehicle weight, is dedicated to truck traffic control. Additional cameras should be installed inside each booth. In fact, a major challenge faced by any concessionaire operating toll roads is the prevention of revenue leakage, which can reach as high as 20% of the daily collection in unmonitored systems.

#### *6.2.1 Booth & Lane equipment list*

a-Booth Equipment: 1. Toll Lane Controller; 2. Operator Terminal Screen (POS); 3. Receipt Printer; 4. Barcode Reader; 5. Barrier Controller; 6. Alarm; 7. Document Viewer; 8. WIM Indicator.

b-Lane Equipment: 9. OHLS; 10. Fare Display; 11. Incident Capture Outdoor Camera; 12. Barrier Gate; 13. Traffic Light; 14. AVCC; 15. Vehicle Entry Loop; 16. Vehicle Exit Loop; 17. Outdoor ANPR Camera; 18. Smart Card or RFID Reader; 19. WIM Platform.

#### *6.2.2 Plaza equipment*

The toll Plaza equipment consists of:

1. Database Server; 2. CCTV Display Screen; 3. POS Workstation; 4. CCTV (extra security) Cameras; 5. Report Printer; 6. IP Phone; 7. Network Switches; 8. IP phone Master (PABX) Unit; 9. UPS Power Supply.

#### **6.3 Plaza network installation procedure**

The first step, before the installation of any security system, is to ensure the presence of detailed drawings and documentation and a bill of materials with specifications. The installation of the ETC system uses many devices, such as vehicle-mounted electronic tags, toll point-of-sale (POS) terminals, RFID readers, and switches.

#### **Figure 10.**

*Connection scenario #1 of a toll plaza using a ring fiber between booths and plaza control room.*

**Figure 11.** *Illustrative example of connecting the plaza network to the internet.*


#### **Figure 12.**

*Example of a VLAN configuration of a toll plaza network.*

There are several scenarios for connecting the plaza network using Ethernet copper cables and/or optical fibers, and several choices of network topology. For instance, one can use a bus topology, rings, μ-rings, or a hybrid bus/ring topology. One of these scenarios is depicted in **Figure 10**.

If a sufficiently long multi-core optical fiber cable is available, a large ring can also be installed between all switches in the main booth and the control room.

Note that the toll plaza data network should be designed to be secured and isolated from direct exposure to other internet users. In particular, the sensitive video signal (from IP cameras) should be routed indirectly to the internet via the NVR, which connects through a separate NIC card to the plaza server. The plaza server can then be connected to the internet by another NIC card. This is illustrated in **Figure 11**. In all cases, a virtual private network (VPN) should be installed before routing the plaza signals to any external WAN, such as the internet.

In all cases, the data switches divide the core network into small subnets, called VLANs, for instance by sorting node devices by function or position (camera, POS, etc.). VLANs are often associated with IP subnets; hence, devices in different VLANs will not be visible to each other. On the physical layer, the network remains the same, as shown in **Figure 12**.
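As a small illustration of this kind of VLAN and subnet planning, the sketch below carves one /24 per device role out of a private supernet using Python's standard ipaddress module; the VLAN IDs, role names, and address ranges are assumptions, not the values of **Table 1**.

```python
import ipaddress

# Illustrative VLAN/subnet plan: one /24 per device role, carved from a
# private supernet. All values here are assumed for illustration only.
supernet = ipaddress.ip_network("10.10.0.0/16")
roles = ["cameras", "pos_terminals", "nvr_servers", "management"]
plan = dict(zip(roles, supernet.subnets(new_prefix=24)))

for vlan_id, (role, net) in enumerate(plan.items(), start=10):
    print(f"VLAN {vlan_id}: {role:<14} {net} ({net.num_addresses - 2} usable hosts)")
```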

#### **6.4 IP planning**

The only way for someone to access the CCTV system is to know the IP address, username, and password (**Table 1**).

#### **6.5 Port forwarding and accessing the internet**

To view a security CCTV system remotely, you have to allow it to communicate with the internet. To allow access to your system from the internet, you have to configure the firewall inside the router to forward traffic to the NVR. This process is called port forwarding and requires some advanced knowledge. There are many guides on port forwarding if you feel confident configuring remote viewing. Each router has its own method for port forwarding, and we recommend checking PortForward.com for your specific model.


#### **Table 1.**

*Example of IP planning of the overall data network.*

#### **Figure 13.**

*Modular architecture of Xanado VMS, with application to traffic surveillance and traffic control at a toll plaza network.*


#### **6.6 Remote viewing using a smartphone**

There are Android and iPhone apps that work with CCTV systems, such as IPTecno Pegaso [12]. Such an app can connect to the system and display live video feeds. It also allows one-way and two-way audio interaction, PTZ camera control, and motorized camera control. To learn how to connect your system, please follow any guide on how to view security cameras from an iPhone or Android device, such as: https://www.cctvcameraworld.com/how-to-view-security-cameras-from-phone/.

#### **6.7 Plaza database server and toll management system**

The smart toll collection system can greatly reduce the time taken by each vehicle to pay the toll fees. There are existing solutions that are deployed and well suited in practice. A distributed database platform should be installed to enable ticket issuing and validation operations. In the control system, a failure of LAN connectivity should not impact lane operation.

#### **6.8 ETC management system**

**Figure 13** depicts the modular architecture of Xanado, with emphasis on traffic surveillance and control at toll plazas. The software provides comprehensive capabilities to manage toll collection operations 24/7. The Central Administration Module facilitates the entire operation of monitoring and collection across toll plazas as one centralized unit. The Plaza Module can configure and manage all plaza toll collection operations and report toll collection in an audited manner. The Lanes Module, running under the Plaza Module, facilitates accurate toll collection for vehicles passing through the toll plaza.

#### **7. Conclusions**

Public concern over security in recent years has driven the demand for video surveillance. Security and video surveillance systems need to evolve continuously to cope with the new hardware capabilities of IP cameras and video storage equipment. In addition, modern VMS software increasingly requires video analytics to perform its job of monitoring activities and protecting people and their property. Governments and police departments worldwide are constantly looking for new CCTV surveillance features that will help prevent crime. The latest VMS software relies on CCTV surveillance systems that can monitor a larger number of cameras and sites more efficiently. Therefore, combining SCADA features with VMS is significant.

This chapter presents a method based on neural networks (NN) for monitoring and operating video surveillance systems (VMS), like those in traffic control networks of electronic plaza sites. The method suggests that the thresholds used for generating alarms can be adapted to each surveillance device (e.g., IP Camera).

The industry needs to do more research on hybrid systems that combine the best of SCADA and AI algorithms together with VMS software.


### **Author details**

Muhammad H. El-Saba

Professor at the Department of Electronics and Communication Engineering, Engineering College, Ain-Shams University, Cairo, Egypt

\*Address all correspondence to: mhs1308@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] El-Saba MH. Supervisory Control and Data Acquisition (SCADA). Measurement & Instrumentation Systems. Hakim; 2017. Available from: https://www.researchgate.net/publication/353142664\_Supervisory\_Control\_And\_Data\_Acquisition\_SCADA

[2] Hentea M. Improving Security for SCADA Control Systems. 2008. Available from: https://www.researchgate.net/publication/253290135

[3] Marugán AP, Márquez FG. SCADA and Artificial Neural Networks for Maintenance Management. In: International Conf. on Management Science and Engineering Management. Cham: Springer; 2017. pp. 912-919

[4] El-Saba MH. Xanado, Video Management System, to be published

[5] Whittaker D. Why AI CCTV is the Future of Security and Surveillance in Public Spaces. Security magazine, USA, securitymagazine.com, 2021

[6] Bharadwaj HS, Biswas S, Ramakrishnan KR. A large scale dataset for classification of vehicles in urban traffic scenes. In: Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image (ICGIP). Published in Association for Computing Machinery (ACM), New York, NY, USA. 2016

[7] Mohana et al. Design and implementation of object detection, tracking, counting and classification algorithms using artificial intelligence for automated video surveillance applications. In: Advanced Computing and Communication Society (ACCS), 24th annual International Conference on Advanced Computing and Communications (ADCOM-2018). Bangalore: IIITB; 2018

[8] Kain Z et al. Detecting abnormal events in university areas. In: 2018 International conference on Computer and Applications Beirut, Lebanon, IEEE. 2018

[9] Mohana, Aradhy HVR. Design object detection and tracking using deep learning and artificial intelligence for video surveillance applications. International Journal of Advanced Computer Science and Applications. 2019;**10**(12)

[10] Turaga SC, Murray JF, Jain V, Roth F, Helmstaedter M, Briggman K, et al. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation. 2010;**22**(2):511-538

[11] Balamurugan Rajagopal BGSA, Parasuram K. A robust framework to detect moving vehicles in different road conditions in India. Journal of Theoretical and Applied Information Technology. 2019;**96**(1):1-14

[12] Wang P, Li L, Jin Y,Wang G. Detection of unwanted traffic congestion based on existing surveillance system using in freeway via a CNN-architecture trafficnet, 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), Wuhan IEEE. 2018. pp. 1134-1139

### *Edited by Pier Luigi Mazzeo*

The development of new technologies based on artificial intelligence and computer vision, together with the possibility of connecting different devices together in realtime, has enabled the development and progress of intelligent video surveillance. Thanks to IoT technologies, high-resolution cameras can be networked to monitor their territory, collect recordings and have them analyzed by artificial intelligence systems trained to identify critical situations. This book discusses new achievements in intelligent video surveillance solutions and their future prospects.
