
#### **Chapter 1**

## Introductory Chapter: Intelligent Video Surveillance – What's Next?

*Pier Luigi Mazzeo*

#### **1. Introduction**

Most large urban areas are rapidly transforming into smart cities, with infrastructures able to handle the huge quantities of data produced by Internet of Things devices for intelligent analysis. Decreasing human involvement and improving people's quality of life are the two main objectives that smart cities propose to reach. A concrete research field whose applications satisfy both of these principles is smart video surveillance [1].

Intelligent video surveillance systems aim to analyze the observed scene using machine learning, computer vision, and data analytics in order to minimize or completely eliminate human intervention.

The demand for intelligent security systems capable of recognizing both natural emergencies, such as fires, floods, and earthquakes, and human-made emergencies, such as violence, traffic accidents, and weapon threats, is growing steadily [2].

Intelligent video surveillance systems are usually adopted in different contexts, spanning from public areas and infrastructures to commercial buildings. They often serve a dual purpose: i) real-time monitoring of physical assets and areas and ii) reviewing collected video information to estimate security indicators and plan safety measures accordingly.

In recent decades, intelligent video surveillance systems have been widely employed in the public and security sectors, but a significant interest in these topics has now been raised by other stakeholders. This interest has been caused by the constant increase in crime rates and in national and international security threats, which are driving incredible growth in the market for video surveillance and security systems. A report by Mordor Intelligence [3] estimates that the video surveillance market was valued at 30 billion dollars in 2016 and is expected to reach a value of 72 billion dollars by the end of 2022. A boost to the market perspective is also given by recent results in artificial intelligence and digital technologies, which introduce intelligence, scalability, and higher accuracy into video surveillance solutions. Some spontaneous questions arise: what are the main technology trends in smart video surveillance, and how can they best be used?

### **2. Technology trends in intelligent video surveillance solutions**


### **3. Designing and developing video surveillance systems**

All the technologies described above open new challenges and possibilities in the development, deployment, and operation of a new generation of intelligent video surveillance systems. An important role is played by the developers and implementers of these intelligent surveillance systems, who should integrate and exploit the full features of the aforementioned cutting-edge technologies. To reach this objective, it is crucial to design and realize the right architecture for the video surveillance framework. Novel intelligent video surveillance solutions follow the **edge/fog-computing paradigm** [5] (see **Figure 1**) to process video data sources directly near the observed scene. This paradigm saves bandwidth while performing real-time security monitoring. Smart cameras are placed at the edge of the designed network and become edge nodes, where frames are grabbed and processed "in situ." In this way, the intelligence is decentralized: the edge nodes can perform intelligent data collection and tune the frame rate according to the recognized security context. Furthermore, they are linked to the cloud architecture, where information from multiple cameras is merged, assessed, and processed on longer time scales.

Edge/fog-computing architectures [5] are the best choice for supporting the integration of legacy video surveillance systems with current technologies. IoT-driven drones will be combined with suitable edge nodes and will become part of a mobile edge-computing infrastructure. It is strongly recommended that real-time processing of the acquired video streams be performed at the edge, rather than in the cloud, of the video surveillance architecture. By contrast, deep learning can be performed both at the edge and in the cloud of the video surveillance infrastructure: if deep neural networks are placed at the edge, they can extract complex feature patterns in real time. However, the extraction of complex feature patterns and information over wider areas observed by several edge nodes (e.g., a city-level structure) should be done only if deep learning is implemented in the cloud.

In general, it is difficult to find the right balance between the functionality to place in the cloud and that to place on the edge. Decisions are made by trading off opposing features (e.g., processing speed versus the accuracy of the results for some surveillance tasks).
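To make the edge-node pattern discussed above concrete, the following minimal Python sketch shows frames being processed in situ, the frame rate being tuned to the recognized security context, and only compact event summaries being sent to the cloud. This is an illustrative sketch, not the chapter's implementation: `detect_events` and `CloudClient` are hypothetical placeholders for a lightweight on-device detector and a cloud uplink, while OpenCV is assumed only for frame grabbing.

```python
# Minimal edge-node sketch: process frames in situ, adapt the frame rate to the
# detected security context, and forward only compact summaries to the cloud.
import time

import cv2  # OpenCV, assumed here only for frame acquisition


def detect_events(frame):
    """Placeholder lightweight detector; a real edge node would run a compact
    neural network here and return a list of detected security events."""
    return []


class CloudClient:
    """Placeholder uplink; a real implementation would batch and transmit
    event summaries (not raw video) to the cloud tier."""

    def send(self, summary):
        print("uplink:", summary)


def run_edge_node(source=0, idle_fps=1.0, alert_fps=10.0):
    cap = cv2.VideoCapture(source)
    cloud = CloudClient()
    fps = idle_fps
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        events = detect_events(frame)            # heavy lifting stays at the edge
        fps = alert_fps if events else idle_fps  # tune frame rate to the context
        if events:
            # bandwidth-friendly summary instead of the raw stream
            cloud.send({"ts": time.time(), "events": events})
        time.sleep(1.0 / fps)
    cap.release()
```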

All smart video surveillance solutions should take advantage of open equipment from different camera and device vendors. In fact, a surveillance system may contain different devices and video capture means (e.g., high-definition cameras, wired or wireless cameras, cameras on drones/UAVs, and so on). An open architecture brings many benefits, namely flexibility, elasticity, and technological durability. In recent years, some effort has been spent on defining an open, standards-based architecture for edge/fog computing, with video surveillance as the principal example of the use of fog computing.

**Figure 1.** *Edge/fog-computing paradigm.*

#### **4. Discussion and future works in intelligent video surveillance solutions**

Choosing the right configuration when designing an edge-computing architecture for a video surveillance system raises other kinds of challenges, including privacy preservation and data protection compliance. For example, the production of surveillance devices is subject to many privacy and data protection regulations and directives issued by different countries, which often imposes restrictions on the design of smart video surveillance systems. Even stronger limitations apply to the use of drones, which must respect tighter regulations.

Another type of challenge concerns the automation rate reached by the proposed solution. Automation is commonly required to cover and monitor wider spaces and to save human labor, but human involvement in assessment is still necessary for the trustworthiness of the designed solution. Furthermore, it should be considered that new cyber-physical threats and attacks against surveillance systems are now arising. Notice that a physical attack is often supported by a cyberattack on the video surveillance framework, which completely compromises the capacity to detect the physical assault that is happening.

The implementation of data-driven intelligence (e.g., proactive threat prediction and AI analysis) needs large amounts of data, including examples of security threats, which are very difficult to obtain. The study and design of artificial intelligence algorithms for this purpose (e.g., lightweight and easy-to-use deep neural networks) is taking its first steps, although numerous innovative start-ups with cutting-edge AI products and services are already emerging.

Facing the many new challenges described above forces developers and distributors of intelligent video surveillance solutions to comply better with standards and regulations while adopting a phased approach to deployment. This gradual process should enable a transition from manual, that is, human-mediated, systems to fully automated video surveillance based on artificial intelligence.

Overcoming the challenges we face requires a gradual implementation of data-driven intelligence, starting with simple supervised training rules and moving on to more sophisticated machine learning techniques capable of detecting more complex asymmetric attack patterns. Another important outcome that could be achieved is the implementation of open architectures capable of accommodating innovative surveillance sensors alongside older ones, so as to exploit new advanced capabilities while obtaining the best value for money.

In conclusion, all future smart video surveillance solutions may include many innovative features and functionalities, as they may employ new cutting-edge IT and artificial intelligence technologies.


### **Author details**

Pier Luigi Mazzeo National Research Council of Italy (CNR), Institute of Applied Sciences and Intelligent Systems (ISASI), Lecce, Italy

\*Address all correspondence to: pierluigi.mazzeo@cnr.it

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Porikli F, Brémond F, Dockstader SL, Ferryman J, Hoogs A, Lovell BC, et al. Video surveillance: Past, present, and now the future [DSP Forum]. IEEE Signal Processing Magazine. 2013;**30**(3):190-198

[2] Xu Z, Hu C, Mei L. Video structured description technology based intelligence analysis of surveillance videos for public security applications. Multimedia Tools and Applications. 2016;**75**(19):12155-12172

[3] Available from: https://www.marketsandmarkets.com/Market-Reports/video-surveillance-market-645.html

[4] Available from: https://www.deepmind.com/research/highlighted-research/alphago

[5] Raj P, Saini K, Surianarayanan C, editors. Edge/Fog Computing Paradigm: The Concept, Platforms and Applications. Vol. 127. Advances in Computers Series; 2022. ISBN: 9780128245064

#### **Chapter 2**

## Change Point Detection-Based Video Analysis

*Ashwin Yadav, Kamal Jain, Akshay Pandey, Joydeep Majumdar and Rohit Sahay*

#### **Abstract**

Surveillance cameras and sensors generate large amounts of data, and there is scope for intelligent analysis of the video feed being received. The area is well researched, but various challenges remain due to camera movement, jitter, and noise. Change detection-based analysis of images is a fundamental step in the processing of the video feed; the challenge is the determination of the exact point of change, enabling a reduction in the time and effort of the overall processing. Methodologies determining the exact point of change have not been explored fully, and this forms the focus of our current work. Most of the work to date in the area applies change detection methods to a pair or sequence of images. Our work focuses on the application of change detection to a set of time-ordered images to identify the exact pair of bi-temporal images or video frames about the change point. We propose a metric to detect changes in time-ordered video frames in the form of rank-ordered threshold values obtained from segmentation algorithms, subsequently determining the exact point of change. The results are applicable to a general time-ordered set of images.

**Keywords:** change point detection, time-ordered images, difference image, threshold values, segmentation algorithms

#### **1. Introduction**

Intelligent video surveillance involves the automated extraction of information related to an object or scene of interest, including detection, localization, and tracking amongst other applications. One of the earliest comprehensive efforts in this regard was undertaken by Robert T. Collins et al. [1] as part of a Defense Advanced Research Projects Agency (DARPA) Video Surveillance and Monitoring (VSAM) project. The three fundamental methods tested for moving object detection were background subtraction, optical flow, and temporal differencing. Due to the limitations of the individual methods, hybrid schemes combining them were tested: adaptive background subtraction was combined with the three-frame difference method in order to overcome the limitations of either method. The frame difference method, it may be noted, is a simple technique but suffers from the limitation that the complete shape of the detected object cannot be extracted precisely. The frame differencing method, and temporal differencing in general, makes use of a static or dynamic threshold value in order to determine a change or no-change scenario. This provides us a key to the development of the threshold as a possible metric for our current work. Change detection (CD) is related to the fundamental task of object detection, moving or static, insofar as it enables one to cull out relevant images or frames from a stack. Thus, the search space in scene analysis as part of the task of an image analyst gets reduced. This aspect is highlighted in the work by Huwer on adaptive CD for real-time surveillance applications [2]. CD enables one to detect viable changes, which can then be inputs for the subsequent object detection or tracking task. CD may be considered an elementary stage in the video analytics framework, entailing segmenting a video frame into the foreground and background. This may be considered a simple task but is an important precursor to further high-end processing. A most comprehensive recent review of deep learning framework-based CD has been carried out by Murari Mandal et al. [3, 4]. Various applications of CD as part of video analysis, including video synopsis generation, anomaly detection, traffic monitoring, action recognition, and visual surveillance, have been covered as part of that study.
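As background for the threshold-as-metric idea, the following minimal sketch shows temporal differencing between two grayscale frames with a fixed threshold. The threshold value of 25 grey levels is an illustrative assumption; adaptive schemes would vary it over time.

```python
# Minimal temporal-differencing sketch: pixels whose absolute inter-frame
# difference exceeds a threshold are marked as changed. The fixed threshold
# (25 grey levels) is illustrative only.
import numpy as np


def change_mask(prev_frame: np.ndarray, curr_frame: np.ndarray,
                thresh: float = 25.0) -> np.ndarray:
    diff = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    return diff > thresh  # boolean mask of changed pixels
```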

"*CD is the process of identifying differences in the state of an object or phenomenon by observing it at different times*" Singh [5, 6]. This standard definition of the CD process though applied to the context of remote sensing images articulates the objective and purpose clearly insofar as even video surveillance is concerned. The objective is to detect the relevant change as part of the video surveillance in form of the object or activity (phenomenon) of interest. Considering the fact that today the quantum of data in form of the video feed to be analysed by the image analyst has increased vastly in recent times, there is scope for automation in the analysis process at various levels. Determination of the exact change point (CP) within a set of video frames or sequences will reduce the workload of the image analyst by filtering in only the relevant changes that have occurred during the period of interest. This in turn shall increase the overall efficiency of the video analysis workflow by rendering the necessary automation as a useful aid to the analyst. Limited work in the domain of applied CD exists with regard to the aspect of determination of the exact point of change. This is the objective of the current work wherein we make use of the threshold of the difference image sequence based on various segmentation algorithms as a metric for the determination of the possible CP in an image sequence or video feed. Malek Al Nawashi et al. [7] have made use of the simple temporal differencing approach along with a threshold function in order to determine the moving image as part of their work on abnormal human activity detection in an intelligent video surveillance system. Thus, there is a scope to apply the image difference approach in order to determine the point of change while subsequently overcoming its limitation in terms of the inability to detect the complete target shape [1].

CP detection has been studied in time series data analysis. In the context of remote sensing images, as a sample case from an image processing perspective, Militino et al. [8] have recently (2020) carried out a very comprehensive survey of the various methods and tools available for CP detection. They infer that the methods applied to time series data may be applied in the context of time-ordered satellite images and image processing as well. We would like to extend this notion to the case of image processing as applied to video analytics in general. Amongst the techniques studied, the nonparametric approach is a viable option given that abrupt changes are likely to occur in a video sequence at any point in time, rendering it difficult for an underlying Bayesian or model-based approach to be followed. The nonparametric approach is applicable to a wider variety of problems in CP detection since no assumption is made regarding any underlying model, as surmised by Samaneh Aminikhanghahi et al. [9] in their comprehensive survey on CP detection methods for general time series data. That study points out that the inferences are applicable to the domain of image analysis as well. The nonparametric approach has also been analysed by Murari et al. [3, 4] as part of their comprehensive survey of DL-based CD methods.

One of the few studies on the CP detection approach in a time-ordered set of images is that carried out by Manuel Bertoluzza et al. [10]. The objective of their work was to determine an accurate CD map between a selected pair of images amongst a time-ordered series of images by representing the changes along a temporal closed loop as binary sequences. In order to analyse the consistency of changes determined within a closed loop, the notion of a binary change variable was introduced. In our opinion, the use of this metric in order to compare the changes and finally achieve the desired accuracy is a novel idea. Though this step improves accuracy in existing methods of CD, the important question of determining when a change has occurred, that is, the CP, still remains unanswered. The answer will enable efficient filtering of the video frames down to a select few in the form of image pairs about the respective CPs. The likely object or phenomenon of interest lies amongst these image pairs or frames. This can be a primary step yielding increased speed of processing within the overall intelligent video surveillance framework.

Based on the above discussion, the objective of our study is to find a simple and robust method to determine the CP within a set of segmented video frames forming part of a video surveillance feed. A change variable or metric [10], in the form of the threshold based on the difference image pairs, is utilized to determine the point of change from amongst a set of images or frames. Rank ordering the changes based on the thresholds enables second or third CP detection as deemed fit by the image analyst. Nonparametric methods are more robust, making no assumptions about the underlying model structure [9], and amongst these, Pettitt's approach [11] is a simple and widely used technique. Our proposal for the change metric is similar to Pettitt's.
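For orientation, Pettitt's rank-based statistic [11] for a sequence $x_1, \ldots, x_N$ locates the change point at the index maximizing the absolute value of a cumulative sign statistic; in our method, the segmentation thresholds of the difference images play the role of the $x_i$. This is the standard formulation, stated here for reference:

```latex
% Pettitt's nonparametric change-point statistic (standard form)
U_{t,N} = \sum_{i=1}^{t}\sum_{j=t+1}^{N} \operatorname{sgn}(x_i - x_j),
\qquad
\hat{t} = \arg\max_{1 \le t < N} \left| U_{t,N} \right|
```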

The main contributions of the work are the following: 1) determination of a suitable CD metric based on a comparative analysis of various segmentation methods, including the Otsu, K-means-based (denoted by ISODATA), and minimum cross-entropy threshold (MCE) methods; 2) application of the CD metric to CP detection within the set of segmented video frames; and 3) a proposed framework to apply the CD metric-based CP concept to the intelligent video surveillance problem.

The chapter is organized as follows. Subsection 1.1, after this introduction, covers the data set. Section 2 describes the basic CP detection algorithm based on the change metric concept. Section 3 discusses the results obtained based on a comparative analysis of the respective CD metrics obtained from the four segmentation algorithms tested. Section 4 describes the proposed application of the results to the intelligent video surveillance framework.

#### **1.1 Data set description**

Most CD open-source data sets are in the form of image pairs, as the objective is the application and testing of specific CD methods or algorithms on them. In order to achieve the objective of the current work, there is a need for a time-ordered image data set. For this purpose, Google Earth-based time-ordered satellite image data sets of specific locations, sourced from open-source data [12], have been used and customized for testing purposes. The satellite image data set has viability for automation in terms of information extraction by the image analyst, which is currently being done manually; hence the choice of this data set for developing the results of the study. However, it is worth mentioning that the results obtained subsequently can be well applied to a general image processing scenario including video analysis. Google Earth images are a valid source of satellite imagery used for research purposes, as evinced in work such as Urška Kanjir et al.'s survey [13]. The sample data set is shown in **Figures 1**–**3**. Out of the time-ordered data set of 19 images, the relevant point of change is that between the fourth and fifth image (refer to the red arrow in **Figure 1**), when the object of interest or change first appeared. The testing has been carried out on 10 such sets, with the object appearing at an instance within the data set, which denotes the point of change. The spatial resolution of the data set is as per the standard Google Earth platform (≥5 cm), with each image corresponding to an area of 12 x 12 km on the ground. The average temporal resolution of the 10 data sets of images was 10–15 years, calculated between the first and last set of images.

CDNET2014 [14] is another standard open-source data set for testing various CD algorithms based on static images and video sequences. We make use of this data set to demonstrate a more general application of the algorithm and analyse the results on a test case along with those obtained for the above cases (**Figures 1**–**3**). The data set sample pertains to the intermittent object motion category and depicts a parking lot with a man entering the scene at a certain point (frame number 57). **Figures 4** and **5** show the sample data set, which actually consists of 2500 frames forming part of a video feed, in which testing is carried out on a selected number of frames (e.g. 80). The objective is to detect the point of change, which is at the point of entry of the individual. As can be observed, the changes between respective frames are extremely minor and difficult to detect, as they belong to a video recording. Application of the various segmentation algorithms such as Otsu, MCE, and ISODATA to the Google Earth and CDNET2014 data sets, and the analysis therein, shall enable the selection of a suitable method accordingly.

**Figure 1.** *Sample data set sequence 1.*

**Figure 2.** *Sample data set sequence 2.*

#### **2. Concept: Change point detection**

#### **2.1 Background**

CP detection in time series is a well-researched area, with a comprehensive survey of the various methods carried out by Aminikhanghahi et al. [9]. The application areas include medical condition monitoring, climate change monitoring, speech recognition, and image analysis. CP detection in image analysis is the least researched area, and our endeavour in the current work is to apply the useful lessons learned in the time series approach to image or video analysis. CP detection in time series is much simpler compared with image or video analysis, considering that the numeric values to be compared are easily extracted from the data itself. CP detection in image or video analysis requires the determination of a suitable change metric to be applied in a framework similar to that of time series, in order to obtain the same benefits in this case. Trend and CP detection in remote sensing has been well studied and classified by Militino et al. [8]. Nonparametric methods are robust and applicable to a larger variety of problems compared with parametric methods, since changes in phenomena or objects may be arbitrary, not following any pattern or model. Amongst nonparametric methods, Pettitt's method [11] is a well-established and applied method. We take a cue from this approach, wherein the random variables forming part of the test hypothesis are substituted by the respective threshold values of the difference image sequences in order to determine the CP, as explained below.

**Figure 3.** *Sample data set sequence 3.*

A suitable change variable or metric for the determination of the maximum CP in a time-ordered image set is the set of threshold values obtained from image segmentation of the difference images of the pairs of images. Subject to a minimum-change or no-change scenario between images, there will be minimum or no variation amongst the respective threshold values in the set. The rationale of this premise is that any change in the sequence of images shall result in a variation in pixel values. This variation can be directly captured in the form of a variation in the threshold values of the segmented image as per the different algorithms applied. The Otsu binary segmentation algorithm [15] is a standard segmentation algorithm, along with Li's information-theoretic MCE threshold method [16] and Coleman's K-means clustering image segmentation algorithm [17]. The threshold values determined by these algorithms, along with a mean threshold method, are proposed to be used as the change variable or metric for the determination of the CP in the time-ordered image sequence.

**Figure 4.** *CDNET2014 data set result (MCE).*

**Figure 5.** *CDNET2014 data set result (Otsu).*

**Figure 6.** *Proposed basic CD framework.*

The methodology is thus based on thresholding (*via* application of the respective segmentation algorithms) of the image difference sequences constituting the image set. The point of maximum change is determined by the maximum value amongst the threshold sequence of the image difference sequence. The algorithm is described in steps in the next section and illustrated in **Figure 6**.

#### **2.2 Steps**

Let us consider the set of time-ordered images or video frames as *T* = {*I*<sub>*n*</sub> | *n* ∈ [1, *N*]}, where *N* is the total number of images being processed. The objective is to select the pair of images that precisely defines the maximum CP and, further, to rank order the images in decreasing relevance of CPs. This will assist the image analyst in sifting the images so as to determine the exact point of change while analysing the phenomenon or object of interest, enabling timely and efficient analysis of the time-ordered image sequences or video frames. The steps are as follows:


5. Based on the index of the CP, the corresponding image pair may be processed further to extract information as desired by the image analyst.
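The following is a minimal Python sketch of the threshold-metric procedure described in this section, assuming grayscale frames supplied as NumPy arrays and using scikit-image's implementations of the Otsu, Li (MCE), ISODATA, and mean thresholds; the function and variable names are ours, not from the chapter.

```python
# Sketch of the threshold-based CP detection described above, assuming
# grayscale frames supplied as a list of 2-D NumPy arrays.
import numpy as np
from skimage.filters import (threshold_isodata, threshold_li, threshold_mean,
                             threshold_otsu)

METHODS = {"otsu": threshold_otsu, "mce": threshold_li,
           "isodata": threshold_isodata, "mean": threshold_mean}


def change_points(frames, method="mce"):
    """Return candidate CP indices, rank-ordered by threshold value.

    Index k means the change is located between frames[k] and frames[k + 1].
    """
    thresh_fn = METHODS[method]
    # Difference images of consecutive pairs.
    diffs = [np.abs(frames[k + 1].astype(np.float64) - frames[k].astype(np.float64))
             for k in range(len(frames) - 1)]
    # Segmentation threshold of each difference image (the change metric).
    metric = np.array([thresh_fn(d) for d in diffs])
    # Rank order: the maximum is the primary CP, the second highest the
    # secondary CP (category 2 evaluation), and so on.
    return np.argsort(metric)[::-1], metric
```

For the data set of **Figure 1**, a correct (category 1) detection would correspond to the first returned index pointing between the fourth and fifth images.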

#### **3. Results and analysis**

#### **3.1 Results**

The methodology and steps described in Section 2 have been applied to 10 data sets of the type described in **Figures 1**–**3**, and the results obtained are displayed in **Tables 1** and **2**, respectively. **Table 1** pertains to the category 1 evaluation, wherein no margin for error is permitted and a valid detection is counted if, as per the ground truth, the CP is detected based on the maximum threshold value of the segmented difference image sequence. This is in keeping with the requirements of the algorithm. It is also possible that, owing to pixel value variations due to noise, in certain cases the precise point of change is captured not by the maximum threshold value but by the second highest or a subsequent one. Corresponding to this relaxation (valid detection considered up to the second highest threshold value), the results are re-evaluated and presented in **Table 2** as category 2.

**Table 1.** *ROC metrics: category 1.*

**Table 2.** *ROC metrics: category 2.*

The standard Receiver Operating Characteristic (ROC) metrics of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are applicable to the current methodology as well, with a slight modification. A correct detection in the form of a TP corresponds to a TN as well, since we are interested only in the detection of the correct image pair and not in the number of targets detected in a particular image as in standard applications. Similarly, in case the correct image pair is not detected, a FP occurs that corresponds to a FN as well. Recall as per the standard definition, Recall = TP/(TP + FN), represents the number of valid targets correctly detected. Precision as per the standard definition, Precision = TP/(TP + FP), gives the quality of the detection in terms of the correct number of targets detected with a minimum of FPs. The F1 score as per the standard definition, F1 = (2 \* Precision \* Recall)/(Precision + Recall), represents the degree of balance obtained between precision and recall. In the present case, for the reasons aforesaid, that is, FP and TP being coincidental with FN and TN, respectively, Recall, Precision, and F1 score will all give the same values. Hence, for ease of assimilation by the reader and the analysis therein, we mention only Recall. Certain applications such as military target detection call for a high degree of recall compared with precision, that is, a minimum or no target-miss scenario wherein one is ready to compromise to a certain extent on precision vis-à-vis recall.

**Figure 7.** *Threshold plot: Google Earth sample data set.*

The threshold plot corresponding to the sample image sequence (refer to **Figures 1**–**3**) is presented in **Figure 7**. The threshold plot displays the variation in values corresponding to the respective threshold methods. As is visible, the point of change is correctly detected between the fourth and fifth images, in agreement with the ground truth (refer to the red arrow in **Figure 1**).

The CDNET2014 data set sample results are shown in **Figures 4** and **5**, respectively, with the corresponding plot displayed in **Figure 8**. It is observed that the cross-entropy-based MCE method is able to detect the point of change accurately (refer to **Figure 4**), in comparison with the Otsu method (refer to **Figure 5**).

#### **3.2 Analysis of results**

Based on the results, the following are the relevant deductions:


#### **4. Proposed framework: CP detection in video analysis**

#### **4.1 Case I: Static format**

Based on the basic CD concept described in Section 2 and the results obtained in Section 3, we describe two formats for implementation as part of the intelligent video analysis framework. The current description pertains to the static format case (refer to Case I in **Figure 9**), wherein only a limited number of video frames or images are received and required to be analysed. In this scenario, the determination of the important CPs, and in turn the filtering of probable objects or phenomena, is based on the basic CD framework described in Section 2; the steps remain the same as described in subsection 2.2. **Figure 6** is a diagrammatic description of the concept, which is further modified for the video surveillance case as in **Figure 9**. The modification is the addition of a level 1 or level 2 processing element in the form of a basic segmentation algorithm or an object detection algorithm. The level 1 processing scenario entails application of a segmentation algorithm, as used for change metric determination, to the difference images or image pairs about the CP. In case of a search for a specific category of target, a level 2 processing step in the form of an object detection algorithm may be applied. A level 1 processing step applies the same segmentation algorithm (e.g. MCE) that has been used to determine the CD metric. This ensures full exploitation of the notion of a segmentation algorithm in terms of its capability to distinguish or partition a scene into foreground and background [3, 4]. The foreground is likely to contain the phenomenon or object of interest. By filtering the entire set of images or video frames received down to a likely pair of images, the overall processing time, and the effort on the part of the image analyst, will definitely be reduced. The steps described in subsection 2.2 are applicable in the current case too and are as follows:

**Figure 9.** *Proposed CD framework for video analytics.*


#### **4.2 Case II: Fixed/moving calibration window format**

This format is applicable when a continuous feed of video frames is being received for analysis in a fixed or moving camera scenario. The fixed window implies application of a calibration module over the first set of frames (refer to the red box titled "First frame set" in **Figure 9**). As part of the calibration module, the corresponding thresholds of the difference image sequences are determined. Once all calibration frames are received, the minimum and maximum thresholds corresponding to the segmented difference images are determined. The premise of employing a calibration module is to capture the background model, in the form of the thresholds of the successive difference images, prior to the system being applied in a live scenario. The live scenario pertains to the actual phase of application, wherein the information regarding the object or scene of interest is to be captured. Thus, in order to analyse the environment or background where the fixed or moving camera is employed, the calibration module captures the background information, that is, the no-change scenario. Once the thresholds of successive difference images are captured as part of the calibration module, a subsequent difference image threshold lying outside the range of those of the calibration module is indicative of a probable CD scenario. The yellow rhombus in **Figure 9** indicates this decision. The issues of false triggering are likely to be reduced, as minor variations in the scene are captured as background by the calibration module prior to the live phase. The steps for the fixed calibration window are as follows:


The moving window concept is similar to the static case, with the difference that the corresponding maximum and minimum threshold values vary as per the shifting window or set of frames over which calibration is carried out. In this case, the problem of dynamically changing scenarios, such as vehicles starting and stopping abruptly, is addressed. In such cases the background needs to be dynamically updated, for which adaptive algorithms have been proposed [1]. However, the CD metric is a powerful concept, which in the current scenario is representative of the background, static or dynamic, as captured by the calibration module. In an envisaged scenario wherein the dynamic variation of the background continues for a longer period, the moving window calibration module is applied to overcome these problems. Here, the threshold ranges detected over a fixed calibration frame set within the static format are varied over the sequences of frames being captured. The moving window calibration frames are depicted *via* the dashed lines in **Figure 9**. As the video frames are received, the set of thresholds corresponding to the calibration module is captured over the latest set of video frames at a pre-decided interval (corresponding to the anticipated degree of dynamism in the background). Thus, the range of threshold values of the calibration module will be shifted over the next set of, say, *n* video frames, thereby capturing the latest background in order to detect corresponding changes in subsequent frames. The steps for the moving calibration window are as follows:

1. Determine the image difference sequence for, say, the first *n* images forming part of the calibration frame set as *T*<sub>diff</sub> = {*I*<sub>1</sub> − *I*<sub>2</sub>, *I*<sub>2</sub> − *I*<sub>3</sub>, …, *I*<sub>*n*−1</sub> − *I*<sub>*n*</sub>}.


The limitation of the simple frame differencing method, namely its inability to recover the complete shape of a detected target [1], is overcome in our proposed framework by applying a level 1 or level 2 processing step after detection of the CP, as shown in **Figure 9**. Thus, once the point of change is detected, further application of, say, level 2 processing will enable determination of the complete shape of the intended target.
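The following Python sketch illustrates the calibration-window decision logic described above (fixed window shown; the moving window simply re-runs the calibration over the latest *n* frames at a pre-decided interval). Li's MCE threshold from scikit-image is used here as the CD metric; the function names and the margin parameter are illustrative assumptions, not the chapter's implementation.

```python
# Sketch of the fixed calibration-window logic: thresholds of the first n
# difference images define a [min, max] "background" range; a later difference
# image whose threshold falls outside this range signals a probable change.
import numpy as np
from skimage.filters import threshold_li


def _diff_threshold(a: np.ndarray, b: np.ndarray) -> float:
    """CD metric: segmentation threshold of the difference image of two frames."""
    return threshold_li(np.abs(b.astype(np.float64) - a.astype(np.float64)))


def calibrate(frames):
    """Threshold range of the difference images of the calibration frames."""
    t = [_diff_threshold(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
    return min(t), max(t)


def is_change(prev_frame, curr_frame, t_min, t_max, margin=0.0):
    """Flag a probable CD scenario (the yellow rhombus decision in Figure 9)."""
    t = _diff_threshold(prev_frame, curr_frame)
    return t < t_min - margin or t > t_max + margin
```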

#### **4.3 Implementation issues**

The CP detection-based methodology proposed for video analysis, as described above in subsections 4.1 and 4.2, respectively, is a simple adaptation of the CP-based approach. The advantage of the approach as adapted for video analysis is that it is simple and independent of the time-ordered set of video frames being received. Both offline (refer to subsection 4.1) and online (refer to subsection 4.2) options of implementation exist, and it is a nonparametric approach, making no assumptions regarding the underlying model. The change metric is a single value derived in a simple manner, independent of any probabilistic methodology. Thus, the approach, being nonparametric, is applicable to a large number of scenarios, since no assumptions are made regarding any specific scenario. The methodology is unsupervised, not requiring any training data as in the case of many deep learning or machine learning-based approaches.

Thus, the speed of implementation will be inherently higher in our case. The challenge in applying the proposed method is that it will initially require a certain amount of testing and fine tuning in conjunction with an image analyst (for checking the performance of the algorithm). Factors such as the number of calibration frames, that is, the window size for determination of the CD metric, will require fine tuning and innovation during the implementation stage. The basic CP framework described in Sections 2 and 3 was executed in Python code, and the adaptation for the video analysis framework described in the current section may follow suit. The architecture described in **Figure 9** is simple and flexible and may hence be modified suitably as per the results obtained during the implementation stage.

#### **4.4 Comparison with the state of the art (SOA) in intelligent video surveillance**

The current focus of the SOA in the field of video surveillance is primarily on specific application scenarios, as described in the comprehensive review by Guruh Fajar Shidik et al. [19]. Intelligent video surveillance includes anomaly detection, object detection, and target tracking, amongst other applications, which could apply a CD algorithm component as an important precursor step. It is worth noting that the CP detection concept as described in Sections 2 and 3, covering the application to the video analysis framework, has not been well researched; hence, a valid comparison with an equivalent method in the context of video analytics does not exist. The closest semblance to the proposed method based on the CD concept is the discriminative framework for anomaly detection proposed by Allison Del Giorno et al. [20]. Their method endeavours to overcome the limitations of existing anomaly detection methods, namely the requirement of training data and the dependence on temporal ordering of the video data. It is based on a nonparametric technique inspired by density ratio estimation for CP detection. The approach is novel and similar to our proposed method in terms of its nonparametric nature, wherein no assumptions are made about the underlying model. Further, the method proposed by Allison et al. does not require training data and is unsupervised, similar to our case. They endeavour to use a metric- or score-based approach in order to determine anomaly points in a video sequence independent of the ordering of the video frames. However, the method does require an input of the features to aid in distinguishing the anomalies. It may be noted that the methodology proposed in our case is much simpler, wherein no such feature set description is required to determine the CP, and a single metric in the form of the threshold of the image difference pairs is sufficient. This metric-based approach makes our method simple and fast. Moreover, the CP concept is robust and adaptable to an anomaly detection framework. Thus, our method is simpler than the approach proposed by Allison Del Giorno et al. [20], which ultimately utilizes a probabilistic approach to determine the metric used to detect the anomaly points. The proposed CP-based video analysis methodology may be considered a primary step in the intelligent video analysis framework, prior to the application of subsequent steps, and a potential field for research. This analysis is, to the best of the authors' knowledge, the most relevant possible comparison with the SOA. A thorough review of the existing CD methods in other areas, such as time series analysis and remote sensing, has already been covered as part of the literature review in Section 1. Thus, Section 1 and the current subsection comprehensively cover all aspects of the proposed method and its positioning vis-à-vis other areas of research.

#### **5. Conclusion**

To the best of the authors' knowledge, this is the only study on CP detection in respect of image processing in particular, as applicable to video surveillance as well. Important results have been obtained, with the best method determined to be the cross-entropy-based MCE, followed by Otsu and ISODATA thereafter. The image difference-based CD metric method is by no means limited to time-ordered sets of images of the kind represented in **Figures 1**–**3**. The method has also been applied to a selected CDNET2014 data set, as displayed in **Figures 4** and **5**. It may be noted that the sequence of images taken from the CDNET2014 data set is originally part of a video sequence; hence, the results demonstrated in Section 3 (refer to **Figures 4** and **5**) are well suited to be applied to a video surveillance scenario. Thus, formulating the method in a sliding window format will enable application to video surveillance scenarios, including suspicious activity detection. The block diagram of the proposed application of the CD concept is displayed in **Figure 9**, and the proposed methodology has been described in detail in Section 4. The scope of possible applications is by no means limited to these two cases. In summary, the CD metric methodology in the form of the threshold value needs to be exploited in an innovative manner, and alternate change variable metrics may be a good area for further research. The objective of the current work has been to answer the important questions of where the change lies, and when it has occurred, in a time-ordered set of images. This is important as a precursor for pin-pointed analysis of the images about the detected point of change, as proposed in Section 4.

The level 2 processing in **Figure 9** may also be in an Object Based CD (OBCD) framework [21]. Alternate options for processing the images detected about the CP may be considered part of future research scope.

### **Author details**

Ashwin Yadav<sup>1</sup>\*, Kamal Jain<sup>1</sup>, Akshay Pandey<sup>1</sup>, Joydeep Majumdar<sup>2</sup> and Rohit Sahay<sup>3</sup>

1 Department of Civil Engineering, Indian Institute of Technology, Roorkee, India

2 Department of Mechanical Engineering, Indian Institute of Technology, Indore, India

3 Department of Computer Science Engineering, Indian Institute of Technology, Kharagpur, India

\*Address all correspondence to: ashwiny77@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Collins R, Lipton A, Kanade T, Fujiyoshi H, Duggins D, Tsin Y, et al. A System for Video Surveillance and Monitoring. Tech. Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University; May 2000

[2] Huwer S, Niemann H. Adaptive change detection for real-time surveillance applications. Proceedings Third IEEE International Workshop on Visual Surveillance. July 2000. pp. 37-46. DOI: 10.1109/VS.2000.856856

[3] Mandal M, Vipparthi SK. An empirical review of deep learning frameworks for change detection: Model design, experimental frameworks, challenges and research needs. IEEE Transactions on Intelligent Transportation Systems. July 2022;**23**(7):6101-6122. DOI: 10.1109/TITS.2021.3077883

[4] Lu D, Mausel P, Brondízio E, Moran E. Change detection techniques. International Journal of Remote Sensing. 2004;**25**(12):2365-2401. DOI: 10.1080/0143116031000139863

[5] Singh A. Review article: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing. 1989;**10**(6):989-1003. DOI: 10.1080/01431168908903939

[6] İlsever M, Ünsalan C. Two-Dimensional Change Detection Methods: Remote Sensing Applications. Springer Publishing Company, Incorporated; 2012. ISBN: 978-1-4471-4254-6

[7] Al-Nawashi M, Al-Hazaimeh OM, Saraee M. A novel framework for intelligent surveillance system based on abnormal human activity detection in academic environments. Neural Computing and Applications. 2017;**28**(1):565-572. DOI: 10.1007/s00521-016-2363-z

[8] Militino AF, Moradi M, Ugarte MD. On the performances of trend and change-point detection methods for remote sensing data. Remote Sensing. 2020;**12**(6):1008. DOI: 10.3390/rs12061008

[9] Aminikhanghahi S, Cook DJ. A survey of methods for time series change-point detection. Knowledge and Information Systems. May 2017;**51**(2):339-367. DOI: 10.1007/s10115-016-0987-z. Epub 2016 Sep 8. PMID: 28603327. PMCID: PMC5464762

[10] Bertoluzza M, Bruzzone L, Bovolo F. A novel framework for bi-temporal change detection in image time series. 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). 2017. pp. 1087-1090. DOI: 10.1109/IGARSS.2017.8127145

[11] Pettitt AN. A non-parametric approach to the change-point problem. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1979;**28**(2):126-135. DOI: 10.2307/2346729

[12] Available from: http://climateviewer.org/history-and-science/government/maps/surface-to-air-missile-sites-worldwide

[13] Kanjir U, Greidanus H, Oštir K. Vessel detection and classification from spaceborne optical images: A literature survey. Remote Sensing of Environment. 2018;**207**:1-26. ISSN 0034-4257. DOI: 10.1016/j.rse.2017.12.033

[14] Wang Y, Jodoin P-M, Porikli F, Konrad J, Benezeth Y, Ishwar P. CDnet 2014: An expanded change detection benchmark dataset. IEEE CVPR Change Detection Workshop. June 2014. p. 8. (hal-01018757)

[15] Otsu N. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics. January 1979;**9**(1):62-66. DOI: 10.1109/TSMC.1979.4310076

[16] Li CH, Lee CK. Minimum cross entropy thresholding. Pattern Recognition. 1993;**26**(4):617-625. DOI: 10.1016/0031-3203(93)90115-D. ISSN 0031-3203

[17] Coleman GB, Andrews HC. Image segmentation by clustering. Proceedings of the IEEE. 1979;**67**(5):773-785. DOI: 10.1109/PROC.1979.11327

[18] Malik MM, Spurek P, Tabor J. Cross-entropy based image thresholding. Schedae Informaticae. 2015;**24**:21-29. DOI: 10.4467/20838476SI.15.002.3024

[19] Shidik GF, Noersasongko E, Nugraha A, Andono PN, Jumanto J, Kusuma EJ. A systematic review of intelligence video surveillance: Trends, techniques, frameworks, and datasets. IEEE Access. 2019;**7**:170457-170473. DOI: 10.1109/ACCESS.2019.2955387

[20] Del Giorno A, Bagnell JA, Hebert M. A discriminative framework for anomaly detection in large videos. In: Leibe B, Matas J, Sebe N, Welling M, editors. Computer Vision – ECCV 2016. Lecture Notes in Computer Science. Vol. 9909. Springer, Cham; 2016. DOI: 10.1007/978-3-319-46454-1_21

[21] Hussain M, Chen D, Cheng A, Wei H, Stanley D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS Journal of Photogrammetry and Remote Sensing. 2013;**80**:91-106. ISSN 0924-2716

#### **Chapter 3**

## Spaceborne Video Synthetic Aperture Radar (SAR): A New Microwave Remote Sensing Mode

*Jian Liang and Liang An*

#### **Abstract**

Traditional microwave remote sensing systems acquire only transient, "picture"-like information, which brings some shortcomings for the detection of moving targets and the long-time monitoring of time-varying scenes over a region of interest. As a new imaging mode, video Synthetic Aperture Radar (SAR) has attracted more and more scholars and agencies because it can provide continuous surveillance over the region of interest. Spaceborne video SAR has corresponding advantages over both spaceborne SAR imaging systems and optical video systems. The working principles, imaging algorithm, and application methods of spaceborne video SAR are presented in this chapter. First of all, a theoretical system of spaceborne video SAR is constructed; the operation and application modes are defined, and some key performance factors are discussed. To meet the demands of video SAR applications, an imaging algorithm is proposed for processing spaceborne video SAR data. Experiments on simulated data show that the algorithm is effective.

**Keywords:** video SAR, working principles, imaging algorithm, moving target detection, parameter estimation

#### **1. Introduction**

Video SAR (synthetic aperture radar) is a new imaging mode that can provide continuous surveillance over a region of interest [1]. Compared with traditional SAR imaging, the SAR image stream of spaceborne video SAR is acquired by rapid imaging over a short time. Due to its dynamic information acquisition ability, video SAR is more suitable for the observation of moving targets and time-varying scenes [2–8]. The main work and contributions of this chapter are summarized as follows.

1. A theoretical system of spaceborne video SAR is constructed. Based on the traditional spaceborne SAR system, the concept of spaceborne video SAR is introduced. The operation and application modes are also defined. Some key performance factors, such as the resolution, the imaging duration of the different operation modes, and the division method of the raw data, are discussed.

2. To meet the demands of video SAR applications, such as high accuracy and efficient computation, an imaging algorithm is proposed for processing the spaceborne video SAR data. The image formation algorithm avoids duplicated processing, improving the computational efficiency. Finally, experiments on simulated data show that the proposed algorithm is effective.

This chapter addresses some key technologies in the construction and application of a spaceborne video SAR system. The research results can serve as advice for building such a system.

#### **2. The theory of spaceborne video SAR**

#### **2.1 SAR imaging**

Synthetic aperture radar is a two-dimensional high-resolution imaging radar whose high resolution in the range direction is achieved by transmitting a linear frequency modulated (LFM) signal followed by matched filtering. The high azimuth resolution is achieved by using the relative motion between the radar and the target to form an equivalent large aperture [9, 10].
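For orientation, the textbook resolution relations implied by this description are shown below, where $c$ is the speed of light, $B$ the transmitted LFM bandwidth, and $D$ the along-track antenna length; these are standard strip-map expressions, not values specific to this chapter.

```latex
% Slant-range resolution after pulse compression of an LFM signal of bandwidth B
\rho_r = \frac{c}{2B}
% Strip-map azimuth resolution with a real antenna of along-track length D
\rho_a \approx \frac{D}{2}
```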

#### **2.2 Spaceborne video SAR**

Broadly speaking, video is generally defined as a number of linked images played continuously at a certain frequency, which forms a moving image.

Narrowly defined, video is generally used in movies or television and refers to continuous image changes at more than 24 frames per second. According to the principle of persistence of vision, the human eye cannot distinguish the single static pictures, which therefore look like a smooth, continuous visual effect; such a continuous sequence of pictures is also called video.

The U.S. Defense Advanced Research Projects Agency (DARPA) has given a preliminary definition of video SAR: the technology that displays a series of SAR images reflecting the continuous changes of a target or scene at a fixed frame rate is called video SAR [6].

#### *2.2.1 The operating mode of spaceborne video SAR*

In the application of spaceborne video SAR, multiple SAR images of the same scene are acquired while the effective observation time is extended as much as possible. For airborne platforms, circular-track SAR is generally used to realize long-time observation of the scene, while for spaceborne platforms the observation time can be effectively extended through reasonable orbit design to realize video observation.

#### 1. GEO video SAR

GEO SAR satellites can form a near-circular sub-satellite trajectory through orbit design, providing the possibility of GEO circular-track video SAR, which can effectively extend the observation time while realizing a stare at fixed scenes.


**Figure 1.** *GEO SAR satellite circular trajectory video SAR sub-satellite point.*

For geosynchronous orbit satellites, circular-track SAR for video imaging can be achieved by making the sub-satellite point trajectory circular. The design method is to control the north–south drift of the satellite to be equal to the east–west drift: the north–south drift is determined by the orbit inclination, the east–west drift is determined by the eccentricity, and the ascending node determines the longitude of the circular track center. Thus, circular-track video observation can be realized by a reasonable orbit design. The sub-satellite point trajectory during GEO circular-track video observation is shown in **Figure 1**.
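Under the usual first-order approximation for a geosynchronous orbit with small inclination $i$ and small eccentricity $e$, the ground track has a north-south amplitude of about $i$ and an east-west amplitude of about $2e$ (in radians), so balancing the two drifts for a near-circular track gives the condition below. This is a standard first-order result, quoted here as an assumption rather than a derivation from the chapter.

```latex
% Near-circular geosynchronous ground track: north-south amplitude (~ i)
% matched to east-west amplitude (~ 2e)
i \approx 2e
```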

2. Spotlight video SAR

To prolong the video SAR imaging time, a low-orbit video SAR satellite can work in the large-angle staring spotlight mode, in which the satellite uses its large-angle azimuth sweep capability to achieve a long observation time over the scene. Conventional spotlight SAR improves the azimuth resolution by extending the observation time, whereas video SAR achieves multiple imaging by reducing the azimuth resolution of each single image frame. Video imaging can be achieved by reasonably segmenting the echo data throughout the imaging time; the geometric schematic of this imaging mode is shown in **Figure 2**.

**Figure 2.** *The geometric schematic of spotlight video SAR.*
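The trade-off between frame rate and single-frame resolution can be made explicit with the standard spotlight relation below, where $\lambda$ is the radar wavelength and $\Delta\theta$ the azimuth aperture angle spanned by one frame: allocating a smaller $\Delta\theta$ per frame (more frames per unit time) directly coarsens the azimuth resolution of each frame.

```latex
% Single-frame azimuth resolution over an azimuth aperture angle Delta-theta
\rho_a \approx \frac{\lambda}{2\,\Delta\theta}
```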

#### 3. Sliding spotlight video SAR

In order to increase the azimuth width of the video imaging scene, the video SAR can also operate in sliding spotlight mode; the geometric schematic of this imaging mode is shown in **Figure 3**. The satellite uses its agile maneuvering capability to perform sliding-beam imaging of the observed scene, performs an attitude maneuver after the first image is completed in order to acquire the second video frame, and repeats the imaging until the satellite's attitude maneuvering capability or observable range is exceeded, thus forming a video SAR image sequence.

**Figure 3.** *The geometric schematic of sliding spotlight video SAR.*

#### *2.2.2 The applications of spaceborne video SAR*

Spaceborne video SAR actually produces a sequence of SAR images of the same target area with a high update frequency, and the applications based on video SAR images mainly include the following aspects [11–13].

#### 1. Multi-aspect observation of the target

For spaceborne video SAR, different video frames observe the target from different aspects, so the set of video frames achieves a multi-aspect observation of the target. The SAR images of the target under different observation aspects can fully describe the characteristic information of the target, which is of great significance for target identification and confirmation.

#### 2. Suppressing the coherent speckle of SAR images.

Coherent speckle arises when the sub-echoes of multiple scattering points within a resolution cell superimpose constructively or destructively, producing dotted bright or dark areas in SAR images. The traditional suppression method uses multi-look processing to achieve non-coherent superposition of multi-look images. In video SAR products, effective speckle suppression can likewise be achieved by non-coherent superposition of multiple frames.
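As a minimal illustration of this non-coherent superposition, the sketch below assumes a stack of co-registered complex video SAR frames in a NumPy array; the array layout is an assumption made for the example, not a format defined in this chapter.

```python
import numpy as np

def multiframe_speckle_suppression(frames: np.ndarray) -> np.ndarray:
    """Non-coherent superposition of co-registered video SAR frames.

    frames: complex (or magnitude) images, shape (n_frames, rows, cols).
    Averaging intensities |.|^2 reduces speckle variance roughly by
    1/n_frames for independent looks, at the cost of temporal resolution.
    """
    intensity = np.abs(frames) ** 2          # non-coherent: discard phase
    return np.sqrt(intensity.mean(axis=0))   # average looks, return amplitude
```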

#### 3. Continuous monitoring of scenes and targets [14–18]

Existing SAR moving-target detection technology does not provide continuous video monitoring of hotspot areas. ATI- and DPCA-based moving-target detection suffers from minimum detectable velocity and blind speeds, and the estimation of target motion parameters is ambiguous, which makes the localization and imaging of moving targets challenging. Video SAR effectively extends the information into the time dimension, so moving-target detection and motion-parameter estimation can exploit the change information between frames. After moving targets are located and imaged, video SAR products can intuitively display motion information such as the position, velocity, and motion trend of moving targets within a stationary scene.

#### *2.2.3 Analysis of imaging duration of spaceborne video SAR*

The imaging duration of spaceborne video SAR is mainly constrained by the following factors; the smallest of the durations permitted by these factors determines the achievable imaging duration.

#### 1. Incidence angle constraint

The variation of the incidence angle affects the ground-range resolution, and the incidence angle also affects the radar observation distance. In general, the backscattering cross section decreases as the incidence angle increases. From the perspective of the returned energy, the incidence angle should be chosen as small as possible within the applicable range of scattering theory, but different applications demand different incidence angles. Moreover, extending the incidence-angle range increases the difficulty of system implementation. Considering these factors, the incidence-angle constraint must be satisfied during spaceborne video SAR imaging.

#### 2. The beam azimuth sweeping capability constraint

The imaging duration of spaceborne video SAR is also constrained by the beam sweeping capability of the satellite. For phased-array antennas, the antenna gain drops sharply during large-angle sweeping and dispersion effects occur, degrading image quality, whereas a reflector antenna retains a stable antenna pattern and high gain during large-angle sweeping. For the reflector antenna, the azimuth sweeping capability is mainly limited by the maneuverability of the agile platform. The sweeping capability determines the video imaging duration: the larger the sweeping angle, the longer the video imaging time over which the incidence angle still meets the observation requirements.

#### 3. Squint angle constraint

When the squint angle is large, the coupling between the range and azimuth directions becomes severe during SAR imaging processing, which complicates focusing; existing imaging algorithms impose a maximum squint angle if a good focusing effect is to be achieved. The squint-angle requirement of the imaging processor is therefore another constraint on the duration of video SAR imaging.

#### **3. Image formation algorithm of spaceborne video SAR**

The two main modes of video SAR implementation on LEO satellites are spotlight and sliding spotlight. Since there is no overlap of data between adjacent frames in sliding spotlight mode, the imaging algorithm in this mode is the same as the traditional sliding spotlight imaging algorithm. In contrast, in spotlight mode adjacent video frames overlap, and to avoid repeated operations on the overlapping data, an imaging algorithm suited to spaceborne video SAR needs to be studied. In this section, a video SAR imaging algorithm is proposed to address the key technical problems in video SAR imaging and is verified by simulation.

#### **3.1 Echo data segmentation method**

Typically, the spaceborne radar operates in spotlight mode for an extended period while taking video. Conventional spotlight SAR achieves high azimuth resolution by lengthening the synthetic aperture time, whereas in video mode the spotlight raw data are divided into segments to form the video frames. The imaging geometry of spaceborne video SAR based on the equivalent squint range model is shown in **Figure 4**, in which $L_s$ is the distance the radar moves over the whole video imaging process, $\beta$ is the largest synthetic angle, $l_s$ is the distance the radar moves during one video frame, $\theta_i$ is the synthetic angle of a video frame, $\theta_{ci}$ is the squint angle of the video frame, and $v_r$ is the equivalent velocity of the radar.

The Doppler bandwidth of the ith video frame can be expressed as follows.

$$B_{a,i} = f_{d,b} - f_{d,a} \tag{1}$$

where $f_{d,a}$ and $f_{d,b}$ are the Doppler frequencies at the start and end times of the $i$th video frame:

$$f_{d,a} = -\frac{2 v_r \cos\left(\theta_{ci} - \frac{\theta_i}{2}\right)}{\lambda} \tag{2}$$

$$f_{d,b} = -\frac{2 v_r \cos\left(\theta_{ci} + \frac{\theta_i}{2}\right)}{\lambda} \tag{3}$$


**Figure 4.** *The image geometrical mode of spaceborne video SAR.*

Substituting Eqs. (2) and (3), the Doppler bandwidth of the $i$th video frame can be approximated as follows.

$$B_{a,i} = f_{d,b} - f_{d,a} \approx \frac{2 v_r \theta_i \sin \theta_{ci}}{\lambda} \tag{4}$$

The azimuth resolution can be expressed as follows.

$$\rho_{ai} = \frac{v_g}{B_{a,i}} = \frac{\lambda v_g}{2 v_r \theta_i \sin \theta_{ci}} \tag{5}$$

Then the distance the radar moves during the $i$th video frame can be derived as follows.

$$L_i = \frac{\lambda v_g R_0}{2 v_r \rho_{ai} \sin^3 \theta_{ci}} = \frac{\lambda v_g \left(R_0^2 + v_r^2 t_i^2\right)^{3/2}}{2 v_r \rho_{ai} R_0^2} \tag{6}$$

where $t_i$ is the mid-time of the $i$th video frame.

In video SAR, the synthetic aperture time needed to achieve the desired azimuth resolution typically exceeds the frame period. As a result, there can be significant overlap in the collected phase history used to form consecutive images in the video. **Figure 5** illustrates the overlap between adjacent frames of video SAR, where $L_i$ is the synthetic aperture length of the $i$th video frame, $C_L$ is the overlap length of adjacent frames, and $\Delta L$ is the distance between adjacent frames.

**Figure 5.** *The overlap between adjacent frames of video SAR.*

The overlap length of the adjacent frames can be presented as follows.

$$C_L = \frac{L_i}{2} + \left(\frac{L_{i-1}}{2} - \Delta L\right) = \frac{L_i + L_{i-1}}{2} - \Delta L \tag{7}$$

Then the overlap rate can be expressed as follows.

$$a_i = \frac{1}{2} + \frac{\lambda v_g \left(R_0^2 + v_r^2 (t_i - T_f)^2\right)^{3/2} - 2 T_f v_r^2 R_0^2 \rho_{ai}}{2\lambda v_g \left(R_0^2 + v_r^2 t_i^2\right)^{3/2}} \tag{8}$$

where $T_f$ is the frame period.
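A small calculator for the segmentation quantities is sketched below, built directly from Eqs. (6) and (7). The along-track spacing between adjacent frame centers is assumed here to be $\Delta L = v_r T_f$, which is an assumption rather than a relation stated in the text, and the numeric parameters are illustrative values, not taken from the chapter's tables.

```python
import numpy as np

def aperture_length(t_i, wavelength, v_g, v_r, r0, rho_a):
    """Synthetic aperture length of the video frame centered at t_i (Eq. 6)."""
    return (wavelength * v_g * (r0**2 + (v_r * t_i) ** 2) ** 1.5
            / (2.0 * v_r * rho_a * r0**2))

def overlap_rate(t_i, frame_period, wavelength, v_g, v_r, r0, rho_a):
    """Overlap rate of adjacent frames via Eqs. (6)-(7).

    Assumes delta_l = v_r * frame_period for the spacing between adjacent
    frame centers (an assumption made for this sketch).
    """
    l_i = aperture_length(t_i, wavelength, v_g, v_r, r0, rho_a)
    l_prev = aperture_length(t_i - frame_period, wavelength, v_g, v_r, r0, rho_a)
    delta_l = v_r * frame_period
    c_l = 0.5 * (l_i + l_prev) - delta_l      # Eq. (7)
    return c_l / l_i

# Illustrative X-band LEO numbers only.
print(overlap_rate(t_i=0.0, frame_period=0.2, wavelength=0.03,
                   v_g=7000.0, v_r=7500.0, r0=900e3, rho_a=1.0))
```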

#### **3.2 Spaceborne video SAR imaging based on BP algorithm**

In the spaceborne video SAR system, the low-orbit satellite works in spotlight mode and video imaging is realized through suitable segmentation of the echo data. The key technical problems to be solved in spaceborne video SAR imaging mainly include the following [19–26].

#### 1. Large squint angle problem of some data frames

Since the LEO satellite works in spotlight mode, some data frames have large squint angles with severe range–azimuth coupling and range cell migration, which cause serious image-quality degradation under traditional approximation methods.

#### 2. Data overlap of adjacent video frames

As the above analysis shows, because the synthetic aperture time in spaceborne video SAR exceeds the frame period, adjacent frames share a large portion of data; processing each frame individually therefore repeats operations on the overlapping data and greatly reduces computational efficiency.

#### 3. Real-time problem

Spaceborne video SAR imaging needs to deliver video frames in real time or quasi-real time, which requires a highly efficient algorithm; the possibility of parallel computing should be explored on top of advanced computing hardware.

A fast back-projection (BP) algorithm that can be implemented in parallel is proposed to address the imaging characteristics of spaceborne video SAR and the key technical problems above. As an accurate time-domain algorithm, the BP algorithm avoids geometric approximations, so it handles imaging under a complex range model and the severe range–azimuth coupling at large squint angles in spaceborne video SAR [27].


In the BP algorithm, azimuth resolution improves as the number of coherently accumulated pulses increases, so spaceborne video SAR imaging based on the BP algorithm can avoid repeated operations on the data shared by adjacent frames: the current frame can reuse the results already computed for the overlapping data of the previous frame, effectively improving efficiency. At the same time, sub-aperture division further increases the speed of the BP algorithm, and sub-aperture-based processing also enables parallel computing, which helps ensure real-time or quasi-real-time output of video SAR images.

The process of spaceborne video SAR imaging based on the BP algorithm is shown in **Figure 6**. First, the echo signal is divided into several sub-apertures; to avoid repeated operations through sub-aperture superposition, the frame period is required to be an integer multiple of the sub-aperture length. If the synthetic aperture length of a single video frame is $L$ and it is divided into $N$ sub-apertures, the length of each sub-aperture is $L_{\text{sub}} = L/N$. Each sub-aperture can be imaged in parallel to generate a low-resolution image, and finally the full-resolution image of a single frame is obtained by sub-aperture synthesis.

The range pulse compression signal in a single sub-aperture is as follows.

$$s_{BM,ij}(\tau,\eta) = s_{ij}(\tau,\eta) \otimes s_{\text{ref}}^{*}\left[\left(T_{c,ij} - \tau\right), \eta\right] \tag{9}$$

where $\tau$ is the range time, $\eta$ is the azimuth time, $T_{c,ij}$ is the time delay of the reference point, and $s_{\text{ref}}$ is the reference (replica) signal. Since the echo time delay rarely coincides with a sampling point, the range pulse-compressed signal must be interpolated; the interpolated signal is as follows.

$$s_{BUM,ij}(t,\eta) = \sum_{|t - n\Delta\tau| \le N_s \Delta\tau} s_{BM,ij}(n\Delta\tau,\eta)\, h_w(t - n\Delta\tau) \tag{10}$$

where $\tau = n\Delta\tau$ is the sampling point before interpolation, $2N_s$ is the length of the interpolation kernel, and $h_w(t)$ is the interpolation kernel function after windowing. Since the processed echo data have been demodulated, echo phase compensation is needed to achieve coherent accumulation; the result after phase compensation is as follows.

$$s_{BCUM,ij}(t,\eta) = s_{BUM,ij}(t,\eta) \exp\left(j 2\pi f_c t\right) \tag{11}$$

**Figure 6.** *Flow diagram of one-frame SAR image formation.*

The time delay between the target point and each radar position within the sub-aperture is as follows.

$$t_{ij}(n\Delta\eta) = R_{ij}(n\Delta\eta)/c \tag{12}$$

where $\eta = n\Delta\eta$ is the azimuth sampling point and $R_{ij}(n\Delta\eta)$ is the echo path length between the radar and the target $P_{ij}$ at that instant. The imaging result of the target $P_{ij}$ in the $k$th sub-aperture is then as follows.

$$f_{B}^{k}(\alpha_i,\delta_j) = \sum_{n=R}^{S} s_{BCUM,ij}\left[t_{ij}(n\Delta\eta),\, n\Delta\eta\right] \tag{13}$$

where $R\Delta\eta$ is the start time and $S\Delta\eta$ the end time of the $k$th sub-aperture echo data. The low-resolution imaging result of the $k$th sub-aperture, $f_B^k(\alpha,\delta)$, is then obtained by traversing every grid point in the scene.

In the imaging process, the sub-apertures are computed in parallel, which effectively improves computing efficiency. Finally, the low-resolution sub-aperture images are coherently added to obtain the full-resolution image of the $i$th video frame as follows.

$$F_i(\alpha, \delta) = \sum_{k=1}^{N} f_{B}^{k}(\alpha, \delta) \tag{14}$$
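The sketch below condenses Eqs. (10)–(14) into a sub-aperture back-projection routine, assuming range-compressed, demodulated echoes and a known antenna trajectory. The array shapes, the linear interpolation (a simplification of the windowed kernel $h_w$ in Eq. (10)), and the helper names are illustrative, not the chapter's reference implementation.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def bp_subaperture(echo_rc, radar_pos, fast_time, grid_xyz, fc):
    """Back-project one sub-aperture of range-compressed echoes (Eqs. 10-13).

    echo_rc:   (n_pulses, n_range) range-compressed complex echoes
    radar_pos: (n_pulses, 3) antenna phase-center positions
    fast_time: (n_range,) fast-time axis of the range samples
    grid_xyz:  (n_pix, 3) image grid points
    fc:        carrier frequency (echoes assumed demodulated to baseband)
    """
    image = np.zeros(grid_xyz.shape[0], dtype=complex)
    for p in range(echo_rc.shape[0]):
        rng = np.linalg.norm(grid_xyz - radar_pos[p], axis=1)
        tau = 2.0 * rng / C  # two-way delay (Eq. 12, with R taken as one-way range)
        # Linear interpolation onto the exact delay (simplified Eq. 10).
        sample = (np.interp(tau, fast_time, echo_rc[p].real)
                  + 1j * np.interp(tau, fast_time, echo_rc[p].imag))
        image += sample * np.exp(2j * np.pi * fc * tau)  # phase compensation, Eq. (11)
    return image

def video_frame(echo_rc, radar_pos, fast_time, grid_xyz, fc, n_sub):
    """Coherently sum N sub-aperture images into one frame (Eq. 14).

    The sub-apertures are independent and could be dispatched to parallel
    workers; frames sharing sub-apertures can cache and reuse the
    low-resolution images instead of recomputing the overlapping data.
    """
    bounds = np.linspace(0, echo_rc.shape[0], n_sub + 1, dtype=int)
    return sum(bp_subaperture(echo_rc[a:b], radar_pos[a:b], fast_time, grid_xyz, fc)
               for a, b in zip(bounds[:-1], bounds[1:]))
```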

#### **3.3 Simulation**

To validate the video SAR algorithm, a point target simulation is carried out based on the parameters in **Table 1**.

The scene size is 2 × 2 km. There are 25 point targets in the simulated scene, arranged uniformly in a 5 × 5 grid at the initial moment. The point targets in the first and fifth columns and the first and fifth rows are stationary, and the coordinates of the target at the scene center are taken as (0, 0). The motion parameters of the remaining targets are listed in **Table 2**.




#### **Table 2.**

*Motion parameters of moving targets.*

#### **Figure 7.**

*Image formation result of spaceborne video SAR. (a) Stationary target imaging results (b) The first frame of the video SAR (c) The 10th frame of the video SAR (d) The 18th frame of the video SAR.*

The results of the spaceborne video SAR simulation are shown in **Figure 7** with a frame rate of 5 Hz. **Figure 7(a)** shows the imaging results of the 25 points when all targets are stationary in the initial state; each point achieves a good focusing effect. **Figure 7(b)** shows the imaging results of the 1st frame of the spaceborne video SAR. Comparing the imaging results of S(2,2), S(2,3), and S(2,4) shows that the larger the range velocity, the larger the azimuth offset of the target, while the range velocity and azimuth acceleration have little influence on the azimuth spreading of the target. Comparing the imaging results of S(3,2), S(3,3), and S(3,4) shows that the larger the azimuth velocity of the target, the more severe its azimuth spreading, and that range acceleration is also a main cause of azimuth spreading. From the imaging results of S(4,2), S(4,3), and S(4,4), it can be seen that when the target has azimuth velocity, range velocity, and range acceleration simultaneously, the azimuth image of the target is both shifted and defocused. **Figure 7(c)** and **(d)** show the imaging results of frames 10 and 18 of the spaceborne video SAR, respectively. From the imaging results of S(2,4), S(3,4), and S(4,4), it can be seen that as the azimuth and range velocities of the three targets gradually increase over time, the azimuth spreading and shifting also increase. The imaging results of spaceborne video SAR thus correctly reflect the motion information of the targets, providing a basis for subsequent moving-target detection, motion parameter estimation, and relocation and imaging of moving targets based on SAR video.

#### **4. Conclusions**

In this chapter, a general definition of spaceborne video SAR is given first, and three operating modes and possible application directions of spaceborne video SAR are proposed to meet the demand for long-time observation. The imaging duration of spaceborne video SAR is mainly affected by the incidence angle, the azimuth sweeping capability, and the maximum squint angle allowed by imaging processing; the analysis shows that, for low-orbit satellites, the incidence angle is the main factor limiting the video SAR duration. A parallel-computable video SAR imaging algorithm based on sub-aperture division is then proposed for the three key technical problems of spaceborne video SAR imaging and is verified by computer simulation. The simulation results show that the video imaging results correctly reflect the motion of the targets and can provide a basis for moving-target detection, parameter estimation, and relocation and imaging based on SAR video. The results of this chapter can serve as suggestions and references for the construction and application of future spaceborne video SAR systems.

#### **Conflict of interest**

The authors declare no conflict of interest.


### **Author details**

Jian Liang\* and Liang An Institute of Remote Sensing Satellite, China Academy of Space Technology, Beijing, China

\*Address all correspondence to: liangjiancast@163.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] DARPA. Video synthetic aperture radar (ViSAR). Broad Agency Announcement. 2012:DARPA-BAA-12-41:1-51

[2] Well L, Doerry A, et al. Developments in SAR and IFSAR System and Technologies at Sandia National Laboratories. Vol. 2. Big Sky, MT, USA: IEEE; 2005. pp. 1085-1095

[3] Available from: http://www.sandia.gov/radar

[4] Damini A, Balaji B, Parry C. A video SAR mode for the X-band wideband experimental airborne radar. Proceedings of SPIE. 2010;**7699**(0E):1-11

[5] Moses RL, Joshua N. Recursive SAR imaging. In: Proceedings of the SPIE Algorithms for Synthetic Aperture Radar Imagery XV. Orlando, FL, USA: Ohio State University Department of Electrical and Computer Engineering Columbus OH USA. Vol. 6970. 2008. pp. 1-12

[6] Defense Advanced Research Projects Agency. Broad Agency Announcement: Video Synthetic Aperture Radar System Design And Development. DARPA, USA: Arlington; 2012

[7] Wallace HB. Development of a video SAR for FMV through clouds. Proceedings of SPIE. 2015;**9479**(0L):1-2

[8] Palma S, Wahlen A, Stanko S, et al. Real-time onboard processing and ground based monitoring of FMCW-SAR videos. In: The 2014 EUSAR. Vol. 3607. Berlin: IEEE; 2014. pp. 2-5

[9] Cumming IG, Wong FH. Digital Processing of Synthetic Aperture Radar Data Algorithms and Implementation. London: Artech House Inc; 2005

[10] Curlander JC, McDonough RN. Synthetic Aperture Radar: Systems and Signal Processing. NY, USA: John Wiley and Sons, Inc.; 1991

[11] Elachi C. Spaceborne Radar Remote Sensing: Applications and Techniques. New York: The Institute of Electrical and Electronic Engineers, Inc; 1987:33-42

[12] Damini A, Mantle V, Davidson G. A new approach to coherent change detection in video SAR imagery using stack averaged coherence. In: The 2013 IEEE Radar Conference. Vol. 5794. Ottawa, ON, Canada: IEEE; 2013. pp. 13-17

[13] Carrara WG, Goodman RS, Majewski RM. Spotlight Synthetic Aperture Radar: Signal and Processing Algorithms. Norwood, MA, USA: Artech House; 1995

[14] Kirscht M. Detection and velocity estimation of moving objects in a sequence of single-look SAR images. Physical Review A. 1996;**1**(5):333-335

[15] Ouchi K. On the multi-look images of moving targets by synthetic aperture radars. IEEE Transactions on Antennas and Propagation. 1985;**8**(33):823-827. DOI: 10.1109/TAP.1985.1143684

[16] Kirscht M. Method of Detecting Moving Objects and Estimating Their Velocity and Position in SAR Images. United States Patent US6952178B2; 2005

[17] Kirscht M. Detection and imaging of arbitrarily moving targets with single-channel SAR. Radar, Sonar and Navigation, IEE Proceedings. 2003;**150**(1):7-11. DOI: 10.1109/RADAR.2002.1174697


[18] Yang J, Liu C, Wang Y. Imaging and parameter estimation of fast-moving targets with single-antenna SAR. IEEE Geoscience and Remote Sensing Letters. 2014;**11**(2):529-533. DOI: 10.1109/LGRS.2013.2271691

[19] Zhao S, Chen J, Yang W, et al. Image formation method for spaceborne video SAR. In: IEEE 5th Asia-Pacific Conference on Synthetic Aperture Rad. Singapore: IEEE; 2015. pp. 148-151

[20] Linnehan R, Miller J, Bishop E. An autofocus technique for video-SAR. Proceedings of SPIE. 2013;**8746**(08):1-10

[21] Miller J, Bishop E, Doerry A. Applying stereo SAR to remove heightdependent layover effects from video SAR imagery. Proceedings of SPIE. 2014; **9093**(3A):1-10

[22] Hawley RW, Garber WL. Aperture weighting technique for video synthetic aperture radar. Proceedings of SPIE. 2011;**8051**(07):1-7

[23] Jinping S, Zhifeng Y, Wenb H. A new subaperture chirp scaling algorithm for spaceborne spotlight SAR data focusing. In: The 2007 IEEE Radar Conference. Vol. 5794. London: IEEE; 2013. pp. 13-17

[24] Kaizhi Wang, Xingzhao Liu. Squint-spotlight SAR imaging by sub-band combination and range-walk removal. In: Geoscience and Remote Sensing Symposium. Vol. 8742(2). Alaska: IEEE; 2004. pp. 3930-3933

[25] Jin L, Liu X, Wang J. Adaptive subaperture approach for spotlight SAR azimuth processing. In: Geoscience and Remote Sensing Symposium. Boston: IEEE. Vol. 2808:(3) 2008: 1292-1295.

[26] Wang W, Ma X. A novel data preprocessing method for resolving doppler ambiguity of spaceborne spotlight SAR. In: Geoscience and Remote Sensing Symposium. Munich: IEEE; Vol. 978:(1) 2012:5-8.

[27] Miller J, Bishop E, Doerry A. An application of backprojection for Video SAR image formation exploiting a subaperture circular shift register. Proceedings of SPIE. 2013;**8746**(09):1-14

#### **Chapter 4**

## Evolution of Attacks on Intelligent Surveillance Systems and Effective Detection Techniques

*Deeraj Nagothu, Nihal Poredi and Yu Chen*

#### **Abstract**

Intelligent surveillance systems play an essential role in modern smart cities by enabling situational awareness. As part of the critical infrastructure, surveillance systems are often targeted by attackers aiming to compromise the security and safety of smart cities. Manipulating the audio or video channels can create a false perception of captured events and bypass detection. This chapter presents an overview of the attack vectors designed to compromise intelligent surveillance systems and discusses existing detection techniques. With advanced machine learning (ML) models and computing resources, both attack vectors and detection techniques have evolved to use ML-based techniques more effectively, resulting in non-equilibrium dynamics. Current detection techniques range from training neural networks to detect forgery artifacts to using intrinsic and extrinsic environmental fingerprints to expose manipulations. Studying the effectiveness of different detection techniques and their reliability against the defined attack vectors is therefore a priority for securing the system and creating a plan of action against potential threats.

**Keywords:** intelligent surveillance systems, internet of video things (IoVT), multimedia forgery, environmental fingerprints, forgery detection, DeepFake detection

#### **1. Introduction**

The modern smart city infrastructure has advanced by integrating multimedia-based information input and the development of the edge computing paradigm [1, 2]. An increase in visual and auditory input from deployed sensors has enabled processing of incoming information at multiple network layers. While most intelligent infrastructure depends on a cloud computing-based architecture [3], edge computing has been attracting more and more attention to meet the increasing challenges of scalability, availability, and instant, on-site decision making [4–6]. Advancements in artificial intelligence (AI) have equipped edge computers to process the incoming multimedia feed and deploy recognition and detection software. Machine learning (ML)-based models such as object detection, tracking, speech recognition, and people identification are commonly deployed to enhance security in infrastructure and private properties [7]. With the increase in such technological advancements, reliance on these systems has also grown exponentially, and the trust placed in a system depends directly on the information retrieved by its multimedia sensor nodes [8]. The edge devices support multi-node communication and are equipped with Internet connections to provide continuous functionality and security services.

Due to their significance in infrastructure security and functionality, edge computing devices are commonly targeted through network attacks over Wi-Fi and RF links [9]. Devices are compromised through malicious firmware updates [10], which create a backdoor with admin privileges. The perpetrators then control the device Input/Output (IO) and compromise the network and home security. Specifically, visual-layer attacks are developed to manipulate the visual sensor in edge devices and create a false perception of the live events monitored by the control station. Simple frame manipulation such as frame duplication or shuffling allows the perpetrator to mask the original frame, easily compromising the security of the infrastructure [11]. Without trustworthy surveillance recordings there is also no evidence of crimes, which undermines the very purpose of such security devices. Along with the visual layer, the audio channel of the edge nodes is equally targeted. Modern home security is enabled with voice commands and home assistant systems that act on the voice commands received. The audio devices include voice-based home assistant computers and Voice over IP (VoIP) surveillance recorders. Attackers can target the audio channel through hidden voice commands, control the system, or completely mask the audio channel with noise to disable its functionality [12].

As ML-based models have enhanced surveillance systems' capabilities, they have also enabled the development of frame manipulation attacks. Beginning with traditional copy-move forgery attacks in the spatial regions of a frame, modern deep learning (DL) has produced generative networks capable of creating a frame from a user's input. Adversarial attacks have rendered some ML models useless by specifically targeting and disabling their functionality. Generative adversarial networks (GANs) have created DeepFakes, which have become one of the most challenging problems in current multimedia forgery [13]. DeepFake models can be trained to run on low-compute systems such as edge devices and produce manipulations such as face swaps, facial re-enactments, and complete manipulation of a targeted person's movements, resulting in very realistic media output [14]. Clearly, both the visual and auditory channels require robust security measures and reliable authentication schemes to detect such malicious attacks and secure the network [15, 16].

Advancements in forgery attacks have always been countered with detection schemes. Traditional frame forgery attacks were first detected using watermark technology and compression artifacts [17]. However, when the edge device itself is compromised, the frames are manipulated at the source, and watermarks are applied to the false frames. Similarly, as DeepFakes were developed, counterpart detection schemes were trained. Early DeepFakes carried visual artifacts such as face recordings without any eye blinking or with face-warping artifacts [18]. Still, with more training data and better networks, DeepFakes have evolved to the point where they are almost indistinguishable from real images [19]. Although the technology has its merits when used ethically in the medical and entertainment fields, perpetrators can always use DeepFake technology with malicious intent in the absence of a reliable detector. Creating a reliable detection scheme that clearly distinguishes real from fake is an ongoing effort.


This chapter provides an overview of the evolution of multimedia-based attacks to compromise the edge computing nodes such as surveillance systems and their counterpart forgery detection schemes. The essential features required by a reliable detection system are analyzed and a framework using an environmental fingerprint is introduced that has proven to be effective against such attacks.

#### **2. An overview of audio-visual layer attacks**

The networked edge devices are commonly deployed over Wi-Fi or RF links in a private network. The primary means of hijacking a secure device is a network-layer attack in which the communication between devices is intercepted and modified [20]. This lets the source and the destination believe that the information exchange was secure while a perpetrator alters the intercepted messages as required. Malicious firmware is installed through direct physical access to the USB interface or through a remote web interface, giving the perpetrator admin privileges on the edge device. Some devices are even sold through legitimate channels with malicious firmware pre-installed [10]. With complete access to the visual and audio sensor nodes, the attacker can manipulate the media-capturing module itself, rendering network-level security measures ineffective.

Surveillance systems are the most targeted edge devices due to their importance and access medium [11]. Network attacks like Denial of Service (DoS) can disable the devices' network connections and negate their purpose. Common administrative mistakes, such as keeping default credentials for network and device logins, are primary enablers of backdoor entry. Once the device or the network is compromised, the attacker typically encodes a trigger mechanism into the system. This allows the perpetrator to remotely launch a selected attack on command without re-accessing the device. Malicious inputs can be encoded into the multimedia encoding scheme of the edge device. Trigger methods such as QR-code-based inputs in the video stream that the device interprets as commands [21], face-detection-based triggers [22], and hidden voice commands through the audio channels [12] are a few examples of how an attack can be remotely controlled. Wearable technologies like Google Glass have also been affected through backdoor firmware, where a QR-code-based input was used to hack the device [23].

With remote trigger mechanisms, a device can be controlled to manipulate the incoming media signal. Face detection software can be re-programmed to blur selected faces and car registration plates, or to disable certain functionality such as detecting prohibited items like guns [9]. Popular Xerox scanners and photocopiers were hacked to manipulate the contents of scanned documents, inserting random numbers instead of the actual data [24]. Surveillance cameras with Pan-Tilt-Zoom (PTZ) capabilities can be re-positioned to increase the number of blind spots in a surveillance area [25]. Audio Event Detectors (AED) are commonly deployed in surveillance devices to raise alarms on suspicious audio activity, and in home assistant devices to detect wake commands. Still, the AED system can be directly targeted with hidden voice commands so that it interprets its input falsely [12, 26]. Using adversarial techniques, popular ML models on edge devices are targeted so that the input itself can be modified [27]. Frame-level pixel manipulations confuse ML models and cause object recognition models to produce false categorizations [28]. A wearable patch can be trained to target a person identification ML model; worn by a perpetrator in the form of a t-shirt, it lets them evade the identification module [29].

Access to the multimedia sensor nodes can result in many variants of visual- and audio-layer attacks. To study effective detection methods, we first narrow our focus to the video frame manipulation and audio overlay attacks commonly designed to target edge-based media inputs such as surveillance devices and online conferencing technologies.

#### **2.1 Frame manipulation attacks**

Video recordings used for temporal correlation of live events are primarily targeted by frame shuffling or duplication attacks [30]. The perception of live events is altered, undermining the effectiveness of live monitoring [31]. Adaptive replay attacks are designed so that the frame duplication adapts to changes in the environment such as light intensity variations, object displacement, and camera alignment. With adaptive frame masking, the operator at the monitoring station cannot distinguish between real and fake images, since the duplicated frames were originally copied from the same source camera [22]. For the same reason, source device identification and watermarking techniques are negated. **Figure 1** represents a frame replay attack in which the attack is triggered remotely by either a QR code or a face detection module, and the resulting frame is masked with a static background [21, 32].

Spatial manipulation of a frame involves changes to the pixels, such as object addition or deletion, while the static background is maintained. Frame-level manipulations are commonly made to deceive the viewer about the presence of a subject [33, 34].

**Figure 1.** *Frame duplication attack to manipulate the perception of live events, triggered by the perpetrator's face detection.*

#### **2.2 Audio masking and overlay**

Most edge nodes are equipped with audio recording capabilities, making them a target for forgery attacks [3]. Many households are equipped with surveillance cameras, home assistants, and edge devices capable of two-way communication. The AED module is responsible for wake-command detection or for event detection based on audio such as gunshot sounds. The input audio sensor nodes can be disabled by compromising the AED module and replacing the actual event audio with quiet static noise. The input can also be degraded by adding white noise to disrupt the AED module [26].

#### **2.3 DeepFake attack**

DeepFake attacks, developed using the GAN architecture [13], have resulted in a large quantity of fake media. With enough training data and computational resources, the quality of the generated media keeps improving to the point where a person cannot distinguish between real and fake media [35]. Although DeepFake technology has legitimate applications, any technology can cause more harm than good in the wrong hands. Evolving software tools have made it easy and convenient to generate DeepFake media on a mobile phone.

Simple face manipulation software that lets two people swap their facial landmarks first appeared in the form of mobile applications. Soon, more advanced technologies made the swap more realistic [36]. Many organizations and institutions rely on online conferencing solutions for their daily communications, and face-swapping technologies allow perpetrators to mimic a source subject's facial landmarks and duplicate their online persona [37]. Moreover, with the capability to extract facial landmarks and skeletal features from a source subject, a new form of DeepFake emerged that projects the source's movements onto a targeted subject (**Figure 2**).

The facial re-enactment software [38] allows the model to extract face landmark movements from a source subject. These landmarks are projected onto a targeted victim, producing media in which the victim appears to act out whatever the perpetrator wishes. Although the model was created to demonstrate the capabilities of deep learning models, it has been used to target politicians and celebrities with fake media. A GAN model has also been created that projects the source's body actions onto a targeted person [39]; although introduced for an entertainment application, it could alternatively be used to frame a victim by forging their actions in surveillance media. Style-based transfer learning has enabled GAN technology to create even more realistic and indistinguishable output [19].

**Figure 2.** *DeepFake face swap attack to project a source face on a target.*

Introducing perturbations into real objects or images can cause edge-layer object classifiers to make incorrect predictions, which could have serious repercussions. A study showed that small changes to a stop sign could cause an object detector to wrongly classify it as a different object, as depicted in **Figure 3(a)** [40]. This phenomenon has been analyzed, and the Fast Gradient Sign Method (FGSM) attack was proposed, which uses the gradient of the classifier's loss function to construct the perturbations necessary to carry out the attack [41]. The attack begins by targeting an image and observing the classifier's confidence in its class prediction. Next, the minimum perturbation that maximizes the classifier's loss function is found iteratively. Using this method, the image can be manipulated so that incorrect classification is achieved without producing any discernible difference to the human eye, as shown in **Figure 3(b)**. The Jacobian-based Saliency Map Attack [42] computes the Jacobian matrix of the CNN used for object classification and produces a saliency map, which denotes how much influence each pixel of the image has on the CNN-based classifier's prediction. The image is manipulated iteratively: in each iteration the two most influential pixels chosen from the saliency map are changed, the saliency map is updated, and each pixel is changed only once. The process stops when the adversarial image is successfully classified as the target label (**Figure 3**).
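A minimal FGSM sketch following the description above is shown below, assuming a trained PyTorch classifier `model`, an input `image` batch scaled to [0, 1], its true `label`, and a perturbation budget `epsilon`; all of these names and the budget value are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Fast Gradient Sign Method: perturb the input along the sign of the
    loss gradient to maximize the classifier's loss (untargeted attack)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # One signed gradient step, clipped back to the valid pixel range.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```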

**Table 1** summarizes the multimedia attack techniques and their respective targeted systems. Along with video, audio is equally targeted when creating realistic fake media. Paired with technologies like facial re-enactment, DeepFake audio can create the illusion of a targeted person performing manipulated actions. Software like Descript [43] can recreate a source voice with as little as 10 minutes of training audio. Emerging technologies like DeepFake call for a reliable detector that can distinguish between real and fake media to preserve security and privacy in the modern digital era. Due to the inconsistencies in the earlier stages of DeepFake media, many detector modules were created to identify the artifacts introduced during media generation. However, with more training data and advanced computing, the generated output improved and rendered previous detection schemes obsolete. In the following section, we study the key parameters a reliable detector requires to establish an authentication system for digital media.

#### **3. Detection techniques against multimedia attacks**

Countering forgery attacks led to the development of detection techniques relying on artifacts related to the in-camera processing modules or to post-processing methods. Prior knowledge of the source of the media recordings is an advantage in detecting forgery; without that knowledge, some techniques depend on the artifacts introduced by the forgery itself. Conventional techniques based on blind analysis, prior knowledge, and forgery artifacts are discussed first, followed by neural networks trained to identify forgery.


**Figure 3.** *a) Adversarial patches cause the classifier to wrongly classify the stop sign. b) FGSM attack based on introducing pixel-based perturbations.*

#### **3.1 Conventional detection methods**

The in-camera processing modules and the post-processing of captured media generate unique features and artifacts, which can be exploited to identify frame forgeries. Each image-capturing device is equipped with wide or telescopic lenses, and the unique interaction between the lens and the imaging sensor creates chromatic aberrations. A profile of these unique chromatic aberrations can be created to identify foreign frames inserted from a different lens and sensor [44, 45]. Along with lens distortion artifacts, another in-camera processing module applied after image acquisition is the Color Filter Array (CFA). The CFA records light at certain wavelengths, and a demosaicing algorithm interpolates the missing colors. A periodic pattern emerges from the in-built CFA module, and whenever a frame is forged, the periodic pattern is disrupted. For frame-region splicing attacks, the interrupted CFA periodicity is analyzed to detect the forgery and localize the attack [46, 47].

Each manufactured camera sensor interacts uniquely with the light-capturing mechanism due to its sensitivity and photodiode, so a unique Sensor Pattern Noise (SPN) is generated for every source camera [48]. With prior knowledge of a camera's sensor noise fingerprint, the image acquisition device can be identified. The SPN is similar for RGB and infrared video, though weaker in infrared due to low light [49]. Since SPN is used for source device identification, frames inserted from an external camera can be detected, along with localized in-frame manipulations. The frame and audio acquisition processes also introduce a noise level into the media recordings that depends on the sensor's light sensitivity and localized room reverberations. Using Error Level Analysis, rich features can be extracted from the present noise level to reveal possible anomalies from image splicing [50].
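A compact Error Level Analysis sketch is given below: it re-saves the image once at a fixed JPEG quality and returns the per-pixel residual, which an analyst would then inspect for regions that stand out. The quality setting is an illustrative choice, and the routine is a simplified take on ELA rather than the exact method of [50].

```python
import io
import numpy as np
from PIL import Image

def error_level_analysis(path: str, quality: int = 90) -> np.ndarray:
    """Error Level Analysis: re-save the image as JPEG and measure the
    per-pixel difference. Spliced regions often re-compress differently
    from the rest of the image, so anomalies show up in the residual."""
    original = Image.open(path).convert("RGB")
    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=quality)
    buf.seek(0)                                   # rewind before re-reading
    resaved = Image.open(buf).convert("RGB")
    residual = np.abs(np.asarray(original, dtype=np.int16)
                      - np.asarray(resaved, dtype=np.int16))
    return residual.astype(np.uint8)              # inspect bright regions
```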


#### **Table 1.**

*Summary of attack vectors and affected modules.*

*<sup>b</sup> Attack launching complexity—varied based on ease of access and computational requirements.*

In the post-processing of captured media, each compression algorithm uses a unique encoding, so repeated processing and multiple compressions of the same media can leave artifacts that reveal prior changes. By analyzing the compression used in H.264 coding, the presence of recompression artifacts can be used to identify frame manipulations [51]. Spatial and temporal correlation is used to create motion-vector features [30, 52]: the de-synchronization caused by removing a group of frames introduces spikes in the Fourier transform of the motion vectors. However, these techniques are sensitive to the resolution of, and noise in, the recordings.

Frame manipulations also inadvertently introduce their own artifacts, so attacks can be identified with prior knowledge of the attack's nature. Much research has been based on custom hand-crafted features. Scale-invariant feature transform key points are used as features for comparing duplicated frames in a video recording [53]; the features capture illumination, noise, rotation, scaling, and small changes in viewpoint. For continuous frame capture, the standard deviation of residual frames can support inter-frame duplication detection [54, 55]. Histograms of Oriented Gradients (HOGs) provide a unique representation of pixel-value fluctuations, which can be used to identify copy-move forgery from HOG feature fluctuations [56]. Optical flow represents the pattern of apparent motion of the image between consecutive frames and its displacement; a feature vector built from the optical flow can identify copy-move forgery [31]. In another approach, features are generated for each frame and lexicographically sorted [57]; the Root Mean Square Error (RMSE) is calculated between frames, and any frame whose RMSE crosses the decision threshold is identified as duplicated. However, the sorting and RMSE computation take considerable processing time, so the technique is not applicable in real-time settings.
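A naive version of the RMSE comparison is sketched below, assuming grayscale frames stacked in a NumPy array; it makes the quadratic cost that rules out real-time use explicit. The threshold is an illustrative parameter, and this sketch skips the lexicographic-sorting speedup of [57].

```python
import numpy as np

def find_duplicated_frames(frames: np.ndarray, threshold: float = 1.0):
    """Flag near-duplicate frame pairs by pairwise RMSE.

    frames: grayscale video, shape (n_frames, rows, cols).
    Pairs whose RMSE falls below the threshold are duplication candidates.
    O(n^2) comparisons, which is why this is not real-time friendly.
    """
    n = frames.shape[0]
    candidates = []
    for i in range(n):
        for j in range(i + 1, n):
            rmse = np.sqrt(np.mean((frames[i].astype(float)
                                    - frames[j].astype(float)) ** 2))
            if rmse < threshold:
                candidates.append((i, j, rmse))
    return candidates
```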


#### **3.2 Machine learning-based detection methods**

The development of AI in computer vision has enabled efficient media processing for forgery detection using trained neural networks. The anomalies introduced into media recordings produce forgery-specific artifacts, which many research approaches exploit.

#### *3.2.1 Artifacts and feature-based detection*

The convolutional neural network (CNN) is the most commonly used feed-forward model for frame processing, enabling pixel-level data analysis. Forgery attacks such as temporal and spatial frame manipulation and DeepFakes create underlying artifacts that can be extracted to identify the forgery [58]. In the initial stages of DeepFake development, the generated media exhibited visible frame-level artifacts such as inconsistent eye blinking, face warping, and inconsistent head poses. A CNN model was later trained to identify the abnormalities introduced by DeepFakes by looking for face-warping artifacts [59]. Because the synthesized face region is spliced into the original image, a 3D head-pose estimation model was created to identify pose inconsistencies [18]. With the pixel information obtained from videos, filters can be designed to identify tampering; filters based on the discrete cosine transform and video re-quantization errors, combined with deep CNNs, have been used [60].

DeepFake generation tools have been integrated with online conferencing tools to create a fake virtual presence by mimicking a targeted person. The video chat liveness detection in [61] can identify a fake persona by its anomalous behavior: the model is trained on behavioral expression in online presence, and any abnormality is marked as fake. For offline media, the audio and video may be manipulated to fabricate a video statement; however, the underlying synchronization error between the video lip movements and the corresponding audio can be used to identify fake media [62]. To counter DeepFake videos on edge-based computers and online social media, lightweight machine learning models are trained on facial presence and its spatial and temporal features [63]. Video conferencing solutions are also protected by analyzing the live video stream with a 3D convolutional neural network that predicts segment-wise fakeness scores; the fake online persona is identified by a CNN trained on large DeepFake datasets such as DeeperForensics, DFDC, and VoxCeleb.

Along with video forgeries, audio forgeries have been designed to target the AED systems in IoT devices such as Amazon's Echo Dot and Google's Nest Hub. Using audio perturbations, an attacker can make the AED system misclassify incoming voice commands or ignore them entirely [64]. Training a CNN and a recurrent neural network (RNN) [26] has secured the AED system against white noise meant to disrupt the commands.

#### *3.2.2 Fingerprint-based detection*

Modern DeepFake videos are almost perfect, without any visual inconsistencies. However, the underlying pixel information is modified by the projection of foreign information onto existing media. With advancing DeepFake technology, current research has developed techniques that identify the underlying pixel fluctuations and use unique fingerprints left by GAN models and in-camera processing. The authors in [65, 66] identified that GANs leave unique fingerprints in the media generated by their networks; by profiling these fingerprints, the forgery can be detected and the source GAN model used to create it can also be identified. DeepFake models introduce pixel-level frequency fluctuations that result in spectral inconsistencies: inspecting a fake image shows that the up-sampling convolutions of the CNN in a GAN introduce frequency artifacts [67]. A filter-based design is used in [68] to highlight the frequency-component artifacts introduced by GANs; one filter operates on the high-frequency region of the image and the other at the pixel level, observing changes in the background pixels. A biological signature can also be created from portrait videos by collecting signals from different image sections, such as facial regions, including under image distortions [69].
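One simple way to probe such spectral inconsistencies is to compare the azimuthally averaged power spectra of known-real and suspect images, in the spirit of the frequency-artifact analyses cited above; the sketch below is a heuristic illustration, not the method of [67] or [68].

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Azimuthally averaged power spectrum of a grayscale image.

    GAN up-sampling tends to leave characteristic high-frequency artifacts;
    comparing this 1-D profile between known-real and suspect images is a
    simple fingerprint-style check (a heuristic, not a full detector).
    """
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    rows, cols = spectrum.shape
    y, x = np.indices((rows, cols))
    radius = np.hypot(y - rows / 2, x - cols / 2)
    bins = np.linspace(0, radius.max(), n_bins + 1)
    which = np.digitize(radius.ravel(), bins) - 1          # bin index per pixel
    power = np.bincount(which, weights=spectrum.ravel(), minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    return power[:n_bins] / np.maximum(counts[:n_bins], 1)  # mean power per ring
```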

#### *3.2.3 Adversarial training-based detection*

Deep neural networks have proven to be effective tools for extracting features exclusive to DeepFaked images and can thus detect DeepFake-based image forgery. The traditional approach uses a dataset containing real and fake images to train a CNN model to identify artifacts that point to forgery. However, this can lead to a generalization problem, as the validation dataset is often a subset of the training dataset. To mitigate this, the images can be preprocessed with Gaussian blur and Gaussian noise [70], which suppresses pixel-level high-frequency artifacts. Hybrid models have also been proposed that use multiple parallel streams to detect fake images [71]: one branch uses a GoogLeNet-based model to differentiate between benign and faked images, and another uses a steganalysis feature extractor to capture low-level details. The results from both branches are then fused to reach the final decision on whether a particular image has been tampered with.

There are various machine learning approaches to detecting fake or tampered videos, and they can be broadly categorized into those that use biological features and those that observe spatial and temporal relationships. One study proposed a novel approach based on eye blinking [72]: forgery techniques such as DeepFakes were known to produce little to no eye blinking in the fake videos they generated. Using a combination of CNNs and RNNs trained on an eye-blinking dataset, a binary classifier can be produced and used to detect fake videos with reasonable accuracy. Facial regions of interest have been used to train models to differentiate between real and DeepFaked videos [73]; specifically, photoplethysmography (PPG), which uses color intensities to detect heartbeat variations, was used to train a GAN to distinguish real from fake face videos. The drawback is that this method is limited to high-resolution videos containing faces.

Spatiotemporal analysis-based methods treat videos as a time-ordered collection of frames. Here, in addition to CNNs, Long Short-Term Memory (LSTM) models are used for their ability to learn temporal characteristics. One such combination, using a CNN to extract frame-level features and an LSTM for temporal sequence analysis, was proposed in [74]: the input to the LSTM is a concatenation of the per-frame features extracted by the CNN, and the final output is a binary prediction of whether the video is genuine. GANs have also been proposed as a means of analyzing the spatiotemporal relationships of videos; an information-theoretic approach was used to study the statistical distributions of fake and real frames, and the differential between them was used to make a decision [75].
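A skeletal version of such a CNN + LSTM detector is sketched below; the ResNet-18 backbone, the hidden size, and the input layout are illustrative choices rather than the architecture of [74].

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmDetector(nn.Module):
    """CNN + LSTM video forgery detector in the spirit of [74]: a CNN
    extracts per-frame features, an LSTM models their temporal sequence,
    and a linear head emits a single real/fake logit per clip."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # keep the 512-d frame features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h, _) = self.lstm(feats)                          # final hidden state
        return self.head(h[-1])                               # one logit per clip
```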

#### **4. Measure of effective detection techniques**

Evaluating the current state of media authentication, existing state-of-the-art techniques rely on a fundamental forgery-related artifact or on training a deep neural network to identify a specific forgery. However, the same deep learning technology allows a perpetrator to hijack the existing detection scheme and counteract its purpose. A source-device identification methodology that locates the capture device by leveraging the Sensor Pattern Noise fingerprint can be spoofed: a counter-method uses a GAN-based approach to inject camera traces into synthetic images, deceiving detectors into accepting the synthetic images as real [76]. Developments in GAN technology and abundantly available computing resources have generated much fake media that is indistinguishable from real. A style transfer technique can project facial features onto a targeted person and re-create a realistic image [19].

Modern infrastructure relying on machine learning algorithms for seamless people detection and tracking is targeted by adversarial training: a wearable patch can be trained and worn to escape detection or fool the detector into misclassifying the object [29]. Remote trigger mechanisms for frame-level attacks are activated using visual cues and avoid detection through face blur or frame duplication [22]. Tools with simple instructions have been designed to let users create DeepFakes in online video conferences by portraying a targeted person [77].

The need for secure media authentication that spans multiple media categories becomes more and more compelling because of an increase in counterattacks on existing detection techniques. Based on our analysis of the current state-of-the-art detection methods and their counterattacks, here we highlight the key ingredients of the most successful and reliable approaches:


The forgery detection technique should enable its authentication measures across all devices capable of capturing multimedia.


Analyzing the critical traits of a reliable detection system, we propose an environmental fingerprint based on the power system frequency that satisfies the qualities above. The following section discusses the rationale behind our fingerprint-based authentication system for edge-based IoT devices.

#### **5. Environmental fingerprint-based detection**

Electrical Network Frequency (ENF) is the power system frequency, with a nominal value of 60 Hz in the United States and 50 Hz in most European and Asian countries. The power system frequency fluctuates around its nominal value, making it time-varying; the resulting signal is referred to as the ENF signal. ENF-based media authentication was first introduced for audio forgery detection in law enforcement systems [78]. The ENF fluctuations are consistent across a power grid interconnect and originate from the balance of power supply and demand, which makes them unique, random, and unpredictable. In audio recordings, ENF is induced through electromagnetic induction when the recorder is connected to the power grid [78]. It was later discovered that battery-operated devices can also capture ENF fluctuations through the background hum generated by grid-powered devices [79]. In video recordings, ENF is captured in the form of the illumination frequency of artificially powered light sources [80]. How the ENF signal is captured in images depends on the type of imaging sensor used in the camera. For a CCD sensor with a global shutter mechanism, one sample is captured per frame, since the whole sensor is exposed at one instant. For a CMOS sensor with a rolling shutter mechanism, each sensor row is exposed sequentially, so ENF samples are collected from both spatial and temporal regions of a frame [81, 82].

ENF estimation from media recordings enables many applications due to the signal's unique, time-varying nature. For geographical tagging of media recordings, the estimated ENF signal is compared with a global reference database to identify the recording location [83]. Because ENF fluctuations are similar throughout a power grid, they can be used to synchronize multimedia recordings across audio and video channels [84]. The fluctuations in ENF and the standard deviation of the signal from its nominal value are also observed to study load effects on the grid and predict blackouts [85].

The estimation of ENF from media recordings has been thoroughly studied, both for reliable signal estimation [86, 87] and for the factors that affect the embedding process [82, 88]. Owing to the nature of the ENF signal, an ENF-based authentication system can detect false frame forgeries in both spatial and temporal regions. In DeFake [77, 89], the distributed nature of ENF is exploited by utilizing ENF as a consensus mechanism for distributed authentication among edge-based systems. Media collected from online systems are processed, and the ENF signal is estimated along with the consensus ground-truth signal. With the help of the correlation coefficient, any mismatch in the signal is located and an alarm is raised. For detailed system implementation and ENF integration techniques, interested readers are referred to papers on the ENF-based authentication system [90, 91].
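A minimal sketch of the correlation check described above is given below, assuming the media-side and reference ENF traces have already been estimated and aligned; the window length and alarm threshold are illustrative assumptions.

```python
# Sliding-window comparison between the ENF estimated from a media stream
# and the consensus reference ENF; a low correlation coefficient flags a
# potential frame forgery in that window.
import numpy as np

def enf_mismatch_alarms(enf_media, enf_reference, window=60, threshold=0.8):
    """Yield (start_index, correlation) for suspicious windows."""
    for start in range(0, len(enf_media) - window + 1, window):
        a = enf_media[start:start + window]
        b = enf_reference[start:start + window]
        r = np.corrcoef(a, b)[0, 1]  # normalized correlation coefficient
        if r < threshold:
            yield start, r  # raise an alarm for this segment
```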

#### **6. State of multimedia authentication**

The state of detection systems and forgery attacks never reaches an equilibrium in which one detection scheme can function as a solution for all types of attacks. This chapter discussed the evolution of forgery attacks, from subtle frame-level modifications to advanced generated images of fake people, along with the parallel development of detection methods. Based on the critical observations discussed in Section 4, **Table 2** presents a comparison of several current forgery detection techniques.

ENF is a reliable detection method provided the signal is embedded in the media recordings. The current limitation of this approach concerns recording environments where no ENF-inducing equipment is present. In the absence of artificial lights during outdoor recording, ENF is not captured in the video channel. However, in the case of outdoor surveillance recordings, the device is usually connected to the power grid directly, and the ENF signal is induced in the audio recordings.

Most of the DeepFake detection techniques presented require substantial computational resources to analyze each frame, and in general, edge devices are not equipped with such power. A different approach is to design lightweight algorithms that utilize artifacts or fingerprints for detection. The DeFake approach avoids any training step, and the ENF estimation can be performed on low-computing hardware such as a Raspberry Pi [91]. Although computer vision has advanced with the emergence of deep learning architectures, DeFake is an environmental fingerprint-based approach that relies on signal processing techniques and has shown encouraging results.


#### **Table 2.**

*A comparison of recently proposed forgery detection techniques.*

#### **7. Conclusions**

The development of forgery attacks has accelerated with advancing computer vision technologies, and the need for a reliable and secure authentication system has become more compelling. Most detection systems are exploited through their weaknesses, and attackers frequently launch attacks targeting both the system and its security mechanisms. This chapter studied the evolution of multimedia attacks, from traditional frame-level modifications to advanced machine learning-based techniques such as DeepFakes. Countering each forgery, we analyzed the detection techniques proposed over time and their progress alongside the attacks. For a reliable detection and authentication system, we identified vital ingredients that a system should possess to counter forgery attacks. A thorough analysis and comparison of existing detection techniques was performed to understand the current state of multimedia authentication. Based on the key qualities introduced for a reliable system, we highlighted DeFake, an environmental fingerprint-based authentication system, and described its applications against frame forgeries such as DeepFake attacks. Given the state of current edge computing technologies and the constant attacks aimed at disabling such systems, DeFake has the potential to provide a unique approach for detecting forgery attacks and protecting information integrity.

#### **Acknowledgements**

This work is supported by the U.S. National Science Foundation (NSF) via grant CNS-2039342 and the U.S. Air Force Office of Scientific Research (AFOSR) Dynamic Data and Information Processing Program (DDIP) via grant FA9550-21-1-0229. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Air Force.

#### **Abbreviations**




#### **Author details**

Deeraj Nagothu, Nihal Poredi and Yu Chen\* Department of Electrical and Computer Engineering, Binghamton University, Binghamton, New York, USA

\*Address all correspondence to: ychen@binghamton.edu

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Chen J, Li K, Deng Q, Li K, Philip SY. Distributed deep learning model for intelligent video surveillance systems with edge computing. IEEE Transactions on Industrial Informatics. 2019

[2] Nikouei SY, Chen Y, Song S, Choi BY, Faughnan TR. Toward intelligent surveillance as an edge network service (isense) using lightweight detection and tracking algorithms. IEEE Transactions on Services Computing. 2019;**14**(6): 1624-1637

[3] Obermaier J, Hutle M. Analyzing the security and privacy of cloud-based video surveillance systems. In: Proc. 2nd ACM Int. Work. IoT Privacy, Trust. Secur. NY, United States: ACM; 2016. pp. 22-28

[4] Shi W, Cao J, Zhang Q, Li Y, Xu L. Edge computing: Vision and challenges. IEEE Internet of Things Journal. 2016; **3**(5):637-646

[5] Chen N, Chen Y, Blasch E, Ling H, You Y, Ye X. Enabling smart urban surveillance at the edge. In: 2017 IEEE International Conference on Smart Cloud (Smart Cloud). NY, United States: IEEE; 2017. pp. 109-119

[6] Chen N, Chen Y, Ye X, Ling H, Song S, Huang CT. Smart city surveillance in fog computing. In: Advances in Mobile Cloud Computing and Big Data in the 5G Era. Cham, Switzerland: Springer; 2017. pp. 203-226

[7] Nikouei SY, Xu R, Nagothu D, Chen Y, Aved A, Blasch E. Real-time index authentication for event-oriented surveillance video query using blockchain. In: 2018 IEEE Int. Smart Cities Conf. ISC2 2018. 2019

[8] Xu R, Nagothu D, Chen Y. Decentralized video input authentication as an edge Service for Smart Cities. IEEE Consumer Electronics Magazine. 2021; **10**(6):76-82

[9] Mowery K, Wustrow E, Wypych T, Singleton C, Comfort C, Rescorla E, et al. Security analysis of a full-body scanner. In: 23rd USENIX Security Symposium (USENIX Security 14). San Diego, CA, United States; 2014. pp. 369-384

[10] Olsen M. Beware, Even Things on Amazon Come with Embedded Malware. 2016. Available from: https://artfulhacker.com/post/142519805054/beware-even-things-on-amazon-come

[11] Costin A. Security of Cctv and video surveillance systems: Threats, vulnerabilities, attacks, and mitigations. In: Proc. 6th Int. Work. Trust. Embed. Devices. NY, United States: ACM; 2016. pp. 45-54

[12] Carlini N, Mishra P, Vaidya T, Zhang Y, Sherr M, Shields C, et al. Hidden voice commands. In: 25th USENIX Security Symposium (USENIX Security 16). Austin, TX, United States; 2016. pp. 513-530

[13] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Nets. In: Advances in Neural Information Processing Systems. Montreal, Canada; 2014

[14] Verdoliva L. Media forensics and deepfakes: An overview. IEEE Journal of Selected Topics in Signal Processing. 2020;**14**(5):910-932

[15] Westerlund M. The emergence of Deepfake technology: A review. Technology Innovation Management Review. 2019;**9**(11):40-53

[16] Nagothu D, Chen YY, Blasch E, Aved A, Zhu S. Detecting malicious false frame injection attacks on surveillance Systems at the Edge Using Electrical Network Frequency Signals. Sensors (Basel). 2019;**19**(11):1-19

[17] Wolfgang RB, Delp EJ. A watermark for digital images. In: Proceedings of 3rd IEEE International Conference on Image Processing. Lausanne, Switzerland: IEEE; 1996. pp. 219-222

[18] Yang X, Li Y, Lyu S. Exposing deepfakes using inconsistent head poses. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, United Kingdom: IEEE; 2019. pp. 8261-8265

[19] Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, United States; 2019. pp. 4401-4410

[20] Mena DM, Papapanagiotou I, Yang B. Internet of things: Survey on security. Information Security Journal: A Global Perspective. 2018;**27**(3):162-182. DOI: 10.1080/19393555.2018.1458258

[21] Kharraz A, Kirda E, Robertson W, Balzarotti D, Francillon A. Optical delusions: A study of malicious QR codes in the wild. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Atlanta, GA, United States; 2014. pp. 192-203

[22] Nagothu D, Schwell J, Chen Y, Blasch E, Zhu S. A study on smart online frame forging attacks against video surveillance system. In: Proc. SPIE - Int. Soc. Opt. Eng. Bellingham, Washington, United States; 2019

[23] Zhang C, Shahriar H, Riad ABMK. Security and privacy analysis of wearable health device. In: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). Los Alamitos, CA, United States; 2020. pp. 1767-1772

[24] Kriesel D. Xerox Scanners/Photocopiers Randomly Alter Numbers in Scanned Documents. 2017. Available from: https://www.dkriesel.com/en/blog/2013/0802\_xerox-workcentres\_are\_switching\_written\_numbers\_when\_scanning

[25] Stamm MC, Lin WS, Liu KR. Temporal forensics and anti-forensics for motion compensated video. IEEE Transactions on Information Forensics and Security. 2012;**7**(4):1315-1329

[26] dos Santos R, Kassetty A, Nilizadeh S. Disrupting audio event detection deep neural networks with white noise. Technologies. 2021;**9**(3):64

[27] Akhtar N, Mian A. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access. 2018;**6**:14410-14430

[28] Quan W, Nagothu D, Poredi N, Chen Y. Cri PI: An efficient critical pixels identification algorithm for fast onepixel attacks. In: Sensors and Systems for Space Applications. Bellingham, Washington, United States: SPIE; 2021. pp. 83-99

[29] Thys S, Van Ranst W, Goedemé T. Fooling automated surveillance cameras: Adversarial patches to attack person detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Long Beach, CA, United States; 2019

[30] Wang W, Farid H. Exposing digital forgeries in video by detecting duplication. In: Proc. 9th Work. Multimed. Secur. Dallas, Texas, United States: ACM; 2007. pp. 35-42

[31] Al-Sanjary OI, Abdullah Ahmed A, Ahmad HB, Ali MAM, Mohammed MN, Irsyad Abdullah M, et al. Deleting object in video copy-move forgery detection based on optical flow concept. In: 2018 IEEE Conference on Systems, Process and Control (ICSPC). Melaka, Malaysia; 2018. pp. 33-38

[32] Ulutas G, Ustubioglu B, Ulutas M, Nabiyev V. Frame duplication/mirroring detection method with binary features. IET Image Processing. 2017;**11**(5):333-342

[33] Su L, Li C. A novel passive forgery detection algorithm for video region duplication. Multidimensional Systems and Signal Processing. 2018;**29**(3):1173-1190

[34] Wahab AWA, Bagiwa MA, Idris MYI, Khan S, Razak Z, Ariffin MRK. Passive video forgery detection techniques: A survey. In: 2014 10th Int. Conf. Inf. Assur. Al Ain, United Arab Emirates; 2014. pp. 29-34

[35] Korshunov P, Marcel S. Deep fakes: a new threat to face recognition? Assessment and detection. In: 2018 Computer Vision and Pattern Recognition. Salt Lake City, Utah, United States; 2018. DOI: 10.48550/ arXiv.1812.08685

[36] Bitouk D, Kumar N, Dhillon S, Belhumeur P, Nayar SK. Face swapping: Automatically replacing faces in photographs. In: ACM SIGGRAPH 2008 Papers. SIGGRAPH '08. New York, NY, USA: Association for Computing Machinery; 2008. pp. 1-8

[37] Perov I, Gao D, Chervoniy N, Liu K, Marangonda S, Umé C, et al. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. In: 2021 Computer Vision and Pattern Recognition. NY, United States; 2021

[38] Thies J, Zollhofer M, Stamminger M, Theobalt C, Niessner M. Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, Nevada; 2016. pp. 2387-2395

[39] Chan C, Ginosar S, Zhou T, Efros AA. Everybody dance now. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea; 2019. pp. 5933-5942

[40] Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, et al. Robust physical-world attacks on deep learning visual classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah; 2018. pp. 1625-1634

[41] Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. In: 6th International Conference on Learning Representations (ICLR). San Diego, CA, United States; 2015. DOI: 10.48550/arXiv.1412.6572

[42] Papernot N, McDaniel P, Jha S, Fredrikson M, Celik ZB, Swami A. The limitations of deep learning in adversarial settings. In: IEEE European Symposium on Security and Privacy (Euro S and P). Saarbrücken, Germany: IEEE; 2016. pp. 372-387

[43] Descript | Create Podcasts, Videos, and Transcripts. 2021. Available from: https://www.descript.com/

[44] Yerushalmy I, Hel-Or H. Digital image forgery detection based on lens and sensor aberration. International Journal of Computer Vision. 2011;**92**(1):71-91

[45] Fu H, Cao X. Forgery authentication in extreme wide-angle lens using distortion cue and fake saliency map. IEEE Transactions on Information Forensics and Security. 2012;**7**(4):1301-1314

[46] Bayram S, Sencar H, Memon N, Avcibas I. Source camera identification based on CFA interpolation. In: IEEE International Conference on Image Processing. Genoa, Italy; 2005

[47] Cao H, Kot AC. Accurate detection of demosaicing regularity for digital image forensics. IEEE Transactions on Information Forensics and Security. 2009;**4**(4):899-910

[48] Lukas J, Fridrich J, Goljan M. Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security. 2006;**1**(2):205-214

[49] Hyun DK, Lee MJ, Ryu SJ, Lee HY, Lee HK. Forgery detection for surveillance video. In: Jin JS, Xu C, Xu M, editors. The Era of Interactive Media. New York, NY: Springer; 2013. pp. 25-36

[50] Cozzolino D, Poggi G, Verdoliva L. Splicebuster: A new blind image splicing detector. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS). Rome, Italy; 2015. pp. 1-6

[51] González Fernández E, Sandoval Orozco AL, García Villalba LJ. Digital video manipulation detection technique based on compression algorithms. IEEE Transactions on Intelligent Transportation Systems. 2022;**23**(3):2596-2605

[52] Wang W, Farid H. Exposing digital forgeries in interlaced and deinterlaced video. IEEE Transactions on Information Forensics and Security. 2007;**2**(3): 438-449

[53] Kharat J, Chougule S. A passive blind forgery detection technique to identify frame duplication attack. Multimedia Tools and Applications. 2020;**79**(11):8107-8123

[54] Fadl SM, Han Q, Li Q. Authentication of surveillance videos: Detecting frame duplication based on residual frame. Journal of Forensic Sciences. 2018;**63**(4): 1099-1109

[55] Bestagini P, Milani S, Tagliasacchi M, Tubaro S. Local tampering detection in video sequences. In: 2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP). Pula, Sardinia, Italy; 2013. pp. 488-493

[56] Subramanyam AV, Emmanuel S. Video forgery detection using HOG features and compression properties. In: 2012 IEEE 14th International Workshop on Multimedia Signal Processing (MMSP). 2012. pp. 89-94

[57] Singh VK, Pant P, Tripathi RC. Detection of frame duplication type of forgery in digital video using sub-block based features. Int. Conf. Digit. Forensics Cyber Crime. Seoul, South Korea: Springer; 2015. pp. 29–38

[58] Güera D, Delp EJ. Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Auckland, New Zealand; 2018

[59] Li Y, Lyu S. Exposing deepfake videos by detecting face warping artifacts. In: Computer Vision and Pattern Recognition. Salt Lake City, Utah, United States; 2018

[60] Zampoglou M, Markatopoulou F, Mercier G, Touska D, Apostolidis E, Papadopoulos S, et al. Detecting tampered videos with multimedia forensics and deep learning. In: Kompatsiaris I, Huet B, Mezaris V, Gurrin C, Cheng WH, Vrochidis S, editors. Multi Media Modeling. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2019. pp. 374-386

[61] Liu H, Li Z, Xie Y, Jiang R, Wang Y, Guo X, et al. Live Screen: Video Chat Liveness Detection Leveraging Skin Reflection. In: IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. Toronto, ON, Canada: IEEE; 2020. pp. 1083-1092

[62] Zhou Y, Lim SN. Joint audio-visual deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada; 2021. pp. 14800-14809

[63] Chen HS, Rouhsedaghat M, Ghani H, Hu S, You S, Kuo CCJ. DefakeHop: A light-weight high-performance deepfake detector. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). Shenzhen, China; 2021

[64] Santos RD, Nilizadeh S. Audio attacks and defenses against AED systems – A practical study. In: Audio and Speech Processing. Ithaca, NY, United States; 2021. DOI: 10.48550/arXiv.2106.07428

[65] Marra F, Gragnaniello D, Verdoliva L, Poggi G. Do GANs Leave Artificial Fingerprints? In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). San Jose, CA, United States; 2019. pp. 506-511

[66] Cozzolino D, Verdoliva L. Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security. 2020;**15**:144-159

[67] Durall R, Keuper M, Pfreundt FJ, Keuper J. Unmasking deep fakes with simple features. In: Computer Vision and Pattern Recognition. Seattle, Washington, United States; 2020

[68] Jeong Y, Kim D, Min S, Joe S, Gwon Y, Choi J. BiHPF: Bilateral high pass filters for robust deepfake detection. In: Proceedings of the IEEE/ CVF Winter Conference on Applications of Computer Vision. Waikoloa, HI, United States; 2022. pp. 48-57

[69] Ciftci UA, Demir I, Yin L. FakeCatcher: Detection of synthetic portrait videos using biological signals. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020:1-1

[70] Xuan X, Peng B, Wang W, Dong J. On the generalization of GAN image forensics. In: Chinese Conference on Biometric Recognition. Zhuzhou, China: Springer; 2019. pp. 134-141

[71] Zhou P, Han X, Morariu VI, Davis LS. Two-stream neural networks for tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, Hawaii, United States: IEEE; 2017. pp. 1831-1839

[72] Li Y, Chang MC, Lyu S. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong, China: IEEE; 2018. pp. 1-7

[73] Ciftci UA, Demir I, Yin L. How do the hearts of deep fakes beat? deep fake source detection via interpreting residuals with biological signals. In: 2020 IEEE International Joint Conference on Biometrics (IJCB). Houston, TX, United States: IEEE; 2020. pp. 1-10

[74] Güera D, Delp EJ. Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Auckland, New Zealand: IEEE; 2018. pp. 1-6

[75] Agarwal S, Girdhar N, Raghav H. A novel neural model based framework for detection of GAN generated fake images. In: 2021 11th International Conference on Cloud Computing, Data Science Engineering (Confluence). Uttar Pradesh, India; 2021. pp. 46-51

[76] Cozzolino D, Thies J, Rossler A, Niessner M, Verdoliva L. SpoC: Spoofing camera fingerprints. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. NY, United States; 2021. pp. 990-1000

[77] Nagothu D, Xu R, Chen Y, Blasch E, Aved A. DeFake: Decentralized ENF-consensus based deep fake detection in video conferencing. In: IEEE 23rd International Workshop on Multimedia Signal Processing. Tampere, Finland; 2021

[78] Grigoras C. Applications of ENF criterion in forensic audio, video, computer and telecommunication analysis. Forensic Science International. 2007;**167**(2–3):136-145

[79] Chai J, Liu F, Yuan Z, Conners RW, Liu Y. Source of ENF in battery-powered digital recordings. In: Audio Eng. Soc. Conv. Rome, Italy: Audio Engineering Society; 2013

[80] Garg R, Varna AL, Hajj-Ahmad A, Wu M. "Seeing" ENF: Power-signaturebased timestamp for digital multimedia via optical sensing and signal processing. IEEE Transactions on Information Forensics and Security. 2013;**8**(9):1417-1432

[81] Vatansever S, Dirik AE, Memon N. Analysis of rolling shutter effect on ENF based video forensics. IEEE Transactions on Information Forensics and Security. 2019;**14**(7):2262-2275

[82] Nagothu D, Chen Y, Aved A, Blasch E. Authenticating video feeds using electric network frequency estimation at the edge. EAI Endorsed Transactions on Security and Safety. 2021;**7**(24):e4-e4

[83] Wong CW, Hajj-Ahmad A, Wu M. Invisible geo-location signature in a single image. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Alberta, Canada: 2018. pp. 1987-1991

[84] Vidyamol K, George E, Jo JP. Exploring electric network frequency for joint audio-visual synchronization and multimedia authentication. In: 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT). Kannur, Kerala, India; 2017. pp. 240-246

[85] Liu Y, You S, Yao W, Cui Y, Wu L, Zhou D, et al. A distribution level wide area monitoring system for the electric power grid–FNET/GridEye. IEEE Access. 2017;**5**:2329-2338

[86] Hua G, Zhang H. ENF signal enhancement in audio recordings. IEEE Transactions on Information Forensics and Security. 2020;**15**:1868-1878

[87] Hajj-Ahmad A, Garg R, Wu M. Spectrum combining for ENF signal estimation. IEEE Signal Processing Letters. 2013;**20**(9):885-888

[88] Hajj-Ahmad A, Wong CW, Gambino S, Zhu Q, Yu M, Wu M. Factors affecting ENF capture in audio. IEEE Transactions on Information Forensics and Security. 2019;**14**(2): 277-288

[89] Xu R, Nagothu D, Chen Y. EconLedger: A proof-of-ENF consensus based lightweight distributed ledger for IoVT networks. Future Internet. 2021;**13**(10):248

[90] Nagothu D, Xu R, Chen Y, Blasch E, Aved A. Detecting compromised edge smart cameras using lightweight environmental fingerprint consensus. In: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. New York, NY, USA: Association for Computing Machinery; 2021. pp. 505-510

[91] Nagothu D, Xu R, Chen Y, Blasch E, Aved A. Deterring Deepfake attacks with an electrical network frequency fingerprints approach. Future Internet. 2022;**14**(5):125

[92] Mehta V, Gupta P, Subramanian R, Dhall A. FakeBuster: a DeepFakes detection tool for video conferencing scenarios. In: 26th International Conference on Intelligent User Interfaces-Companion. College Station, TX, United States; 2021. pp. 61-63

[93] Durall R, Keuper M, Keuper J. Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. NY, United States; 2020. pp. 7890-7899

[94] Afchar D, Nozick V, Yamagishi J, Echizen I. MesoNet: A compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong, China; 2018. pp. 1-7

#### **Chapter 5**

## Surveillance with UAV Videos

*İbrahim Delibaşoğlu*

#### **Abstract**

Unmanned aerial vehicles (UAVs) and drones are now accessible to everyone and are widely used in civilian and military fields. In military applications, UAVs can be used in border surveillance to detect or track any moving object/target. The challenges of processing UAV images are the unpredictable background motion due to camera movement and the small target sizes. In this chapter, a short literature review of moving object detection and long-term object tracking is given, and publicly available datasets are introduced. General approaches and success rates of the proposed methods are evaluated, and approaches for how deep learning-based solutions can be used together with classical methods are discussed. In addition to the methods in the literature for the moving object detection problem, possible solution approaches for the remaining challenges are also shared.

**Keywords:** surveillance, moving object, motion detection, foreground detection, object tracking, long-term tracking, UAV video, drones

#### **1. Introduction**

Unmanned aerial vehicles (UAVs) and drones are now accessible to everyone and are widely used in civilian and military fields. Considering security applications, drones can be used for tasks such as surveillance, target detection, and tracking. Drone surveillance allows us to continuously gather information about a tracked target from a distance, so drones with capabilities such as object tracking, autonomous navigation, and event analysis are a hot topic in the computer vision community. The challenge of processing drone videos is the unpredictable background motion due to camera movement. In this chapter, a short literature review and potential approaches to improve moving object detection performance are discussed, and publicly available datasets are introduced. In addition, the current situation of deep learning-based solutions, which give good results in many research areas, is discussed for motion detection, together with potential improvements. General approaches and success rates of the proposed methods are shared, and approaches for how deep learning-based solutions can be used together with classical methods are proposed. In brief, we propose some post-processing techniques to improve the performance of background modeling-based methods, and a software architecture that speeds up operations by dividing them into small parts.

Section 2 presents the moving target detection problem for UAV videos, while Section 2.1 describes how to build a simple background model. Section 2.2 introduces sample datasets for moving target detection, and Section 2.3 gives potential approaches to enhance the background modeling approach. Object tracking methods that can be used together with moving object detection and Convolutional Neural Network (CNN)-based methods are discussed in Sections 3 and 4, respectively. Finally, conclusions are given in Section 5.

#### **2. Moving object detection**

The problem of detecting moving objects is a computer vision task needed in areas such as real-time object tracking, event analysis, and security applications, and it has been studied extensively in the computer vision literature in recent years [1]. The purpose of moving object detection is to classify image pixels as foreground or background. The classification can be challenging depending on factors such as the motion state of the camera, ambient lighting, background clutter, and dynamic changes in the background. Images obtained from cameras mounted on drones exhibit free motion, which causes significant background motion (also called global motion in the literature). Another important issue is that these images can be taken over very different regions, such as mountains, nature, forests, cities, and rural areas, and they can contain very small targets depending on the altitude of the UAV.

In moving object detection applications, the aim is to achieve high accuracy as well as real-time operation. When the studies in the literature are examined, it is seen that subtraction of consecutive frames, background modeling, and optical flow-based methods are used. Although the consecutive-frame subtraction method works fast and can adapt quickly to background changes, its success rate is very low [2]. In the background modeling approach, a background model (an image formed as the average of the previous *n* frames) is extracted from the frame history [3]. Classical image processing techniques [4], statistical methods [5–7], and neural networks [8] have been used for background modeling in the literature. The Gaussian mixture model (GMM) [9] builds a Gaussian distribution model for each pixel, and adaptive GMM [7] improves it for dynamic backgrounds. Kim et al. [10] propose a spatio-temporal Gaussian model minimizing image registration errors. Zhong Z. et al. [11] propose a background updating strategy operating at both pixel and object levels and apply a pixel-based adaptive segmentation method. The dual-target non-parametric background modeling method [12] proposes a dual-target updating strategy to eliminate false detections caused by background movements and illumination changes. The scene conditional background update method [13], named SCBU, builds a statistical background model without contamination from foreground pixels. Background subtraction is applied between the current frame and the updated background model, while the calculated foreground likelihood map is used to extract initial foreground regions by applying high and low threshold values. In the MCD method [6], a dual-mode single Gaussian model is proposed with the age, mean, and variance of each pixel, and camera motion is compensated by mixing neighboring models. A simple threshold on the variance is applied in the MCD method for foreground detection. Yu et al. [14] use a candidate background model similar to MCD and propose a method to update candidate or main background model pixels in each frame. In the background subtraction step, they apply a neighborhood subtraction approach, which takes into account the neighbors of each pixel. The BSDOF method [15] extracts candidate foreground masks with background subtraction and applies a threshold to the variance value of each pixel. In the background subtraction process, BSDOF also

uses dense optical flow to weight the difference for each pixel. It then obtains a final mask by combining the candidate masks with a region-growing strategy. Thus, false detections are largely eliminated.

For the background modeling approach with moving cameras (such as cameras mounted on UAVs), global motion is generally eliminated using a homography matrix obtained with the Kanade-Lucas-Tomasi (KLT) tracker [16] and RANSAC [17]. Points selected in the previous frame are tracked in the current frame with KLT, and the homography matrix representing the global (camera) motion is calculated with RANSAC. Then, the previous frame or background model is warped to the current frame to eliminate the global motion. Grid-based selected points and their estimated positions are visualized as flow vectors in **Figure 1**.
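For concreteness, the following OpenCV sketch implements this compensation step: corner points are tracked with KLT, a homography is fitted with RANSAC, and the previous frame (or background model) is warped onto the current frame. The parameter values are illustrative assumptions.

```python
# Global motion compensation for a moving camera: KLT point tracking,
# RANSAC homography fitting, and perspective warping.
import cv2
import numpy as np

def compensate_global_motion(prev_gray, curr_gray):
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                       qualityLevel=0.01, minDistance=10)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   pts_prev, None)
    good = status.ravel() == 1
    # RANSAC rejects tracks on moving objects while fitting camera motion.
    H, _ = cv2.findHomography(pts_prev[good], pts_curr[good],
                              cv2.RANSAC, 3.0)
    h, w = curr_gray.shape
    # Warp the previous frame into the current frame's coordinates.
    return cv2.warpPerspective(prev_gray, H, (w, h)), H
```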

One of the biggest problems in using only pixel intensity values is that such methods are very sensitive to illumination changes and to registration errors caused by homography errors. As a solution to these issues, different features such as texture [18], edge [19], and haar-like [20] features have been proposed in the literature. Edge and texture features can better address illumination changes and also eliminate the ghosting effect left by foreground objects. Local Binary Patterns (LBP) and their variants [21, 22] are other texture features used for foreground detection in the literature. In addition to such hand-crafted features, deep learning methods, which offer effective solutions to many problems, have also been applied to the foreground detection problem. For this purpose, the FlowNet2 [23] architecture, which estimates optical flow vectors, has been used in foreground detection [24]. Optical flow is the displacement of each pixel between consecutive frames. The KLT method is also an optical flow method that tracks given points in the consecutive frame and is categorized as sparse optical flow. On the other hand, estimating the displacement of every pixel is called dense optical flow. FlowNet2 is one of the best-known architectures and has publicly available pre-trained weights. The disadvantage of deep learning methods is that they require a high computational cost, especially for high-resolution images, and may not perform well for very small targets due to the training image dimensions and

**Figure 1.** *Visualization of flow vectors for grid points.*

contents. Considering that UAV images may contain many small targets, an optical flow model trained with small moving object images could perform better. On the other hand, processing high-resolution input images requires a large amount of GPU memory. **Figure 2** shows a sample visualization of optical flow for FlowNetCSS (a pre-trained model that detects small changes well and is more lightweight than FlowNet2), Farneback, and Nvidia Optical Flow (NVOF). FlowNetCSS is a sub-network of FlowNet2.

In this work, we have used FlowNet pre-trained weights trained on the MPI-Sintel dataset [25], which contains images with a resolution of 1024 × 436. **Figure 3** shows the FlowNetCSS output on 1920 × 1080 resolution images from the PESMOD dataset [26]. In **Figure 4**, the model is run on a patch of the frame instead of the full resolution, and it performs better for the small targets (two people hiking in the mountains). Simple thresholding could be applied to the optical flow matrices to obtain the foreground mask showing the moving pixels directly. But for small targets, it may be useful to process small regions as shown in **Figure 4**. Global motion compensation with the homography matrix may also be used

**Figure 2.** *Visualization of optical flow vectors of FlowNetCSS, Farneback and NVOF.*

**Figure 3.** *FlowNet visualization on PESMOD [26] sample frames.*

#### **Figure 4.**

*FlowNet visualization on a patch of PESMOD [26] sample frames.*

**Figure 5.** *(a) Sample frame (b) Background model μ image.*

before estimating dense optical flow, so that simple thresholding can give the moving pixels with better accuracy, as sketched below.
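The sketch below illustrates this thresholding step using OpenCV's Farneback dense optical flow as a readily available stand-in for FlowNet; the magnitude threshold is an illustrative assumption, and global motion is assumed to have been compensated beforehand.

```python
# Dense-optical-flow thresholding on an image patch: pixels whose flow
# magnitude exceeds a threshold are marked as moving.
import cv2
import numpy as np

def moving_mask_from_flow(prev_patch, curr_patch, mag_thresh=1.5):
    flow = cv2.calcOpticalFlowFarneback(prev_patch, curr_patch, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    magnitude = np.linalg.norm(flow, axis=2)  # per-pixel displacement
    return (magnitude > mag_thresh).astype(np.uint8) * 255
```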

#### **2.1 Building a background model**

Let *H* represent the homography matrix between the frames at times *t* − 1 and *t*. The background model *B* at time *t* − 1 is warped to the current frame using Eq. (1). Thus, the pixels in the background model and the current frame are aligned to handle global motion. *α*<sub>t</sub>(*i*) represents the learning rate of each pixel, while *μ*<sub>t</sub>(*i*) represents the average pixel value. The background model *B* consists of the mean and learning-rate values as shown in Eqs. (2) and (3).

$$B_t = H_{t-1}\, B_{t-1} \tag{1}$$

$$\alpha_t(i) = \frac{1}{\text{age}_t(i)}\tag{2}$$

$$
\mu_t(i) = \left(1 - \alpha_t(i)\right)\mu_{t-1}(i) + \alpha_t(i)\, I_t(i) \tag{3}
$$

In the equations, *I* represents a frame while *i* represents a pixel in a frame. The learning rate (*α*) is determined by the *age* value of each pixel. A sample frame and background image are shown in **Figure 5** for a maximum age value of 30. It is also important to set pixels whose age is less than a fixed threshold to zero, because pixels that have just entered the frame need to wait for a while before being evaluated. After building the background model, the current frame is subtracted from the *μ* image to obtain a foreground mask. However, a simple model using only RGB color features is very sensitive to errors such as shadows, ghost effects, illumination changes, and background motion. Thus, it is important to use extra texture features for background modeling, as mentioned in Section 2. In Section 2.3 we discuss some approaches to improve the performance of the BSDOF method while using color features effectively.
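A compact numpy/OpenCV sketch of the update rules in Eqs. (1)–(3) is given below, assuming *μ* and *age* are kept as float32 images; the maximum age of 30 follows the example above, while the function and variable names are illustrative.

```python
# Age-based running-average background model (Eqs. (1)-(3)).
import cv2
import numpy as np

def update_background(mu, age, frame, H, max_age=30.0):
    """mu, age, frame: float32 images; H: homography from t-1 to t."""
    h, w = frame.shape
    # Eq. (1): warp the model so its pixels align with the current frame.
    mu = cv2.warpPerspective(mu, H, (w, h))
    age = cv2.warpPerspective(age, H, (w, h))
    age = np.minimum(age + 1.0, max_age)
    alpha = 1.0 / age                        # Eq. (2): per-pixel learning rate
    mu = (1.0 - alpha) * mu + alpha * frame  # Eq. (3): running average
    return mu, age
```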

#### **2.2 Datasets**

Changedetection.net (CDNET) [27] is a large-scale video dataset consisting of 11 different categories, but only the PTZ subsequence consists of images taken by a moving camera. The PTZ sequence does not include free motion, so it is not well suited to evaluating the motion detection problem for UAV images. The SCBU dataset [13] includes images of walking pedestrians taken with a freely moving camera. The VIVID dataset [28], consisting of aerial images, is a good candidate for evaluating moving object detection methods. It consists of moving vehicle images and has a resolution of 640x480. The PESMOD dataset [15] is a new, challenging high-resolution dataset for the evaluation of small moving object detection methods. It includes eight different sequences with a resolution of 1920x1080 and consists of small moving targets (vehicles and humans). The PESMOD dataset contains a total of 4107 frames and 13,834 labeled bounding boxes for moving targets. The details of each sequence are given in **Table 1**.

Average precision (*Pr*), recall (*R*), and F1 (*F*1) scores of the MCD, SCBU, and BSDOF methods on the PESMOD dataset are given in **Table 2**. In Eq. (4), *FP* refers to wrongly detected boxes, *TP* refers to the number of true detections, and *FN* refers to ground truth boxes that are missed by the method. *Pr* indicates the accuracy of positive predictions (regions estimated as motion), while *R* (also named sensitivity) represents the ratio of the number of pixels correctly classified as foreground (motion) to the actual


#### **Table 1.**

*The details of PESMOD dataset.*


#### **Table 2.**

*Comparison of average precision, recall and f1 score values of MCD, SCBU and BSDOF methods on PESMOD dataset.*

number of foreground pixels. The *F*<sup>1</sup> score is the combination of *Pr* and *R*, and is equal to 1 for perfect classification:

$$Pr = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \, Pr \, R}{Pr + R} \tag{4}$$
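The following sketch makes the evaluation concrete: detections are matched to ground-truth boxes by intersection over union (the IOU > 0.5 criterion also used in Section 3), and Eq. (4) is then applied to the resulting counts. The function names are our own.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def evaluate(detections, ground_truth, iou_thresh=0.5):
    """Return (precision, recall, F1) per Eq. (4)."""
    matched, tp = set(), 0
    for det in detections:
        best = max(range(len(ground_truth)), default=None,
                   key=lambda i: iou(det, ground_truth[i]))
        if best is not None and best not in matched \
                and iou(det, ground_truth[best]) >= iou_thresh:
            matched.add(best)
            tp += 1
    fp, fn = len(detections) - tp, len(ground_truth) - tp
    pr = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * r / (pr + r) if pr + r else 0.0
    return pr, r, f1
```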

The BSDOF method is suitable for implementation on the GPU. It runs at about 26 fps for 1920x1080 video on a PC with the Ubuntu 18.04 operating system, an AMD Ryzen 5 3600 processor with 16 GB RAM, and an Nvidia GeForce RTX 2070 graphics card. MCD runs at about 8 fps on the same machine. SCBU is implemented for the CPU, and we only had access to its binary files, so we could not measure the processing time of the SCBU method under the same conditions.

#### **2.3 Prospective solutions for challenges**

As mentioned in the detailed review article [29], the main challenges are still dynamic backgrounds, registration errors, and small targets. Using extra features like LBP for better performance also increases the computational cost, so it is not suitable for the real-time requirements of high-resolution videos. Therefore, an alternative solution might be to create a background model using only color features and to process texture features only for the extracted candidate target regions. This avoids extracting texture features for every pixel. In addition to texture features, classical methods and/or Deep Neural Networks (DNNs) can be used to compute a similarity score between the background image and the current frame for candidate target regions. The Structural Similarity (SSIM) score [30] can be used to measure the similarity between image patches. As an alternative, any pre-trained CNN model could be used for feature extraction, but using a lightweight sub-network is important since it will be applied to many candidate regions. **Figure 6** shows sample bounding boxes detected with the BSDOF method on the PESMOD dataset. **Table 3** presents average SSIM scores between current frame and background image patches for ground truth boxes and false positives (FP).
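A hedged sketch of this filtering idea follows: candidate boxes whose background and current-frame patches are highly similar are treated as false positives. The threshold value is an illustrative assumption; scikit-image provides the SSIM implementation, and grayscale patches of equal size are assumed.

```python
# SSIM-based filtering of candidate moving-object boxes.
from skimage.metrics import structural_similarity as ssim

def filter_candidates(background, frame, boxes, ssim_thresh=0.9):
    """Keep (x, y, w, h) boxes whose patch differs from the background."""
    kept = []
    for (x, y, w, h) in boxes:
        bg_patch = background[y:y + h, x:x + w]
        fr_patch = frame[y:y + h, x:x + w]
        # High similarity means no real change: likely a registration error.
        if ssim(bg_patch, fr_patch) < ssim_thresh:
            kept.append((x, y, w, h))
    return kept
```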

**Figure 6.** *Moving object detection output of BSDOF for Pexels-Shuraev-trekking sequence.*


#### **Table 3.**

*SSIM scores for ground truth (GT) and false positives (FP) of the BSDOF method.*

Experiments with similarity comparison show that it can be useful for eliminating some false detections caused by registration errors and illumination changes. The similarity score is expected to be high for false detections (no moving object) and low for moving object regions. However, we observed that the similarity measure can also be low in very small areas, such as 5x5 pixels, and in regions with no moving object. The background model can be blurred for some pixels due to registration errors and/or background motion, which results in a low similarity score for these cases. In general, extreme wrong detections can be eliminated with a high threshold value without losing true detections.

Image registration errors cause possible false detections, especially for objects with sharp edges. Even though similarity comparison can help to eliminate false detections, simple tracking approaches can also be used for this issue. The historical points of each detection are stored in a *tracker list*, and the detections in each frame are compared to the *tracker list*. Tracked regions can then be classified by hit count (the number of consecutive frames with a detection) and total pixel displacement. However, it should be noted that the coordinate values in the *tracker list* must be adjusted in each frame to eliminate global motion. This approach works well if the moving target region can be extracted successfully in consecutive frames and the bounding boxes overlap with a high intersection over union (IOU) value for good matching. As an alternative approach, a robust tracking method can be used, but it probably requires more computational cost. Targets detected with the moving object detection algorithm can be tracked with a robust tracker to obtain more precise results, so that tracking continues even when the target stops.
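The following minimal sketch illustrates the *tracker list* idea, reusing the `iou()` helper from the evaluation sketch in Section 2.2; stored box corners are shifted by the frame-to-frame homography before matching, and only tracks with enough consecutive hits are reported. The class and parameter names are illustrative assumptions.

```python
# Simple tracker list for confirming moving objects across frames.
import cv2
import numpy as np

class TrackList:
    def __init__(self, min_hits=3, iou_thresh=0.3):
        self.tracks = []  # each entry: {"box": (x, y, w, h), "hits": int}
        self.min_hits = min_hits
        self.iou_thresh = iou_thresh

    def update(self, detections, H):
        # Adjust stored coordinates for camera (global) motion.
        for t in self.tracks:
            x, y, w, h = t["box"]
            p = cv2.perspectiveTransform(np.float32([[[x, y]]]), H)[0, 0]
            t["box"] = (float(p[0]), float(p[1]), w, h)
        for det in detections:
            best = max(self.tracks, default=None,
                       key=lambda t: iou(t["box"], det))
            if best is not None and iou(best["box"], det) > self.iou_thresh:
                best["box"], best["hits"] = det, best["hits"] + 1
            else:
                self.tracks.append({"box": det, "hits": 1})
        # Report only tracks detected in several consecutive frames.
        return [t["box"] for t in self.tracks if t["hits"] >= self.min_hits]
```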

As another approach, classical background modeling and deep learning-based methods can be used in collaboration across different processes. Our experiments show that classical methods suffer more from image registration errors, especially for fast camera movements. Therefore, the classical method and deep learning results can be combined using different strategies according to the camera movement speed. Alternatively, dense optical flow with deep learning could be applied only to small patches detected by classical background modeling. In order to implement such an approach, a software infrastructure in which background modeling and deep learning methods run in different processes, communicate with each other, and share data is essential in terms of speed. It allows us to run the processes in a pipeline to speed up the algorithm, as shown in **Figure 7**. In the proposed architecture, process-1 applies the classical background modeling approach and informs process-2 to start via ZeroMQ.

#### **Figure 7.**

*Software architecture to run processes in pipeline logic.*

The ZeroMQ messaging library is used to transfer metadata and to inform the other processes that a frame is ready to be processed. The foreground mask cannot be shared via messaging protocols in real-time, so shared memory (shmem) is used to transfer this large amount of data between processes. Accordingly, the foreground mask is transferred to process-2 through shared memory, and process-2 applies deep learning-based dense optical flow only to patches extracted from the input foreground mask. Finally, process-3 estimates the moving target bounding boxes by processing the dense optical flow output. With such a parallel structure in pipeline logic, process-1 processes frame *I*<sub>t</sub> while process-2 processes *I*<sub>t−1</sub>.
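A minimal sketch of the process-1/process-2 hand-off is shown below, assuming a fixed mask shape, a local TCP endpoint, and Python's `multiprocessing.shared_memory`; all names are illustrative and not the exact implementation.

```python
# Pipeline hand-off: light metadata over ZeroMQ, the heavy foreground mask
# over shared memory.
import numpy as np
import zmq
from multiprocessing import shared_memory

SHAPE, DTYPE = (1080, 1920), np.uint8  # assumed mask geometry

def publish_mask(mask, frame_id, shm, socket):
    """process-1: write the mask and tell process-2 it is ready."""
    np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)[:] = mask
    socket.send_json({"frame": frame_id, "shm": shm.name})

def consume_mask(socket):
    """process-2: read the metadata, then attach to the shared mask."""
    meta = socket.recv_json()
    shm = shared_memory.SharedMemory(name=meta["shm"])
    mask = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    return meta["frame"], mask  # run dense optical flow on mask patches

# Setup (process-1):
#   shm = shared_memory.SharedMemory(create=True, size=SHAPE[0] * SHAPE[1])
#   socket = zmq.Context().socket(zmq.PUSH); socket.bind("tcp://127.0.0.1:5555")
# Setup (process-2):
#   socket = zmq.Context().socket(zmq.PULL); socket.connect("tcp://127.0.0.1:5555")
```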

#### **3. Object tracking with UAV images**

Object tracking is the re-detection of a target in consecutive frames after the tracker is initialized with a first bounding box as input. It is a challenging problem in situations involving fast camera movement, occlusion, background movement, clutter, illumination, and scale changes. Tracking methods can be grouped into categories such as detection-based tracking, detection-free tracking, 3D object tracking, short-term tracking, and long-term tracking. Detection-based tracking requires an object detector, and tracking means assigning an ID to each object. Detection-free tracking can be preferred for UAV images to handle arbitrary targets and small objects which are hard to detect with an object detector. As a simple approach, we can eliminate wrong detections by following each candidate moving object region and confirming the movement of the object with the tracker, and then deciding whether the object is moving based on the tracker output. Thus, target tracking can be used in cooperation with motion detection to increase accuracy and provide better tracking.

The software architecture suggested in the previous section also seems a reasonable way to implement the tracking method applied after the motion detector. In this section,


#### **Table 4.**

*Performance comparison of tracker methods on UAV123 long-term tracking sequences.*

we compare the performance of several tracking methods on the UAV123 dataset [31]. The dataset consists of a total of 123 video sequences obtained from low-altitude UAVs. A subset of 20 sequences is evaluated separately for long-term object tracking, in which targets are sometimes occluded and repeatedly appear and disappear, providing a better benchmark for long-term tracking. We compare the performance of classical methods such as TLD [32], KCF [33], CSRT [34], and ECO [35], and the deep learning-based method Re3 [36]. Among the classical methods, only TLD can handle disappearing targets in long-term tracking. Even though the ECO and CSRT trackers are successful at tracking non-occluded objects, they have no mechanism to re-detect an object after a failure. TLD can recover from full occlusion but produces frequent false positives. KCF is faster than TLD, CSRT, and ECO but has lower performance. ECO and CSRT perform reasonably except in the occlusion and recovery cases that are especially important in long-term tracking. On the other hand, the lightweight Re3 model can track objects at a higher frame rate (about 100–150 fps depending on the GPU specifications), which allows us to track multiple objects in real-time. Average tracker performances on the UAV123 long-term subset sequences are presented in **Table 4**.

Re3(S) indicates the small (lightweight) Re3 model in **Table 4**, and the average scores show that Re3 has the best recall score by far. In the performance comparison, a detection is considered true (TP) if the intersection over union (IOU) between the predicted and ground truth bounding boxes is greater than 0.5, as in the evaluation sketch in Section 2.2. Experiments show that a moving object detection algorithm supported by a tracking method provides significant advantages both in eliminating wrong detections and in continuous tracking.

#### **4. Training CNN for moving object detection**

Deep learning-based solutions are an important alternative for eliminating the disadvantages of classical methods in the moving object detection problem, because background modeling-based methods suffer from a high number of false detections. We mentioned the deep learning-based optical flow studies at the beginning of the chapter. This section summarizes the situation for supervised deep learning methods applied to the moving object detection problem.

Deep learning-based methods outperform classical image processing-based methods on the CDNET dataset, but CDNET does not contain free-motion videos. CDNET ground truths are pixel-wise masks of moving objects. FgSegNetV2 [37] is an encoder-decoder type deep neural network that performs well on the CDNET dataset. MotionRec [38] is a single-stage deep learning framework proposed for the moving object detection problem. It first estimates the background representation from past history frames with a temporal depth reduction block. The temporal and spatial features are used to generate multi-level feature pyramids with a backbone model. Finally, the multi-level feature pyramid is used in the regression and classification layers. MotionRec runs in the range of 2 to 5 fps, depending on the selected temporal history depth (from 10 to 30), on an Nvidia Titan Xp GPU. JanusNet [39] is another deep network trained for moving object detection from UAV images. It extracts and combines dense optical flow and generates a coarse foreground attention map, and experiments show that it efficiently detects small moving targets. JanusNet is trained with a simulated dataset generated using Unreal Engine 4. It runs at 25 fps on an Nvidia GTX 1070 GPU and 3.1 fps on an Nvidia Jetson Nano for 640 × 640 resolution images. JanusNet also includes a performance comparison with FgSegNetV2, which shows that FgSegNetV2 cannot perform well on UAV videos because it must be trained on a specific scene to work well on that scene. Considering the deep learning studies in the literature and the datasets used for training the models, it can be said that there is still a long way to go for a general-purpose supervised moving object detection method. On the other hand, classical methods can achieve reasonable results with additional post-processing techniques and, most importantly, they can run in real-time even on Nvidia edge modules.

#### **5. Conclusions**

This chapter discusses the moving object detection problem for UAV videos. We present datasets, the performance of several methods from the literature, the challenges, and prospective solutions. For motion detection, background modeling-based methods are emphasized in particular, and some post-processing methods are proposed to improve performance as a solution to the challenges. We propose dense optical flow and simple tracking as post-processing steps within a specific software architecture. Moreover, we evaluate selected trackers on a long-term object tracking dataset to analyze their performance. Finally, we introduce some deep learning architectures and compare them with traditional methods in terms of general-purpose and real-life use.

#### **Author details**

İbrahim Delibaşoğlu Faculty of Computer and Information Sciences, Department of Software Engineering, Sakarya University, Sakarya, Turkey

\*Address all correspondence to: ibrahimdelibasoglu@sakarya.edu.tr

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Chapel M, Bouwmans T. Moving objects detection with a moving camera: A comprehensive review. Computer Science Review. 2020;**38**:100310

[2] Collins R, Lipton A, Kanade T, Fujiyoshi H, Duggins D, Tsin Y, et al. A system for video surveillance and monitoring. VSAM Final Report. 2000; **2000**:1

[3] Bouwmans T, Höferlin B, Porikli F, Vacavant A. Traditional approaches in background modeling for video surveillance. In: Handbook of Background Modeling and Foreground Detection for Video Surveillance. Taylor & Francis Group; 2014

[4] Allebosch G, Deboeverie F, Veelaert P, Philips W. EFIC: Edge based foreground background segmentation and interior classification for dynamic camera viewpoints. International Conference On Advanced Concepts For Intelligent Vision Systems. 2015. pp. 130-141

[5] Zivkovic Z, Van Der Heijden F. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters. 2006;**27**:773-780

[6] Moo Yi K, Yun K, Wan Kim S, Jin Chang H, Young Choi J. Detection of moving objects with non-stationary cameras in 5.8 ms: Bringing motion detection to your mobile device. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013. pp. 27-34

[7] Zivkovic Z. Improved adaptive Gaussian mixture model for background subtraction. Proceedings of the 17th International Conference on Pattern Recognition. 2004. pp. 28-31

[8] De Gregorio M, Giordano M. WiSARDrp for Change Detection in Video Sequences. ESANN; 2017

[9] Stauffer C, Grimson W. Adaptive background mixture models for realtime tracking. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149). 1999. pp. 246-252

[10] Kim S, Yun K, Yi K, Kim S, Choi J. Detection of moving objects with a moving camera using non-panoramic background model. Machine Vision and Applications. 2013;**24**:1015-1028

[11] Zhong Z, Zhang B, Lu G, Zhao Y, Xu Y. An adaptive background modeling method for foreground segmentation. IEEE Transactions on Intelligent Transportation Systems. 2016;**18**: 1109-1121

[12] Zhong Z, Wen J, Zhang B, Xu Y. A general moving detection method using dual-target nonparametric background model. Knowledge-Based Systems. 2019; **164**:85-95

[13] Yun K, Lim J, Choi J. Scene conditional background update for moving object detection in a moving camera. Pattern Recognition Letters. 2017;**88**:57-63

[14] Yu Y, Kurnianggoro L, Jo K. Moving object detection for a moving camera based on global motion compensation and adaptive background model. International Journal of Control, Automation and Systems. 2019;**17**: 1866-1874

[15] Delibasoglu I. Real-time motion detection with candidate masks and region growing for moving cameras. *Surveillance with UAV Videos DOI: http://dx.doi.org/10.5772/intechopen.105959*

Journal of Electronic Imaging. 2021;**30**: 063027

[16] Tomasi C, Kanade T. Detection and tracking of point features. International Journal of Computer Vision. 1991;**9**:137-154

[17] Fischler M, Bolles R. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM. 1981;**24**: 381-395

[18] Heikkilä M, Pietikäinen M, Heikkilä J. A texture-based method for detecting moving objects. BMVC. 2004; **401**:1-10

[19] Huerta I, Rowe D, Viñas M, Mozerov M, Gonzàlez J. Background Subtraction Fusing Colour, Intensity and Edge Cues. Proceedings of the Conference on AMDO. 2007. pp. 279-288

[20] Zhao P, Zhao Y, Cai A. Hierarchical codebook background model using haarlike features. IEEE International Conference on Network Infrastructure and Digital Content. 2012. pp. 438-442

[21] Bilodeau G, Jodoin J, Saunier N. Change detection in feature space using local binary similarity patterns. International Conference on Computer and Robot Vision. 2013. pp. 106-112

[22] Wang T, Liang J, Wang X, Wang S. Background modeling using local binary patterns of motion vector. Visual Communications and Image Processing. 2012. pp. 1-5

[23] Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 2462-2470

[24] Huang J, Zou W, Zhu J, Zhu Z. Optical flow based real-time moving object detection in unconstrained scenes. 2018

[25] Butler D, Wulff J, Stanley G, Black M. A naturalistic open source movie for optical flow evaluation. European Conference on Computer Vision (ECCV). 2012. pp. 611-625

[26] Delibasoglu I. UAV images dataset for moving object detection from moving cameras. 2021

[27] Wang Y, Jodoin P, Porikli F, Konrad J, Benezeth Y, Ishwar P. CDnet 2014: An expanded change detection benchmark dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014. pp. 387-394

[28] Collins R, Zhou X, Teh S. An open source tracking testbed and evaluation web site. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. 2005. p. 35

[29] Garcia-Garcia B, Bouwmans T, Silva A. Background subtraction in real applications: Challenges, current models and future directions. Computer Science Review. 2020;**35**:100204

[30] Wang Z, Bovik A, Sheikh H, Simoncelli E. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;**13**:600-612

[31] Mueller M, Smith N, Ghanem B. A benchmark and simulator for uav tracking. European Conference on Computer Vision. 2016;**2016**:445-461

[32] Kalal Z, Mikolajczyk K, Matas J. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;**34**:1409-1422

[33] Henriques J, Caseiro R, Martins P, Batista J. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014;**37**:583-596

[34] Lukežič A, Vojíř T, Čehovin Zajc L, Matas J, Kristan M. Discriminative correlation filter tracker with channel and spatial reliability. International Journal of Computer Vision. 2018;**126**(7):671-688

[35] Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M. Eco: Efficient convolution operators for tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 6638-6646

[36] Gordon D, Farhadi A, Fox D. Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters. 2018;**3**:788-795

[37] Lim L, Keles H. Learning multi-scale features for foreground segmentation. Pattern Analysis and Applications. 2020; **23**:1369-1380

[38] Mandal M, Kumar L, Saran M. MotionRec: A unified deep framework for moving object recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020. pp. 2734-2743

[39] Zhao Y, Shafique K, Rasheed Z, Li M. JanusNet: Detection of moving objects from UAV platforms. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. pp. 3899-3908
