
*Novel Methods for Forensic Multimedia Data Analysis: Part I*


*DOI: http://dx.doi.org/10.5772/intechopen.92167*


*Digital Forensic Science*

the crowd on actions taken by the law officers. Such approaches have already been deployed in finance and marketing to gauge the mood of financial markets and consumer opinion [92, 121, 122]. Similar concepts can be adapted for forensic applications. In fact, the FBI and the Pentagon have already started to use these methods to predict criminal and terrorist activities and to monitor persons and regions of high interest [AP Exclusive].

The innovation of the tool in this area lies in the fact that the combination of the discussed methods has never been proposed for visualizing and clustering data, nor integrated into a software system. It will be the first integrated human-centered data discovery environment that combines statistical methods from machine learning with order-theoretic methods such as concept lattices. The self-organizing map, which can handle high-dimensional data spaces and is therefore an ideal tool for initial preprocessing, is at the start of the human-centered discovery process. FCA can then be used to explore dependencies and information links in a smaller subset. TCA and TRSS are used for in-depth profiling of identified individuals and communities. In particular, we focus on the niche of Twitter user and feed mining within the broader text-mining field. State-of-the-art domain adaptation methods will be tested to improve the accuracy of the linguistic annotation tools on Twitter data, and customized term-extraction methods will be devised to reliably extract relevant keywords from tweets. Needless to say, the proposed system can be easily expanded to other text mining applications.

A web crawler will be designed to collect the feeds from the Twitter website. This is a technically challenging yet well-known task in the scientific community (see, e.g., [107]). The data collection can be done by an employee hired by the police who has received a type P screening. The data consist of text fragments. Concerning languages, we will first focus on Dutch tweets. This may later be extended to Hungarian and Bulgarian since most organized crime in areas such as human trafficking is committed by these nationalities in Amsterdam. Since a tweet consists of, among other things, a user name, the user's Twitter ID, and the posted text, as well as potentially the IDs and names of other users, we will first replace these user-identifiable items with numeric values using regular expressions. In a second step, we will use available Named Entity Recognition methods to remove person names from the tweets themselves.

**4.5 Video analysis**

*4.5.1 State of the art*

Video retrieval has a long history [123–125]. Depending on the type of video at hand (e.g., film, news, or CCTV recording), different retrieval tasks can be defined, both in terms of the type of query and in terms of the processing techniques suited to extracting meaningful concepts. For example, filmmaking relies on techniques whose goal is to provoke emotions in the viewer. Thus, to retrieve concepts from such videos, automatic techniques must take into account not only the characteristics of the scene but also camera movements and video editing techniques. On the other hand, the fixed cameras used for video surveillance allow for the detection of persons and objects moving within the monitored area, as the characteristics of the scene are known in advance. A vast body of research has been carried out on these topics in recent years, and a number of automatic analysis techniques are embedded in commercial products [126].

One of the first steps in video analysis is the detection of shots, that is, video sequences that contain a continuous camera action in time and space [127, 128].


In the case of films, broadcast news, and sports videos, shot detection is performed by looking for well-known separators, such as fades and black frames. Each shot is then characterized by one or more key frames, that is, frames that are representative of the shot. Shot classification can be performed by extracting suitable features and using machine-learning techniques for concept classification. Features can be extracted either from key frames or from global characteristics of the video sequence. They can represent low-level information such as color and texture, as well as characteristics of the shot such as temporal features.
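The separator-based shot detection described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited system: it assumes each frame is a flat sequence of 8-bit grayscale values, treats a very dark frame as a black separator, and additionally flags a hard cut when the intensity histograms of two consecutive frames differ strongly. The function names and both thresholds are hypothetical choices made for the example.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumed: flat sequence of 8-bit grayscale pixel values


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized intensity histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def is_black(frame: Frame, max_mean: float = 10.0) -> bool:
    """Treat a frame as a black separator if its mean intensity is very low."""
    return sum(frame) / len(frame) <= max_mean


def shot_boundaries(frames: List[Frame], cut_threshold: float = 0.5) -> List[int]:
    """Return the indices at which a new shot is assumed to start."""
    boundaries = []
    for i in range(1, len(frames)):
        if is_black(frames[i]):
            continue  # separator frames do not start a shot themselves
        if is_black(frames[i - 1]):
            boundaries.append(i)  # first visible frame after a black separator
            continue
        h_prev, h_cur = histogram(frames[i - 1]), histogram(frames[i])
        # L1 distance between two normalized histograms lies in [0, 2]
        distance = sum(abs(a - b) for a, b in zip(h_prev, h_cur))
        if distance > cut_threshold:
            boundaries.append(i)  # abrupt change in content: assumed hard cut
    return boundaries


def key_frame(start: int, end: int) -> int:
    """Pick the middle frame of the shot [start, end) as its key frame."""
    return (start + end) // 2
```

Choosing the middle frame as the key frame is only one simple convention; a real system would more likely select the frame closest to the average appearance of the shot.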

A number of techniques for carrying out these steps have been developed for TV broadcasters, in particular for sports and news programs [123, 124]. In these areas, knowledge of the rules of the game and of the conventions of video shooting has allowed a reliable ground truth to be built, which enables objective comparisons of different algorithms. The classification of video shots can be used for retrieval purposes whenever the goal is to retrieve all videos related to a particular class. On the other hand, the use of these techniques for forensic applications still needs more investigation because of the low resolution of the cameras, the variability of the recorded scenes, and the presence of persons and objects that are typically in nonfrontal positions and heavily occluded.

Today, the reidentification of people in videos is of particular interest [129, 130]. The problem can be formulated as follows. In many real scenarios, an area is monitored by a number of cameras. When persons move through the monitored environment, they can be identified by their face only if it appears in the video in a suitable pose. After they have been identified in one of the videos, they can be tracked (i.e., reidentified) according to their global appearance (e.g., their clothes) rather than by their face.
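As a rough illustration of reidentification by global appearance, the sketch below describes each detected person by a coarse color histogram of the pixels in their bounding box and matches a new detection against already identified persons by histogram intersection. The descriptor, the similarity measure, the `min_score` threshold, and all names are assumptions made for this example; the cited work may use quite different appearance models.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed: (R, G, B), each in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram of the pixels in a person's bounding box."""
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        index = (r * b // 256) * b * b + (g * b // 256) * b + (bl * b // 256)
        hist[index] += 1.0
    n = float(len(pixels))
    return [h / n for h in hist]


def intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def reidentify(
    probe: Sequence[float],
    gallery: Dict[str, List[float]],
    min_score: float = 0.5,
) -> Optional[str]:
    """Return the gallery identity most similar to the probe, or None if too dissimilar."""
    best_id, best_score = None, 0.0
    for person_id, hist in gallery.items():
        score = intersection(probe, hist)
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= min_score else None
```

Here the gallery would hold the histograms of persons already identified by face in one camera, and each probe would be a detection from another camera.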

Speech and sound files constitute an important part of the data collected by law enforcement agencies. For the last 35 years, practical speech recognition systems have been based on hidden Markov models (HMMs), which model the training data in a global manner using the Baum-Welch algorithm. The probability distributions of the Markov states are represented using Gaussian mixture models (GMMs). HMMs try to represent the time-varying nature of speech and sound [131, 132]. This approach is successful to some extent in controlled environments and in dictation systems in which people speak clearly to the machine [133].

HMMs and GMMs use features extracted from temporal speech windows. Current speech and sound feature extraction schemes are based on Fourier analysis [131, 134, 135]. Temporal information is incorporated into automatic speech recognition systems only by dividing speech into temporal analysis windows. Unfortunately, this global approach loses keyword- or speaker-specific features, which are needed in forensic applications. For example, a person cannot modify his or her own average temporal zero-crossing rate, even when trying to disguise his or her voice by mumbling or by talking with a mouth full of food or cotton balls [136]. This kind of temporal, person-specific information is not used in today's systems, which are globally trained using all the available data.
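To make the zero-crossing rate mentioned above concrete, the following sketch computes it per analysis window and averages it over a recording. It assumes the signal is a sequence of floating-point samples centered around zero; the window length of 400 samples and the hop of 160 samples are illustrative defaults, not values taken from the cited work.

```python
import math
from typing import List, Sequence


def zero_crossing_rate(window: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (needs at least 2 samples)."""
    crossings = sum(
        1 for a, b in zip(window, window[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(window) - 1)


def framewise_zcr(signal: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Zero-crossing rate of each complete analysis window of the signal."""
    return [
        zero_crossing_rate(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def average_zcr(signal: Sequence[float], frame_len: int = 400, hop: int = 160) -> float:
    """Average zero-crossing rate over all windows of a recording."""
    rates = framewise_zcr(signal, frame_len, hop)
    return sum(rates) / len(rates)


def tone(freq_hz: float, sample_rate: int = 8000, seconds: float = 1.0) -> List[float]:
    """Synthetic test tone; the phase offset avoids samples that fall exactly on zero."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * k / sample_rate + 0.3) for k in range(n)]
```

For a pure tone of frequency f sampled at rate fs, the rate is close to 2f/fs, so `average_zcr(tone(1000.0))` comes out near 0.25, far above the value for `tone(100.0)`.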

Global approaches provide good speech and speaker recognition and identification results as long as a good description of the unobserved data is available. However, continuous spontaneous speech recognition is still an unsolved problem [133, 137]. Unfortunately, most of the speech data in legal cases are spontaneous speech. Many applications require keywords, phrases, names, and speakers to be retrieved from spontaneous speech in real time. Therefore, it is necessary to develop not only new feature extraction and speech and sound representation schemes but also exemplar-type case- and similarity-based reasoning methods to improve current speech and sound processing systems.

*4.5.2 Beyond the state of the art*

The analysis of videos for forensic applications can rely on some of the above techniques, provided they are tailored to the scenario at hand. In the case of surveillance videos, a shot cannot be defined according to the paradigm used to segment a film or a sports video [126, 138]. Rather, the definition of a "shot" can be driven by the event that is sought in the video. In particular, the video analyst should be able to query the system so that the video is first segmented according to the particular event; the shots that are likely to contain the event of interest are then analyzed further by more sophisticated techniques in order to detect the object of interest [139]. The development of such a system is beyond the current state of the art, and it will be carried out within this project.

The development of reidentification techniques may allow a person to be tracked in videos collected by multiple cameras at different locations and at different times. People can be detected by face detection. Many existing facial recognition systems are sensitive to variations in the enrolment phase [140–145]. These systems have often been trained on a huge number of pictures of the same person in order to estimate reliable parameter values for statistical classifiers. The current state of the art includes neither a suitable system for generating a prototype picture of a person nor a suitable prototype-based classifier [146, 147]. Some automatic prototype generation methods developed in the area of pattern recognition could be used for face recognition [148–150].

A prototype-based system could effectively handle changes in illumination, as it can perform recognition by part resemblance [151, 152]. For the above reasons, most of the facial recognition systems available today assume a standardized enrolment procedure performed in a controlled environment (e.g., a cabin), where a number of pictures of the face in a frontal (2-D) position with respect to the camera are taken. In addition, the picture is renewed whenever the recognition accuracy decreases.

Many different methods have been used so far for face recognition, covering a wide spectrum of the pattern recognition field: geometrical representation of the face [153], templates [154, 155], hidden Markov models [156], principal component analysis [157], independent component analysis [158], elastic graph matching [159], trace transform [160], and support vector machines (SVMs) [161]. None of these methods can be regarded as the most promising, because performance depends on the scenario at hand and the assumptions behind the proposed theoretical models might not be met in real scenarios. Thus, new techniques based on the exploitation of different picture representations, such as shape, texture, signs for skin, eyes and spatial, sign-based connections, and the prototype-based system, have to be investigated.

Case- and similarity-based recognition and sensing methods for speech, sound, and audio recognition that use both temporal and frequency-domain information will be developed. The development of "query by example," keyword-based, and phrase-based retrieval schemes built on exemplars and capable of both partial and whole similarity matching will be a significant contribution to existing speech recognition systems.
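One common way to realize query-by-example matching with both whole and partial similarity is dynamic time warping (DTW). The sketch below is an illustration under that assumption and is not the scheme the authors propose: it compares a spoken query with a stored exemplar either end to end (`dtw_whole`) or against the best-matching stretch of a longer recording (`dtw_partial`). For brevity each frame is reduced to a single number, such as a per-window zero-crossing rate; real features would be vectors.

```python
import math
from typing import Sequence, Tuple


def dtw_whole(query: Sequence[float], reference: Sequence[float]) -> float:
    """Accumulated DTW cost of aligning the whole query with the whole reference."""
    m = len(reference)
    prev = [0.0] + [math.inf] * m  # row 0: only the empty alignment is free
    for q in query:
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            local = abs(q - reference[j - 1])
            cur[j] = local + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def dtw_partial(query: Sequence[float], reference: Sequence[float]) -> Tuple[float, int]:
    """Best alignment of the whole query to any contiguous part of the reference.

    Returns the accumulated cost and the reference index where the match ends.
    """
    m = len(reference)
    prev = [0.0] * (m + 1)  # row 0 is all zeros: the match may start anywhere
    for q in query:
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            local = abs(q - reference[j - 1])
            cur[j] = local + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    end = min(range(1, m + 1), key=lambda j: prev[j])  # the match may end anywhere
    return prev[end], end - 1
```

A low partial cost indicates that the query keyword probably occurs somewhere in the recording, and the returned end index tells the analyst where to listen.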

Current methods for speech and audio analysis emphasize spectral methods. For example, the well-known Shazam music recognition method uses only spectral peaks [162]. The commonly used mel-cepstral coefficients, line spectral frequencies, and RASTA features [134, 135] do not carry any temporal information either. We believe that temporal information is not fully utilized in current methods. Temporal information will provide critical information for speaker recognition and keyword spotting applications. We are developing temporal speech representation methods based on delta modulation [163, 164], zero-crossing, and wavelet scattering [165, 166] information, which will be incorporated into content-based audio and sound retrieval and into speech and audio recognition applications.
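For readers unfamiliar with delta modulation, the following minimal sketch shows the basic one-bit scheme: each sample is encoded as a single bit stating whether the signal lies above or below a running estimate, which then moves up or down by a fixed step. The step size and function names are illustrative assumptions; the representations under development may use adaptive or otherwise refined variants.

```python
from typing import List, Sequence


def delta_modulate(signal: Sequence[float], step: float = 0.1) -> List[int]:
    """Encode a signal as one bit per sample: 1 = estimate moves up, 0 = moves down."""
    bits = []
    estimate = 0.0
    for sample in signal:
        bit = 1 if sample >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_demodulate(bits: Sequence[int], step: float = 0.1) -> List[float]:
    """Rebuild the staircase approximation by accumulating the encoded steps."""
    reconstruction = []
    estimate = 0.0
    for bit in bits:
        estimate += step if bit else -step
        reconstruction.append(estimate)
    return reconstruction
```

The bit stream is a purely temporal, differential description of the waveform. As long as the signal changes by less than one step per sample, the reconstruction stays within one step of the original.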

As pointed out above, another important avenue not explored by current methods is compressive recognition, similarity-based reasoning, and case-based reasoning. Current data modeling methods assume a global representation. Case- and similarity-based reasoning methods, on the other hand, will be able to incorporate fine details of the test case and are likely to provide better recognition results, especially for spontaneous speech. Temporal representation methods such as delta modulation and zero-crossing information are ideal for exemplar- and similarity-based reasoning approaches. It is also possible to combine the differential representation of temporal data with the spectral data using compressive sensing [167], which extends this differential data processing concept by using random weights that sum to zero to linearly combine the data and/or features. In this way, similarity learning, case generalization and case storage, and compressive learning and sensing will allow very large amounts (terabytes) of data to be handled. Once the keywords and phrases are detected, analysts can manually process the proposed retrieval results.

Cut-and-paste locations in speech can also be detected using delta modulation and wavelet scattering, which provide a differential representation of speech, sound, and audio data. Fragile watermarking schemes based on wavelet scattering and delta modulation will be developed to prevent tampering. The resulting representation can be easily stored and will be suitable for a range of forensic purposes.
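The idea of linearly combining data with random weights that sum to zero can be illustrated with a small sketch. A first difference x[k] - x[k-1] uses the weights (+1, -1), which sum to zero; the code below generalizes this to random zero-sum weight vectors and uses a handful of them to compress a feature vector. This covers only the measurement step, with hypothetical names and Gaussian weights chosen for the example, and says nothing about how the compressed data would be used for recovery or recognition in [167].

```python
import random
from typing import List, Sequence


def zero_sum_weights(n: int, rng: random.Random) -> List[float]:
    """Random weight vector of length n, shifted so that its entries sum to zero."""
    weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(weights) / n
    return [w - mean for w in weights]


def measurement_matrix(m: int, n: int, seed: int = 0) -> List[List[float]]:
    """m random zero-sum rows that map an n-dimensional vector to m measurements."""
    rng = random.Random(seed)
    return [zero_sum_weights(n, rng) for _ in range(m)]


def compress(matrix: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Each measurement is one random zero-sum linear combination of the data."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]
```

Because every row sums to zero, adding a constant offset to the data leaves the measurements unchanged, just as it leaves a first difference unchanged. In that sense the measurements remain differential.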

**5. Conclusions**

Forensic investigations on multimedia evidence usually proceed through four steps: analysis, selection, evaluation, and comparison. During the analysis step, technicians typically look at huge amounts of different multimedia data (e.g., hours of video or audio recordings, many pages of text, and hundreds of pictures) to reconstruct the dynamics of the event and collect any piece of relevant information. This step obviously requires a lot of time, and many factors can make it difficult, among which data heterogeneity, quality, and quantity are the most relevant. Afterward, during the selection step, technicians select and acquire the most meaningful pieces of information from the different multimedia data (e.g., frames from videos, audio fragments, and documents). Then, in the evaluation step, they look for relevant elements in the selected data, which will be further investigated in the comparison step. They can select heads, vehicles, license plates, guns, sentences, sounds, and any other elements that can link a person to the event. The main problems are the low quality of the media data due to high compression, adverse environmental conditions (e.g., noise and bad lighting), camera/object position, and facial expressions. Finally, during the comparison step, technicians place the extracted elements side by side with a known element of comparison. From the comparison of general and particular characteristics, the operators assign a level of similarity. In forensic applications, automatic pattern recognition systems perform poorly because of the high variability of data recording. On the other hand, human perception is a great pattern recognition system but is characterized by high subjectivity and unknown reproducibility and performance.

In this chapter, we propose to develop a toolkit of methods and instruments that will be able to support analysts along all these steps, strongly reducing human intervention. First of all, it will include instruments to process different kinds of

