**5. Conclusions**

*Digital Forensic Science*

*4.5.2 Beyond the state of the art*

The analysis of videos for forensic applications can be carried out by relying on some of the above techniques, provided they are tailored to the scenario at hand. In the case of surveillance videos, we cannot define a shot according to the paradigm used to segment a film or a sports video [126, 138]. Rather, the definition of "shot" can be driven by the event that is looked for in the video. In particular, the video analyst should be able to query the system, so that the video is first segmented according to the particular event, and then the shots that contain the event of interest with high probability are further analyzed by more sophisticated techniques in order to detect the object of interest [139]. The development of such a system is beyond the current state of the art, and it will be carried out within this project.

The development of reidentification techniques may allow tracking a person in videos collected by multiple cameras at different locations and in different periods. Detecting people can be carried out by face detection. Many of the existing facial recognition systems are sensitive to variations in the enrolment phase [140–145]. Often these systems have been trained on a huge number of pictures of the same person to estimate reliable values of the parameters for statistical classifiers. The current state of the art does not include a suitable system for the generation of a prototype picture of a person nor a suitable prototype-based classifier [146, 147]. Some automatic prototype generation methods developed in the area of pattern recognition could be used for face recognition [148–150].

Prototype-based systems could effectively handle changes in illumination, as they can perform recognition by part resemblance [151, 152]. For the above reasons, most of the facial recognition systems available today assume a standardized enrolment procedure to be performed in a controlled environment (e.g., a cabin), where a number of pictures of the face in a frontal position (2-D) with respect to the camera are taken. In addition, the picture is renewed whenever the recognition accuracy decreases.

Many different methods have been used so far for face recognition, covering a wide spectrum of the pattern recognition field: geometrical representation of the face [153], templates [154, 155], hidden Markov models [156], principal component analysis [157], independent component analysis [158], elastic graph matching [159], trace transform [160], and SVM [161]. None of these methods can be seen as the most promising one, because the performance depends on the scenario at hand, and the assumptions behind the proposed theoretical models might not be met in real scenarios. Thus, new techniques based on the exploitation of different picture representations, such as shape, texture, signs for skin, eyes and spatial, sign-based connections, and the prototype-based system, have to be investigated.

Case and similarity-based recognition and sensing methods for speech, sound, and audio recognition using both temporal and frequency domain information will be developed. The development of "query by example," keyword, and phrase-based retrieval schemes using exemplar-based schemes, which will be capable of part and whole similarity matching, will be a significant contribution to the existing speech recognition systems.

Current methods for speech and audio analysis emphasize spectral methods. For example, the well-known Shazam music recognition method uses only spectral peaks [162]. The commonly used mel-cepstral coefficients, line spectral frequencies, and RASTA features [134, 135] do not carry any temporal information, either. We believe that temporal information is not fully utilized in current methods. Temporal information will provide critical information for speaker recognition and keyword spotting.
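The loss of temporal information in purely spectral descriptions can be illustrated with a small sketch. It is not part of the system described here; the test signal, the sampling rate, the frame length, and the helper names `magnitude_spectrum` and `dominant_bin_per_frame` are all assumptions chosen for illustration. A signal and its time-reversed copy have the same whole-signal magnitude spectrum, so a feature built only from that spectrum cannot tell them apart, whereas a frame-by-frame analysis recovers the order of events.

```python
import numpy as np


def magnitude_spectrum(signal):
    """Magnitude of the one-sided DFT of the whole signal (no time axis)."""
    return np.abs(np.fft.rfft(signal))


def dominant_bin_per_frame(signal, frame_len):
    """Index of the strongest frequency bin in each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)  # symmetric Hann window
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return spectra.argmax(axis=1)


# Hypothetical test signal: 0.5 s at 500 Hz followed by 0.5 s at 1500 Hz.
fs = 8000                      # sampling rate in Hz (assumed)
t = np.arange(fs // 2) / fs    # time axis for one 0.5 s segment
low = np.sin(2 * np.pi * 500 * t)
high = np.sin(2 * np.pi * 1500 * t)

rising = np.concatenate([low, high])   # low tone first, then high tone
falling = rising[::-1]                 # same samples in reverse order

# Whole-signal magnitude spectra are identical: the order of the tones is lost.
same_spectrum = np.allclose(magnitude_spectrum(rising),
                            magnitude_spectrum(falling))

# Frame-wise analysis (400 samples = 50 ms, 20 Hz per bin) keeps the order.
rising_track = dominant_bin_per_frame(rising, 400)
falling_track = dominant_bin_per_frame(falling, 400)

print("identical magnitude spectra:", same_spectrum)
print("rising  track (first, last bin):", rising_track[0], rising_track[-1])
print("falling track (first, last bin):", falling_track[0], falling_track[-1])
```

In this toy case the two signals differ only in the order of their tones, which is exactly the kind of cue that is lost when the time axis is discarded.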


Forensic investigations on multimedia evidence usually proceed through four steps: analysis, selection, evaluation, and comparison. During the analysis step, technicians typically examine huge amounts of heterogeneous multimedia data (e.g., hours of video or audio recordings, many pages of text, and hundreds of pictures) to reconstruct the dynamics of the event and collect any relevant information. This step is time-consuming, and many factors can make it difficult, among which data heterogeneity, quality, and quantity are the most relevant. Afterward, during the selection step, technicians select and acquire the most meaningful pieces of information from the different multimedia data (e.g., frames from videos, audio fragments, and documents). Then, in the evaluation step, they look for relevant elements in the selected data, which will be further investigated in the comparison step. They can select heads, vehicles, license plates, guns, sentences, sounds, and any other elements that can link a person to the event. The main problems are the low quality of the media data due to high compression, adverse environmental conditions (e.g., noise and bad lighting conditions), camera/object position, and facial expressions. Finally, during the comparison step, technicians place the extracted elements side by side with a known reference element. From the comparison of general and particular characteristics, the operators assign a level of similarity. In forensic applications, automatic pattern recognition systems perform poorly because of the high variability in how the data are recorded. On the other hand, human perception is a powerful pattern recognition system but is characterized by high subjectivity and unknown reproducibility and performance.

In this chapter, we propose to develop a toolkit of methods and instruments that will support analysts along all these steps, strongly reducing human intervention. First of all, it will include instruments to process different kinds of media data and, possibly, correlate them. This will reduce the time spent finding the correct instruments for processing the medium at hand. Furthermore, it will comprise preprocessing tools that alleviate, by filtering and enhancement, the problem of low data quality. In particular, for image and video data, a great help will come from super-resolution methods that will maximize the information contained in low-resolution images or videos (e.g., to foster the process of face reconstruction and recognition from blurred images). This feature will greatly support all the subsequent steps.
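As a rough illustration of what a filtering-based preprocessing tool can do, the sketch below applies a 3×3 median filter to a synthetic image corrupted by impulse noise. The choice of a median filter, the synthetic gradient image, the 5% noise rate, and the function name `median_filter_3x3` are assumptions made for this example only; the sketch does not implement super-resolution and is not the toolkit's actual preprocessing chain.

```python
import numpy as np


def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    Borders are handled by repeating the edge pixels.
    """
    height, width = image.shape
    padded = np.pad(image, 1, mode="edge")
    # Stack the nine shifted views so that axis 0 indexes the neighbours.
    neighbours = np.stack([
        padded[dy:dy + height, dx:dx + width]
        for dy in range(3)
        for dx in range(3)
    ])
    return np.median(neighbours, axis=0)


# Synthetic "clean" image: a horizontal intensity ramp (assumed test data).
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))

# Corrupt about 5% of the pixels with salt-and-pepper (impulse) noise.
noisy = clean.copy()
mask = rng.random(clean.shape) < 0.05
noisy[mask] = rng.choice([0.0, 1.0], size=mask.sum())

restored = median_filter_3x3(noisy)

# Mean absolute error with respect to the clean image, before and after.
err_noisy = np.abs(noisy - clean).mean()
err_restored = np.abs(restored - clean).mean()
print("error before filtering:", err_noisy)
print("error after filtering: ", err_restored)
```

A median filter is used here because it removes isolated outliers while leaving smooth regions largely unchanged; real surveillance material would typically require filters matched to the actual degradation (compression artifacts, blur, low light).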

In this chapter, we focused on the background and motivation for our work. We explained the overall system architecture and presented the data to be used. After reviewing the state of the art for the types of multimedia data considered in this work, we described the methods and techniques we are developing that go beyond the state of the art. The work will be continued in the chapter Part II of Forensic Multimedia Data Analysis.
