**3.1 General video coding and compression**

The goal of this section is to briefly introduce the concepts of video coding and compression, which tend to be used interchangeably since they are closely related. What follows is a short introduction to general video compression, for which Lorente (2011) is an example of a recent exhaustive review.

Video compression is the process of converting a video signal into a format that takes up less storage space or transmission bandwidth. It can be considered as a coding scheme that reduces the number of bits representing the video. Nevertheless, the overall visual quality has to be preserved, leading to a compromise between the level of artifacts and the bandwidth. Two types of compression can be distinguished: lossy and lossless compression Richardson (2003). In a lossless compression system, statistical redundancy is removed so that the original data can be perfectly reconstructed. Unfortunately, at the present time lossless methods only allow a modest amount of compression, insufficient for most video applications. On the other hand, lossy compression provides larger compression ratios, at the expense of not being able to reconstruct the original signal perfectly. Lossy compression is the type of compression most commonly used for video, attention-based or not.

It is interesting to note that, even if generic compression algorithms do not explicitly use saliency, they implicitly exploit the mechanisms of human visual perception to remove redundant information Geisler & Perry (1998). For example, retinal persistence of vision makes the human eye keep an instantaneous view of a scene for about one-tenth of a second. This allows video (theoretically a continuum in time) to be represented by a series of discrete frames (e.g., 24 frames per second) with no apparent motion distortion.

Coding standards define a representation of visual data as well as a method of decoding it to reconstruct visual information. Recent hybrid standards like H.264/AVC Wiegand et al. (2003) have led to significant progress in compression quality, allowing for instance the transmission of high-definition (HD) television signals over a limited-capacity broadcast channel, and video streaming over the internet Richardson (2003).

The emerging video coding standard H.265 <sup>3</sup> aims at enhancing video coding efficiency using intensive spatiotemporal prediction and entropy coding. Nevertheless, this new standard only considers *objective* redundancy, as opposed to the attention-based methods described below.

<sup>3</sup> http://www.h265.net

**3.2 A review of the attention-based methods**

The above-mentioned compression methods tend to distribute the coding resources evenly in an image. On the contrary, attention-based methods encode visually salient regions with high priority, while treating less interesting regions with low priority (Figure 11). The aim of these methods is to achieve compression without significant degradation of perceived quality.

Saliency-based methods derive from biological properties of the human eye, which enable one to focus only on a limited region of an image at a time. Saliency is thus a subjective notion, but a lot of research has been devoted to its modeling and quantification.

In the following, we attempt to list the methods currently available in the literature, pointing to their strengths and weaknesses when possible. Although there is currently no

**3.2.1 Interactive approaches**

As described above, earlier approaches for modeling the human visual system (HVS) relied on eye-tracking devices to monitor attention points (e.g., Kortum & Geisler (1996)).

With such devices, the encoding continuously and efficiently follows the observer's point of gaze, and observers usually do not notice any degradation of the received frames. However, these techniques are neither practical (because of the eye-tracking device) nor general (because they are restricted to a single viewer). A general coding scheme should be independent of the number of observers, the viewing distance, and any hardware device or user interaction.

Even in the absence of eye tracking, an interactive approach has proven useful: observers can, for example, explicitly point to priority regions with the mouse Geisler & Perry (1998). However, extending this approach to general-purpose non-interactive video compression presents severe limitations.

Attempts to automate this approach using attention-based methods are very complex: top-down information plays an important role, and if no clearly salient objects are present in a frame, the gaze of different people can diverge widely. Despite progress in attention modeling, and even though human gaze is well modeled in the presence of salient objects, it is not possible to obtain a reliable model of human gaze in the absence of specific salient objects (as can be seen in Figure 12). Indeed, the highly dynamical process of eye movements is strongly influenced by previous gaze positions when no salient object pops out from the background.

Fig. 12. The two left images show eye-tracking results from several users, which are spread across the image and very different; the two right images, which show clear regions of interest, exhibit much more correlated fixations.

**3.2.2 Indirect approaches**

Indirect compression consists in modifying the source image to be coded, while keeping the same coding scheme. Such methods are thus generally driven by a saliency map.


Human Attention Modelization and Data Reduction 117


The seminal model of Itti et al. (1998) was later applied to video compression in Itti (2004) by computing a saliency map for each frame of a video sequence and applying a smoothing filter to all non-salient regions. Smoothing leads to higher spatial correlation, a better prediction efficiency of the encoder, and therefore a reduced bit-rate of the encoded video.

The main advantages of this method are twofold. First, a high correlation with human eye movements on unconstrained video inputs is observed. Second, a good compression rate is achieved, the average size of a compressed video being approximately half the size of the original one, for both MPEG-1 and MPEG-4 (DivX) encodings.
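The pre-filtering step described above can be sketched as follows. This is a minimal illustration assuming a precomputed saliency map and a simple box blur; the original work uses a neurobiological saliency model and a different smoothing kernel.

```python
import numpy as np

def box_blur(img, k=7):
    """Box blur via an integral image (a stand-in for the smoothing
    kernel used in the original work)."""
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    c = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape
    return (c[k:k+h, k:k+w] - c[:h, k:k+w]
            - c[k:k+h, :w] + c[:h, :w]) / k**2

def smooth_nonsalient(frame, saliency, thresh=0.5, k=7):
    """Blur only the pixels whose normalized saliency falls below
    `thresh`; salient regions are left untouched.  The increased
    spatial correlation in the background lowers the encoded bit-rate."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-9)
    return np.where(s >= thresh, frame, box_blur(frame, k))
```

The threshold and kernel size trade compression gain against visible blurring of near-salient areas.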

Another method combines both top-down and bottom-up information, using a wavelet decomposition for multiscale analysis Tsapatsoulis et al. (2007). Bit-rate gains ranging from 15% to 64% for MPEG-1 videos and from 10.4% to 28.3% for MPEG-4 are reported.

Mancas et al. (2007) proposed an indirect approach based on their attention model. An anisotropic pre-filtering of the images or frames keeps highly salient regions at full resolution, while low-pass filtering the regions with less important details (Figure 13). Depending on the method parameters, images could be compressed twice as much as with standard JPEG. Nevertheless, even though the quality of the important areas remains unchanged, the quality of the less important regions can decrease dramatically. It is thus not easy to compare compression rates, as the resulting image quality remains subjective.

Fig. 13. Two pairs of images (original and anisotropic filtered). Adapted from Mancas et al. (2007).

The main advantage of indirect approaches is that they are easy to set up, because the coding scheme remains the same: the intelligence of the algorithm is applied as a pre-processing step, while standard coding algorithms are used afterwards. This also makes the compression gain easy to quantify, since the same compression algorithm can be applied either to the original image or to the saliency pre-processed one. However, one possible problem is that the degradation of less salient zones can become strong: selective blurring can lead to artifacts and distortions in low-saliency regions Li et al. (2011).
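The ease of quantifying the gain can be illustrated by running one and the same generic coder on a raw frame and on a saliency pre-processed version of it. In this sketch, zlib stands in for the unchanged video codec, and the synthetic frame, salient window, and quantization step are all arbitrary assumptions.

```python
import zlib
import numpy as np

# Synthetic noisy frame and a hypothetical salient window (both arbitrary).
rng = np.random.default_rng(42)
frame = rng.integers(0, 256, (64, 64), dtype=np.uint8)
salient = np.zeros((64, 64), dtype=bool)
salient[20:40, 20:40] = True

# Indirect pre-processing: coarsely quantize the non-salient pixels,
# then feed BOTH versions to the SAME generic coder.
preprocessed = np.where(salient, frame, frame // 32 * 32)

raw_bytes = len(zlib.compress(frame.tobytes()))
pre_bytes = len(zlib.compress(preprocessed.tobytes()))
assert pre_bytes < raw_bytes  # identical coder, smaller output
```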

**3.2.3 Direct approaches**

Recent work on modeling visual attention (Le Meur, Itti, Parkhurst, Chauvin ...) paved the way to efficient compression applications that modify the heart of the coding scheme to enhance the perceived quality. In this case some modifications of the saliency map are generally necessary to adapt it to coding. The saliency map is then used not only in a pre-processing step, but throughout the entire compression algorithm.

Li et al. (2011), a recent extension of Itti (2004), uses a similar neurobiological model of visual attention to generate a saliency map, whose most salient locations are used to generate a so-called *guidance map*. The latter is used to guide the bit allocation through quantization parameter (QP) tuning by constrained global optimization. Considering its efficiency at achieving compression while preserving visual quality and the general nature of the algorithm, the authors suggest that it might be integrated in general-purpose video codecs.
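A minimal sketch of guidance-driven bit allocation follows, assuming per-macroblock guidance values in [0, 1]. The linear saliency-to-QP map and the QP bounds are illustrative stand-ins for the constrained global optimization of Li et al. (2011), not the authors' actual procedure.

```python
import numpy as np

def qp_from_guidance(guidance, qp_min=22, qp_max=40):
    """Map per-macroblock guidance values in [0, 1] to H.264-style
    quantization parameters: highly salient blocks get a low QP (fine
    quantization, more bits), the background a high QP (fewer bits)."""
    g = np.clip(np.asarray(guidance, dtype=float), 0.0, 1.0)
    return np.rint(qp_max - g * (qp_max - qp_min)).astype(int)

# One QP per 16x16 macroblock, driven by the guidance map.
guidance = np.array([[0.9, 0.1],
                     [0.5, 0.0]])
qp = qp_from_guidance(guidance)   # [[24, 38], [31, 40]]
```

In a real codec the QP field would additionally be constrained by the overall rate budget, which is precisely what the global optimization enforces.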



Future work in this direction should include a study of possible artifacts in the low-bit-rate regions of the compressed video, which may themselves become salient and attract human attention. Another possible issue, pointed out in Li et al. (2011), is that the attention model does not always predict accurately where people look. For example, high-speed motion increases saliency, but regions with slower motion can attract more attention (e.g., a person running on the sidewalk may draw the gaze more than faster-moving cars).

Other approaches with lower computational complexity have been investigated, and in particular two methods using the spectrum of the images: the Spectral Residual Hou & Zhang (2007) and the Phase spectrum of Quaternion Fourier Transform Guo & Zhang (2010). The goal here is to suppress spectral elements corresponding to frequently occurring features.

The Phase spectrum of Quaternion Fourier Transform (PQFT) is an extension of the phase spectrum of the Fourier transform (PFT) to quaternions, incorporating inter-frame motion. The method derives from the property of the Fourier transform that the phase information specifies where each sinusoidal component resides within the image. Thus the locations with less periodicity or less homogeneity in an image create so-called *proto-objects* in the reconstruction of the image's phase spectrum, which indicate where the object candidates are located. A multi-resolution wavelet foveation filter suppressing coefficients corresponding to the background is then applied. Compression rates between 32.6% (from 8.88Mb for the raw H.264 file to 5.98Mb for the compressed file) and 38% (from 11.4Mb for the raw MPEG-4 file to 7.07Mb for the compressed file) are reported in Guo & Zhang (2010).
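The PFT variant (without the quaternion/motion extension) can be sketched in a few lines. The box-blur post-smoothing and its kernel size are assumptions standing in for the Gaussian smoothing of the original papers.

```python
import numpy as np

def pft_saliency(img, k=9):
    """Phase spectrum of Fourier Transform (PFT): keep only the phase
    of the spectrum, reconstruct, square, and smooth.  Proto-objects
    show up as high-energy blobs in the resulting map."""
    f = np.fft.fft2(np.asarray(img, dtype=float))
    recon = np.abs(np.fft.ifft2(np.exp(1j * np.angle(f)))) ** 2
    # Box smoothing via an integral image (the papers use a Gaussian).
    pad = k // 2
    p = np.pad(recon, pad, mode="wrap")
    c = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = recon.shape
    sal = (c[k:k+h, k:k+w] - c[:h, k:k+w] - c[k:k+h, :w] + c[:h, :w]) / k**2
    return sal / (sal.max() + 1e-12)
```

Keeping unit amplitude for every frequency suppresses the dominant, frequently occurring spectral components, so the reconstruction highlights the irregular locations.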

These Fourier-based approaches are computationally efficient, but they are less connected to the human visual system. They also have two main drawbacks linked to the properties of the Fourier transform. First, if an object occupies most of the image, only its boundaries will be detected, unless resampling is used (at the expense of blurring the boundaries). Second, an image with a smooth object in front of a textured background will lead to the background being detected instead (saliency reversal).

Building on the bit allocation model of Li et al. (2011), a scheme for attention-based video compression has recently been suggested by Gupta & Chaudhury (2011). It proposes a learning-based feature integration algorithm, with a Relevance Vector Machine architecture, that propagates visual saliency along motion vectors to save computational time. The architecture relies on thresholding the mutual information between successive frames to flag frames requiring recomputation of saliency.
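The frame-flagging idea can be sketched as follows, with mutual information estimated from a joint histogram of two grayscale frames. The bin count and the threshold are illustrative assumptions, not values from Gupta & Chaudhury (2011).

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Mutual information between two grayscale frames, estimated
    from their joint histogram (in nats)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def needs_saliency_update(prev, curr, mi_thresh=1.0):
    """Flag the current frame for a fresh saliency computation when it
    shares little information with the previous one (e.g., a scene cut);
    otherwise the previous saliency map can be propagated along the
    motion vectors."""
    return mutual_information(prev, curr) < mi_thresh
```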

**3.2.4 Enhancing the objective quality**

Many encoding techniques have sought to optimize perceptual rather than objective quality: these techniques allocate more bits to the image areas where humans can easily see coding distortions, and fewer bits to the areas where coding distortions are less noticeable. Experimental subjective quality assessments show that visual artifacts can be reduced through this approach. However, two problems arise: first, the mechanisms of human perceptual sensitivity are still not fully understood, especially as captured by computational models; second, perceptual sensitivity may not necessarily explain people's attention.

Top-down information is very effective, as the corresponding regions are very likely to be attended. Face detection is one of the crucial features, along with text detection, skin color, and motion-related events for video surveillance (see for example Tan & Davis (2004) and references therein).

Fig. 14. Example of images along with rectangles providing different attention-based automatic zooms. After a saliency map (Mancas (2009)) is computed and low-pass filtered, several threshold values are used to extract the bounding boxes of the most interesting areas. Depending on this threshold, the zoom is more or less precise/important.

requirements: to preserve temporal and spatial distance, and to contain useful information such as object shapes and motions. These methods are composed of the following three steps. First, extraction of the image foreground, for example by minimizing an energy function to automatically separate foreground from background pixels. Second, optimization of the active window to fit the target size. Third, clustering to reduce the number of parameters to be estimated in the optimization process.

**4.1.2 Saliency maps based methods**


Figure 15 illustrates the process of saliency-based retargeting: from the original image (left), a saliency map is computed (middle), from which an area of higher intensity is extracted; its bounding box defines the zoom (right).

Fig. 15. Example of retargeting: left, the original picture; middle, the saliency map; right, the reframed picture. (Adapted from Le Meur & Le Callet (2009))
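The thresholding-and-bounding-box step behind Figures 14 and 15 can be sketched as follows; the threshold is the tunable zoom parameter mentioned above, and the specific value is arbitrary.

```python
import numpy as np

def attention_bbox(saliency, thresh=0.5):
    """Bounding box (r0, r1, c0, c1) of the pixels whose normalized
    saliency exceeds `thresh`.  A higher threshold yields a tighter,
    more aggressive zoom; a lower one a more conservative crop."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-9)
    rows, cols = np.nonzero(s > thresh)
    if rows.size == 0:               # nothing salient enough: keep the frame
        return 0, saliency.shape[0], 0, saliency.shape[1]
    return int(rows.min()), int(rows.max()) + 1, int(cols.min()), int(cols.max()) + 1
```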

A technique to automatically determine the "right" viewing area for spatio-temporal images is proposed in Deselaers et al. (2008). Images are first analyzed to determine relevant regions using three strategies: the visual saliency of spatial images, optical flow for movements, and the appearance of the image. A log-linear algorithm then computes the relevance of every position in the image to determine a sequence of cropping positions with a correct aspect ratio for the display device.

Suh et al. (2003) use the Itti & Koch (2001) algorithm to compute the saliency map, which serves as a basis to automatically delineate a rectangular cropping window. A fast greedy algorithm was developed to optimize the window, which has to capture most of the saliency while remaining sufficiently small.
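A greedy window optimization in this spirit can be sketched as follows. The border-trimming strategy and the retained-saliency fraction are simplifications of the idea, not the exact algorithm of Suh et al. (2003).

```python
import numpy as np

def greedy_crop(saliency, keep=0.9):
    """Greedily trim whole border rows/columns, always removing the
    border that loses the least saliency, while the window still
    retains at least `keep` of the total saliency."""
    total = float(saliency.sum())
    r0, r1, c0, c1 = 0, saliency.shape[0], 0, saliency.shape[1]
    if total <= 0:
        return r0, r1, c0, c1
    inside = total
    while r1 - r0 > 1 or c1 - c0 > 1:
        candidates = []
        if r1 - r0 > 1:
            candidates += [(saliency[r0, c0:c1].sum(), "top"),
                           (saliency[r1 - 1, c0:c1].sum(), "bottom")]
        if c1 - c0 > 1:
            candidates += [(saliency[r0:r1, c0].sum(), "left"),
                           (saliency[r0:r1, c1 - 1].sum(), "right")]
        loss, side = min(candidates, key=lambda t: t[0])
        if inside - loss < keep * total:
            break                    # trimming any further loses too much
        inside -= loss
        if side == "top":      r0 += 1
        elif side == "bottom": r1 -= 1
        elif side == "left":   c0 += 1
        else:                  c1 -= 1
    return r0, r1, c0, c1
```

Because every step removes the cheapest border, the window shrinks around the saliency mass and stops exactly when the `keep` constraint would be violated.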

The previous methods show that the perceptual zoom not only compresses the images, but also allows better recognition during visual search!

The Self-Adaptive Image Cropping for Small Displays Ciocca et al. (2007) is based on an Itti and Koch bottom-up attention algorithm, but also on top-down considerations such as face



