**3. Attention-based visual coding**

Since the late 1990s, attention-based techniques have been introduced in the field of image and video coding (e.g., Kortum & Geisler (1996); Maeder et al. (1996)). Attention can be used to compress videos or to transmit the most salient parts first when data is transferred from a server to a client. This section first introduces the general principles of video compression, then reviews some of the major achievements in saliency-based visual coding.

Fig. 11. Illustration of the distortions introduced by general compression methods (first three images, on the left) compared to saliency-based compression (last three images, on the right), at three different compression levels. Adapted from Yu & Lisin (2009).

unified taxonomy, we have divided the methods into interactive, indirect and direct, the latter being the most commonly studied.

**3.2.1 Interactive approaches**

As described above, earlier approaches for modeling the human visual system (HVS) relied on eye-tracking devices to monitor attention points (e.g., Kortum & Geisler (1996)).

With such devices, the encoding continuously and efficiently follows the focus of the observer. Indeed, observers usually do not notice any degradation of the received frames. However, these techniques are neither practical (because they require an eye-tracking device) nor general (because they are restricted to a single viewer). A general coding scheme should be independent of the number of observers, the viewing distance, and any hardware device or user interaction.

Even in the absence of eye tracking, an interactive approach has proved useful. Observers can, for example, explicitly point to priority regions with the mouse Geisler & Perry (1998). However, extending this approach to general-purpose non-interactive video compression presents severe limitations.

Automating this approach with attention-based methods is difficult: top-down information plays an important role, and when a frame contains no clearly salient objects, gaze patterns can differ widely between viewers. Despite progress in attention modelling, and even though human gaze is well modelled in the presence of salient objects, it is not possible to obtain a reliable model of human gaze in the absence of specific salient objects (as can be seen in Figure 12). Indeed, the highly dynamic process of eye movements is strongly influenced by the previous gaze position if no salient object pops out from the background.

Fig. 12. The two images on the left show eye-tracking results from several users, which are spread across the image and differ widely, while the two images on the right, which contain clear regions of interest, exhibit much more correlated fixations.

**3.2.2 Indirect approaches**

Indirect compression consists in modifying the source image to be coded, while keeping the same coding scheme. Such methods are thus generally driven by saliency-map-based methods.
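The principle of indirect compression, namely altering the source image and leaving the coder untouched, can be illustrated with a minimal sketch. It assumes a grayscale image and a saliency map of the same size, both given as nested lists, and it smooths every pixel whose saliency falls below a threshold so that a standard encoder would spend fewer bits there. The function names, the 3x3 box blur and the threshold value are illustrative assumptions, not the method of any specific paper reviewed here.

```python
from typing import List

Image = List[List[int]]        # grayscale pixel values, row-major
SaliencyMap = List[List[float]]  # assumed to be normalized to [0, 1]


def box_blur_pixel(image: Image, row: int, col: int, radius: int = 1) -> int:
    """Integer mean of the neighbourhood around (row, col), clipped at the borders."""
    height, width = len(image), len(image[0])
    total = 0
    count = 0
    for r in range(max(0, row - radius), min(height, row + radius + 1)):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            total += image[r][c]
            count += 1
    return total // count


def prefilter_by_saliency(image: Image, saliency: SaliencyMap,
                          threshold: float = 0.5, radius: int = 1) -> Image:
    """Keep salient pixels unchanged and blur the others (hypothetical pre-filter).

    The output would then be passed to an unmodified, standard encoder.
    """
    return [
        [
            image[r][c] if saliency[r][c] >= threshold
            else box_blur_pixel(image, r, c, radius)
            for c in range(len(image[0]))
        ]
        for r in range(len(image))
    ]
```

Because only the input is changed, the resulting bitstream stays compatible with any standard decoder. The cost is that the bit savings depend on how well the unchanged coder exploits the smoothed regions.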

## **3.1 General video coding and compression**

The goal of this section is to briefly introduce the concepts of video coding and compression, two terms that tend to be used interchangeably since they are closely related. What follows is a short introduction to general video compression; Lorente (2011) provides a recent exhaustive review.

Video compression is the process of converting a video signal into a format that takes up less storage space or transmission bandwidth. It can be considered as a coding scheme that reduces the number of bits needed to represent the video. Nevertheless, the overall visual quality has to be preserved, which leads to a compromise between the level of artifacts and the bandwidth.
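To give an order of magnitude for this compromise, the following sketch computes the bitrate of an uncompressed video and the compression ratio needed to fit it into a given channel. The function names and the numerical values (resolution, frame rate, channel capacity) are assumptions chosen for illustration only.

```python
def raw_bitrate_bps(width: int, height: int, fps: float,
                    bits_per_pixel: int = 24) -> float:
    """Bitrate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * fps


def required_compression_ratio(raw_bps: float, channel_bps: float) -> float:
    """Ratio by which the raw stream must shrink to fit the channel."""
    if channel_bps <= 0:
        raise ValueError("channel_bps must be positive")
    return raw_bps / channel_bps


# Hypothetical example: 1920x1080 video, 25 frames per second, 24 bits per pixel,
# to be sent over an assumed 8 Mbit/s channel.
raw = raw_bitrate_bps(1920, 1080, 25)             # 1,244,160,000 bit/s
ratio = required_compression_ratio(raw, 8_000_000)  # 155.52
```

Under these assumed figures the stream must be reduced by a factor of more than 150, which indicates why the trade-off between artifacts and bandwidth is unavoidable in practice.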

Two types of compression can be distinguished: lossy and lossless compression Richardson (2003). In a lossless compression system, statistical redundancy is removed so that the original data can be perfectly reconstructed. Unfortunately, at the present time lossless methods only allow a modest amount of compression, insufficient for most video applications. On the other hand, lossy compression provides higher compression ratios, at the expense of not being able to perfectly reconstruct the original signal. Lossy compression is the type of compression most commonly used for video, attention-based or not.
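The difference between the two families can be sketched on a toy one-dimensional signal. In the example below, the lossless path applies a general-purpose entropy coder (DEFLATE, through Python's `zlib` module) directly to the samples, while the lossy path first applies uniform quantization and then the same coder. This is only an illustration under simplifying assumptions: the test signal, the function names and the quantization step are hypothetical, and real video codecs rely on prediction and transform coding rather than on DEFLATE.

```python
import zlib
from typing import Tuple


def demo_signal(length: int = 4096) -> bytes:
    """Hypothetical 8-bit test signal: a slow ramp plus small deterministic jitter."""
    return bytes((x // 4 + (x * 37) % 7) % 256 for x in range(length))


def lossless_roundtrip(samples: bytes) -> Tuple[int, bytes]:
    """Compress and decompress; returns (compressed size, exact reconstruction)."""
    compressed = zlib.compress(samples, 9)
    return len(compressed), zlib.decompress(compressed)


def lossy_roundtrip(samples: bytes, step: int = 16) -> Tuple[int, bytes]:
    """Quantize, compress, decompress, then reconstruct at interval midpoints.

    Returns (compressed size, approximate reconstruction).
    """
    # Uniform quantization: only the index of each interval is stored.
    indices = bytes(s // step for s in samples)
    compressed = zlib.compress(indices, 9)
    decoded = zlib.decompress(compressed)
    # The discarded detail cannot be recovered; use the interval midpoint.
    reconstructed = bytes(min(255, i * step + step // 2) for i in decoded)
    return len(compressed), reconstructed


samples = demo_signal()
lossless_size, exact = lossless_roundtrip(samples)
lossy_size, approx = lossy_roundtrip(samples, step=16)
max_error = max(abs(a - b) for a, b in zip(samples, approx))
```

On this signal, `exact` is identical to `samples`, whereas `approx` differs from it by at most half a quantization step per sample in exchange for a smaller compressed size. Increasing `step` trades more distortion for fewer bits, which is the compromise described above.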

It is interesting to note that, even if generic compression algorithms do not explicitly use saliency, they implicitly exploit the mechanisms of human visual perception to remove redundant information Geisler & Perry (1998). For example, retinal persistence of vision causes the human eye to retain an instantaneous view of a scene for at least about one-tenth of a second. This allows video (theoretically a continuum in time) to be represented by a series of discrete frames (e.g., 24 frames per second) with no apparent motion distortion.

Coding standards define a representation of visual data as well as a method of decoding it to reconstruct visual information. Recent hybrid standards like H.264/AVC Wiegand et al. (2003) have led to significant progress in compression quality, allowing for instance the transmission of high definition (HD) television signals over a limited-capacity broadcast channel, and video streaming over the internet Richardson (2003).

The emerging video coding standard H.265<sup>3</sup> aims at enhancing video coding efficiency using intensive spatiotemporal prediction and entropy coding. Nevertheless, this new standard only considers *objective* redundancy, as opposed to the attention-based methods described below.
