**2. Attention modeling: what is saliency?**

In this first part of the chapter, a global view of the methods used to model attention in computer science will be presented. The details provided here will be useful to understand the next parts of the chapter which are dedicated to attention-based image and video compression.

## **2.1 Attention in computer science: idea and approaches**

There are two main approaches to attention modeling in computer science. The first one is based on the notion of "saliency" and implies a competition between "bottom-up" and "top-down" information. The idea of saliency maps is that people's gaze will be directed to areas which, in some way, stand out from the background. The eye movements can be computed from the saliency map by using winner-take-all (Itti et al. (1998)) or more dynamical algorithms (Mancas, Pirri & Pizzoli (2011)). The second approach to attention modeling is based on the notion of "visibility", which assumes that people look at locations that will lead to successful task performance. Those models are dynamic and intend to maximize the information acquired by the eye (the visibility) of eccentric regions compared to the current eye fixation in order to solve a given task (which can also be free viewing). In this case, top-down information is naturally included in the notion of task, along with the dynamic bottom-up information maximization. In this approach, the eye movements are directly an output of the model and do not have to be inferred from a saliency map. The literature about attention modeling in computer science is not symmetric between those two approaches: saliency-based methods are much more popular than visibility models. For this reason, the following sections in this first part of the chapter will also mainly deal with saliency methods, but a review of visibility methods will be provided at the end.
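As a minimal illustration of how eye movements can be read off a saliency map, the following sketch applies a greedy winner-take-all rule with inhibition of return. The disk radius and number of fixations are illustrative parameters, and the argmax loop is only a stand-in for the neural WTA network of Itti et al. (1998):

```python
import numpy as np

def scanpath_wta(saliency, n_fix=3, inhib_radius=2):
    """Greedy winner-take-all scanpath with inhibition of return.

    Repeatedly picks the most salient location, then suppresses a disk
    around it so the gaze moves on to the next most salient area.
    """
    s = saliency.astype(float)
    ys, xs = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    fixations = []
    for _ in range(n_fix):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((int(y), int(x)))
        # inhibition of return: suppress a disk around the winner
        s[(ys - y) ** 2 + (xs - x) ** 2 <= inhib_radius ** 2] = -np.inf
    return fixations
```

On a map with three isolated peaks, the predicted fixations simply visit the peaks in decreasing order of saliency.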

### **2.2 Saliency approaches: bottom-up methods**

Bottom-up approaches use features (most of the time low-level features, but not always) extracted from the signal, such as luminance, color, orientation, texture, objects' relative positions or even simply neighborhoods or patches from the signal. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for areas that are contrasted, rare, surprising, novel, worth learning, less compressible, or information-maximizing. All those words are actually synonyms, and they all amount to searching for some unusual features in a given context, which can be spatial or temporal. In the following, different methods are described for still images, videos and audio signals. All those modalities are of course interesting for multimedia compression which, by definition, involves both video and audio information.
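This shared principle, salient means rare in context, can be sketched by scoring each pixel with the self-information of its quantized feature value. This is a toy illustration only: real models use richer features and more elaborate contexts than a global histogram.

```python
import numpy as np

def rarity_map(feature, bins=8):
    """Self-information saliency: rare feature values get a high score.

    The feature map is quantized into `bins` levels, the probability of
    each level is estimated from its global frequency, and each pixel is
    scored with -log2(p), Shannon's self-information.
    """
    edges = np.linspace(feature.min(), feature.max(), bins + 1)[1:-1]
    f = np.digitize(feature, edges)
    values, counts = np.unique(f, return_counts=True)
    p = counts[np.searchsorted(values, f)] / f.size
    return -np.log2(p)
```

A single bright pixel on a dark background receives a much higher score than the frequent background value.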

### **2.2.1 Still images**

The literature is very active concerning still-image saliency models. While some years ago only a few labs in the world were working on the subject, nowadays hundreds of different models are available. Those models have various implementations and technical approaches even if initially they all derive from the same idea. It is thus very hard to find a perfect taxonomy which classifies all the methods. Some attempted taxonomies proposed an opposition between "biologically-driven" and "mathematically-based" methods, with a third class including "top-down information". This approach implies that only some methods can handle top-down information, while all bottom-up methods could use top-down information more or less naturally. Another difficult point is to judge the biological plausibility, which can be obvious for some methods but much less so for others. Another criterion is the computational time or the algorithmic complexity, but it is very difficult to make this comparison as not all the existing models provide cues about their complexity. Finally, a classification opposing center-surround contrast methods to information-theory-based methods does not take into account different approaches such as the spectral residual one.

Therefore, we introduce here a new taxonomy of the saliency methods which is based on the context that those methods take into account to exhibit signal novelty. In this framework, there are three classes of methods. The first one uses the pixel's surroundings: here a pixel or patch is compared with its surroundings at one or several scales. A second class of methods uses the entire image as context and compares pixels or patches of pixels with other pixels or patches from other locations in the image, not necessarily in the surroundings of the initial patch. Finally, the third class takes into account a context which is based on a model of what normality should be. This model can be described as a priori probabilities, Fourier spectrum models, etc. In the following sections, the main methods from those three classes are described for still images.

#### 2.2.1.1 Context: pixel's surroundings

This approach is based on a biological motivation and dates back to the work of Koch & Ullman (1985) on attention modeling. The main principle is to initially compute visual features at several scales in parallel, then to apply center-surround inhibition, combine the results into conspicuity maps (one per feature) and finally fuse them into a single saliency map. A lot of models derive from this approach and mainly use local center-surround contrast as a local measure of novelty. A good example of this family of approaches is Itti's model (Figure 1, Itti et al. (1998)), which is the first implementation of the Koch and Ullman model. It is composed of three main steps. First, three types of static visual features are selected (colors, intensity and orientations) at several scales. The second step is the center-surround inhibition, which provides a high response in case of high contrast and a low response in case of low contrast. This step results in a set of feature maps for each scale. The third step consists in an across-scale combination, followed by normalization, to form "conspicuity" maps which are single multiscale contrast maps for each feature. Finally, a linear combination is made to achieve inter-feature fusion. Itti proposed several combination strategies: a simple and efficient one is to give higher weights to conspicuity maps whose global peaks are much bigger than their mean. This is an interesting step which integrates global information in addition to the local multi-scale contrast information.
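The center-surround inhibition step can be sketched as the difference between a fine-scale and a coarse-scale local average. Box filters and wrap-around borders are simplifications chosen here for brevity; the original model uses Gaussian pyramids and across-scale differences:

```python
import numpy as np

def box_blur(img, r):
    """Mean filter with a (2r+1) x (2r+1) window (wrap-around borders)."""
    out = np.zeros(img.shape, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(img, dy, 0), dx, 1)
    return out / (2 * r + 1) ** 2

def center_surround(img, c=1, s=4):
    """Center-surround response: |fine-scale mean - coarse-scale mean|.

    High where a small region differs from its larger neighborhood,
    low in uniform areas, i.e. a crude single-feature contrast map.
    """
    return np.abs(box_blur(img, c) - box_blur(img, s))
```

An isolated bright spot produces a strong response around its location and none in the flat background.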

This implementation proved to be the first successful approach to attention computation by providing better predictions of the human gaze than chance or simple descriptors like entropy. Following this success, most computational models of bottom-up attention use the comparison of a central patch to its surroundings as a novelty indicator. Updates were obtained by adding other features to the same architecture, such as symmetry (Privitera & Stark (2000)) or curvedness (Valenti et al. (2009)). Le Meur et al. (2006) refined the model by using more biological cues like contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions. Another popular and efficient model is the Graph-Based Visual Saliency model (GBVS, Harel et al. (2007)), which is very close to Itti et al. (1998) regarding feature extraction and center-surround, but differs from it in the fusion step, where GBVS computes an activation map before normalization and combination. Other models like Gao et al. (2008) also used center-surround approaches even if the rest of the computation is made in a different mathematical framework based on a Bayesian approach.

Human Attention Modelization and Data Reduction 107

Fig. 1. Model of Itti et al. (1998). Three stages: center-surround differences, conspicuity maps, inter-feature fusion into saliency map.

#### 2.2.1.2 Context: the whole image

In this approach, the context which is used to provide a degree of novelty or rarity to image patches is not necessarily the surroundings of the patch, but can be other patches in its neighborhood or even anywhere in the image. The idea can be divided into two steps. First, local features are computed in parallel from a given image. The second step measures the likeness of a pixel or a neighborhood of pixels to other pixels or neighborhoods within the image. This kind of visual saliency is called "self-resemblance". A good example is shown in Figure 2. That model has two parts. First, it proposes to use local regression kernels as features. Second, it proposes to use a nonparametric kernel density estimation for such features, which results in a saliency map consisting of a local "self-resemblance" measure indicating the likelihood of saliency (Seo & Milanfar (2009)).

Fig. 2. Model of Seo & Milanfar (2009). Patches at different locations are compared.
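A toy version of this self-resemblance idea can be sketched by comparing raw pixel patches with all other patches of the image through a kernel density estimate. Note the assumptions: Seo & Milanfar use local regression kernels as features, whereas plain pixel patches, a Gaussian kernel and wrap-around borders are used here for brevity:

```python
import numpy as np

def self_resemblance(img, patch=3, sigma=0.5):
    """Saliency as inverse patch density over the whole image.

    Each pixel's patch is compared with every other patch; pixels whose
    patch resembles few others get a low density, hence a high score.
    """
    r = patch // 2
    H, W = img.shape
    # stack shifted copies so each pixel carries its full neighborhood
    patches = np.stack([
        np.roll(np.roll(img, dy, 0), dx, 1)
        for dy in range(-r, r + 1) for dx in range(-r, r + 1)
    ], axis=-1).reshape(H * W, -1)
    # pairwise squared distances between all patches (O(N^2), toy scale)
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    density = np.exp(-d2 / (2 * sigma ** 2)).mean(1)  # KDE over patches
    return (1.0 / density).reshape(H, W)  # low density -> high saliency
```

The quadratic cost restricts this sketch to tiny images; practical models approximate the density with a local or sampled comparison set.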

A similar approach was developed in Mancas (2007) and Mancas (2009), which detects saliency in areas that are globally rare and locally contrasted. After a feature extraction step, both the local contrast and the global rarity of pixels are taken into account to compute a saliency map. An example of the difference between locally contrasted and globally rare features is given in Figure 3. The leftmost image is an apple with a defect shown in red. The second image shows the fixations predicted by Itti et al. (1998), where the locally contrasted apple edges are well detected while the less contrasted but rare defect is not. The third image shows results from Mancas et al. (2007), which detected the apple edges but also the defect. Finally, the rightmost image is the mouse tracking result for more than 30 users.

Fig. 3. Difference between locally contrasted and globally rare features. Left image: an apple with a defect in red; second image: Itti et al. (1998); third image: Mancas et al. (2007); right image: mouse tracking (ground truth).

A typical model using this context is the model of Stentiford (2001), which takes random neighborhoods and checks whether many similar neighborhoods can be found in the rest of the image. If there are few possibilities, the patch is rare, thus salient. This model does not need feature extraction, as the features remain included in the compared patches.

Oliva et al. (2003) also defined saliency as the inverse likelihood of the features at each location. This likelihood is computed as a Gaussian probability over the whole image on features extracted by using a steerable pyramid. Boiman & Irani (2005) proposed a method where different patches were not only compared with each other, but their relative positions were also taken into account.
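Saliency as inverse likelihood can be sketched by fitting a single Gaussian to the feature distribution of the whole image and scoring each pixel by its Mahalanobis distance to the mean. This is a strong simplification: Oliva et al. compute the likelihood on steerable-pyramid features, whereas any H x W x D feature array is accepted here:

```python
import numpy as np

def gaussian_rarity(features):
    """Inverse-likelihood saliency under a single fitted Gaussian.

    The Mahalanobis distance equals the negative log-likelihood up to
    constants, so unlikely (rare) feature vectors score highest.
    """
    X = features.reshape(-1, features.shape[-1])
    mu = X.mean(0)
    # small ridge keeps the covariance invertible on degenerate data
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[-1])
    d = X - mu
    mahal = np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d)
    return mahal.reshape(features.shape[:2])
```

On a nearly constant feature map, the single outlier pixel receives the maximum score.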

A well-known model is that of Bruce & Tsotsos (2006). This model of bottom-up overt attention is based on the principle of maximizing the information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit taking into account patches from the image projected onto a new basis obtained by performing ICA (Independent Component Analysis, Hyvärinen et al. (2001)) on a large sample of 7x7 RGB patches drawn from natural images.

More recently, Goferman et al. (2010) introduced context-aware saliency detection based on four principles. First, local low-level considerations, including factors such as contrast and color, are used. Second, global considerations, which suppress frequently occurring features while maintaining features that deviate from the norm, are taken into account. Then, higher-level information such as visual organization rules, which state that visual forms may possess one or several centers of gravity about which the form is organized, is used. Finally, human face detection is also integrated into the model. While the first two points are purely bottom-up, the two others may introduce some top-down information.

#### 2.2.1.3 Context: a model of normality

This approach is probably less biologically-motivated in most of the implementations. The context which is used here is a model of what the image should be: if things are not as they should be, this can be surprising, thus interesting. In Achanta et al. (2009) a very simple attention model was developed. The method first changes the color space from RGB to Lab and then finds the Euclidean distance between the Lab pixel vectors of a Gaussian-filtered image and the average Lab vector of the input image. This is illustrated in Figure 4. The mean image used is a kind of model of the image statistics, and pixels which are far from those statistics are more salient.

Fig. 4. Achanta et al. (2009) uses a model of the mean image.

Fig. 5. For three original images (on the left), the eye-tracking results (column A) and six other saliency model maps. B = Itti and Koch (1998), C = Harel et al. (2007), D = Mancas (2007), E = Seo and Milanfar (2009), F = Bruce and Tsotsos (2005), G = Achanta (2009). A threshold applied on the saliency maps is shown on the images below.
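The Achanta et al. (2009) computation is simple enough to sketch in a few lines: blur the image slightly, then measure each pixel's distance to the global mean colour. Assumptions in this sketch: the input is already in Lab space, and a 3x3 mean filter with wrap-around borders stands in for the Gaussian blur of the paper:

```python
import numpy as np

def achanta_saliency(lab):
    """Distance of each (blurred) pixel to the mean colour of the image.

    `lab` is an H x W x 3 array; the global mean vector acts as the
    'model of normality' and far-away pixels are the salient ones.
    """
    blurred = np.zeros_like(lab, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            blurred += np.roll(np.roll(lab, dy, 0), dx, 1)
    blurred /= 9.0
    mean_vec = lab.reshape(-1, 3).mean(0)
    return np.linalg.norm(blurred - mean_vec, axis=-1)
```

A coloured patch on a uniform background scores far above the background, which sits close to the image mean.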

In 2006, Itti & Baldi (2006) introduced the concept of surprise, central to attention. They described a formal Bayesian definition of surprise which is the only consistent formulation under minimal axiomatic assumptions. Surprise quantifies how data affect an observer by measuring the difference between the posterior and prior (model of normality) beliefs of the observer. In Hou & Zhang (2007), the authors proposed a model that is independent of any features. As it is known that natural images have a 1/f decreasing log-spectrum, the difference between this normality model, obtained by low-pass filtering, and the log-spectrum of the image is reconstructed into the image space and leads to the saliency map.
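The spectral residual idea of Hou & Zhang (2007) can be sketched as follows: a smoothed log-amplitude spectrum serves as the model of normality, and the residual (with the original phase) is transformed back to image space. The filter size and the small stabilizing constants are illustrative choices:

```python
import numpy as np

def spectral_residual(img, blur=3):
    """Spectral residual saliency: unusual spectral energy is salient.

    The local average of the log-amplitude spectrum approximates the
    1/f trend; what remains after subtracting it is the 'residual'.
    """
    F = np.fft.fft2(img)
    log_amp = np.log(np.abs(F) + 1e-8)
    phase = np.angle(F)
    # smooth the log-spectrum with a small mean filter (normality model)
    avg = np.zeros_like(log_amp)
    for dy in range(-(blur // 2), blur // 2 + 1):
        for dx in range(-(blur // 2), blur // 2 + 1):
            avg += np.roll(np.roll(log_amp, dy, 0), dx, 1)
    avg /= blur * blur
    residual = log_amp - avg
    # back to image space, keeping the original phase
    return np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
```

The original paper additionally smooths the resulting map with a Gaussian filter before thresholding, a step omitted here.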

#### 2.2.1.4 Attention models for still images: a comparison

It is not easy to classify attention models, for several reasons. First, there is a large variety of models. Second, some research groups (e.g., Itti's) have implemented different models, finding themselves in several categories. Also, some approaches have several contexts and could be classified in more than one category. Nevertheless, based on the context notion, it seems possible to identify these three main families of methods despite their diversity.

Figure 5 displays saliency maps computed with six models (available as Matlab code), along with the eye-tracking results showing where people really look. For this purpose, three images from Bruce's database <sup>1</sup> were used. Along with the saliency maps of the six models, one can find the most salient areas after automatic thresholding.

Figure 5 reveals that saliency maps can be quite different, from very fuzzy ones (Itti, Harel or Seo) to high-resolution ones (Mancas, Bruce or Achanta). It is not easy to compare those saliency maps (they should all be low-pass filtered to decrease their resolution). Nevertheless, for the purpose of compression, one needs a model which is able to highlight the interesting areas but also the interesting edges, as Mancas's model does.
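When a quantitative comparison is needed, one common choice is the Pearson correlation between a model's saliency map and the eye-tracking density map. This metric is an illustrative addition, not the visual comparison used in the chapter:

```python
import numpy as np

def correlation_score(saliency, fixation_map):
    """Pearson correlation between a saliency map and a fixation map.

    Both maps are standardized to zero mean and unit variance; the mean
    of their product is then the correlation coefficient (+1 = perfect
    agreement, 0 = unrelated, -1 = inverted prediction).
    """
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-8)
    return float((s * f).mean())
```

Because the score is invariant to linear rescaling of either map, it sidesteps the resolution and dynamic-range differences mentioned above, though low-pass filtering the sharper maps first remains advisable.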

#### **2.2.2 Videos**

Part of the static models have been extended to video. Itti's model was generalized with the addition of motion and flickering features, and in Itti & Baldi (2006) he applied another approach, based on surprise, to static but also dynamic images. Le Meur et al. (2007b) used motion in addition to spatial features. Gao et al. (2008) generalized his 2D square center-surround approach to 3D cubic shapes. Belardinelli et al. (2008) used an original approach of 3D Gabor filter banks to detect spatio-temporal saliency. Bruce & Tsotsos (2009) extended their model by learning ICA not only on 2D patches but on spatio-temporal 3D patches. As shown in Figure 6, similarly to Gao, Seo & Milanfar (2009) introduced the time dimension in addition to their static model.

Fig. 6. Seo & Milanfar (2009) generalized to video in 2009.

Another model is SUN (Saliency Using Natural Statistics) from Butko et al. (2008), which proposes a Bayesian framework for saliency. Two methods are implemented. First, the features are calculated as responses of biologically plausible linear filters, such as DoG (Difference of Gaussians) filters. Second, the features are calculated as the responses to filters learned from natural images using independent component analysis (ICA).

<sup>1</sup> http://www-sop.inria.fr/members/Neil.Bruce/

Frintrop (2006) introduced the biologically motivated computational attention system VOCUS (Visual Object detection with a Computational attention System), which detects regions of interest in images. It operates in two modes: an exploration mode, in which no task is provided, and a search mode, with a specified target. The bottom-up mode is based on an enhancement of the Itti model.

Finally, Mancas, Riche & J. Leroy (2011) developed a bottom-up saliency map to detect abnormal motion. The proposed method is based on a multi-scale approach using features extracted from the optical flow and a global rarity quantification to compute bottom-up saliency maps. It shows good results from four objects to dense crowds, with increasing performance. The idea here is to show that motion is most of the time salient, but that within motion there might be motion which is more or less salient. The Mancas model is capable of extracting different motion behaviors from complex videos or crowds (Figure 7).

Fig. 7. Detection of salient motion compared to the rest of motion. Red motion is salient because of unexpected speed. Cyan motion is salient because of unexpected direction (Mancas, Riche & J. Leroy (2011)).

Fig. 8. Kayser et al. (2005) audio saliency model inspired from Itti.

Couvreur et al. (2007) define features that can be computed along audio signals in order to assess the level of auditory attention on a normalized scale, i.e. between 0 and 1. The proposed features are derived from a time-frequency representation of audio signals and highlight salient regions such as regions with high loudness and strong temporal and frequency contrasts. Normalized auditory attention levels can be used to detect sudden and unexpected changes of audio textures and to focus the attention of a surveillance operator on sound segments of interest in monitored audio streams.

**2.3 Saliency models: including top-down information**

There are two main families of top-down information which can be added to bottom-up attention. The first one mainly deals with learnt normality, which can come from experience with the current signal if it is time-varying, or from previous experience (tests, databases) for still images. The second approach is about task modeling, which can either use object-recognition-related techniques or model the usual location of the objects of interest.

**2.3.1 Top-down as learnt normality: attending unusual events**

Concerning still images, the "normal" gaze behavior can be learnt from the "mean observer". Eye-tracking techniques can be used on several users, and the mean of their gaze on a set of natural images can be computed. This was achieved by several authors, as can be seen in Figure 9. Bruce and Judd et al. (2009) used eye-trackers, while Mancas (2007) used mouse-tracking techniques to compute this mean observer. In all cases, it seems clear that, for natural images, the eye gaze is attracted by the center of the images.

This fact seems logical, as natural images are acquired using cameras and the photographer will naturally tend to locate the objects of interest in the center of the picture. This observation might be interesting in the field of image compression, as high-quality compression seems to be required mainly in the center of the image while peripheral areas could be compressed with lower rates.

Of course, this observation for natural images is very different from more specific images which use a priori knowledge. Mancas (2009) showed using mouse tracking that gaze density
