most of the time work better than a simple gradient, they are not perfect and the results can be very different depending on the method used.

Fig. 17. The original images (A and B) and, for each one, seam removal (vertical seams for A and horizontal seams for B) using a gradient (top row) and using a saliency map (bottom row). Adapted from: http://cilabs.kaist.ac.kr

For spatio-temporal images, Rubinstein et al. (2008) propose to remove 2D seam manifolds from 3D space-time volumes, replacing the dynamic programming method with graph-cut optimization to find the optimal seams. A forward energy criterion is presented which improves the visual quality of the retargeted images. Indeed, the seam carving method removes the seams with the least amount of energy, but it might introduce energy into the images because previously non-adjacent pixels become neighbors. The optimal seam is the one which introduces a minimum amount of energy.

Grundmann et al. (2010) proposed a saliency-based spatio-temporal seam-carving approach with much better spatio-temporal continuity than Rubinstein et al. (2008). The spatial saliency maps are computed on each frame, but they are averaged over a history of frames in order to smooth the maps from a temporal point of view. Moreover, the seams proposed by the authors are temporally discontinuous, providing only the appearance of a continuous seam, which helps in keeping both spatial and temporal coherence.

#### **5. Discussion and perspectives**

#### **5.1 Two main approaches**

In this chapter we discussed the use of saliency-based methods in two main approaches to image and video compression. The first one uses the resulting saliency maps to compress the signal, but it does not modify the original spatial (frame resolution) and temporal (video length) size of the signal. The second one uses saliency maps to crop or reduce the spatio-temporal resolution of the signal. In this latter case, the compression is not obtained through signal quality reduction, but through quantity reduction. Of course, both methods can be used together, and they are more or less interesting depending on the application.

#### **5.2 Automatic human attention prediction issues**

As already shown in Figure 12, different viewers' gaze can be predictable or not depending on the situation, so a compression system should take this fact into account. If there is no real salient object standing out from the background, the compression scheme should not take saliency into account, while saliency can help if salient objects are present.

Another point to take into account is the shape of the saliency maps. As stated in section "Attention models for still images: a comparison", saliency maps with a high resolution which also highlight edges might be more convenient for compression purposes than fuzzier approaches. Those maps preserve important details where artifacts would be clearly disturbing.
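As a toy illustration of how such a saliency map could be exploited for compression, salient blocks can be quantized finely while the background is quantized coarsely. This is a purely illustrative sketch; the function names and the step range are assumptions, not taken from any codec cited in this chapter.

```python
def quant_step(saliency, q_min=4, q_max=32):
    """Map a block's mean saliency in [0, 1] to a quantization step:
    salient blocks get a fine step, background blocks a coarse one.
    The range [q_min, q_max] is an arbitrary illustrative choice."""
    s = max(0.0, min(1.0, saliency))
    return q_max - s * (q_max - q_min)

def quantize_block(block, step):
    """Uniform scalar quantization of a list of coefficients."""
    # e.g. quantize_block([10, 17], quant_step(0.0)) -> [0.0, 32.0]
    return [round(c / step) * step for c in block]
```

With this rule, a fully salient block (saliency 1.0) is quantized with step 4 and keeps most of its detail, while a background block (saliency 0.0) is quantized with step 32, which is where disturbing artifacts would appear if the map were too fuzzy to separate the two cases.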

Attention-based visual coding may seem less crucial as Internet and TV bandwidth continuously increases. Nevertheless, for specific applications like video surveillance, where the quality of uninteresting textures is not a problem but the transmission bandwidth may be, especially for massive HD multi-camera setups, saliency-based approaches are very relevant. In the same way, storing huge amounts of live visual data is very resource-demanding, and the best possible compression is needed while preserving the main events.

Concerning image and video retargeting and summarization, perceptual zooming and smart resizing are of great importance as smart mobile devices become commonplace. Those devices have limited screen sizes, and their bandwidth is much harder to control in terms of quality of service. Intelligent and flexible methods for automatic thumbnailing, zooming, resizing and repurposing of audio-video data are crucial for the fast-developing HD multimedia browsing market. Of course, in this case, very good spatio-temporal continuity is required.
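As a minimal sketch of automatic thumbnailing, an image can be cropped to the bounding box of its most salient cells. This is a hypothetical illustration under a simple thresholding assumption; it is not the rule used by the cropping methods cited in this chapter.

```python
def crop_window(saliency, frac=0.25):
    """Bounding box of all cells whose saliency exceeds frac * peak.
    `saliency` is a 2D grid (list of equally sized rows) with at least
    one non-zero cell. Returns (row0, row1, col0, col1), inclusive."""
    peak = max(max(row) for row in saliency)
    thresh = frac * peak
    rows = [r for r, row in enumerate(saliency)
            if any(v > thresh for v in row)]
    cols = [c for c in range(len(saliency[0]))
            if any(row[c] > thresh for row in saliency)]
    return min(rows), max(rows), min(cols), max(cols)
```

For a map whose saliency mass is concentrated in the middle, the returned window discards the empty border, which is exactly the behavior a thumbnail generator for a small screen would want; a real system would additionally pad the window to the target aspect ratio.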

#### **5.3 Quality evaluation and comparison issue**

Coding artifacts in non-salient regions might attract the viewer's attention to these regions, thereby degrading visual quality. This problem is particularly noticeable at low bit rates, as can be seen in Figure 18: for example, repeating patterns like textures are not interesting in themselves, but they become interesting (actually annoying) if they exhibit compression artifacts or defects. Several methods have been proposed to detect and reduce such coding artifacts, in order to keep the user's attention on the same regions that were salient before compression. It is, however, difficult to find appropriate criteria and quality metrics (Farias (2010); Ninassi et al. (2007)) and benchmark datasets (e.g., Li et al. (2009)).
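One way to see why plain MSE/PSNR can be misleading here is to weight each pixel's error by its saliency, so that artifacts in attended regions count more. The sketch below is a generic illustration of that idea; it is not the metric proposed by Farias (2010) or Ninassi et al. (2007).

```python
import math

def saliency_weighted_mse(ref, dist, saliency):
    """MSE where each pixel's squared error is weighted by its saliency
    (flattened pixel lists; saliency values are non-negative weights)."""
    num = sum(s * (a - b) ** 2 for a, b, s in zip(ref, dist, saliency))
    den = sum(saliency)
    return num / den

def to_psnr(mse, peak=255.0):
    """Convert an MSE value to PSNR in dB."""
    return 10.0 * math.log10(peak * peak / mse)
```

With a uniform map this reduces to ordinary MSE, while a map concentrated on a salient region makes errors there dominate the score, matching the perceptual observation above.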

Fig. 18. First row: classical compression; second row: attention-based compression. Adapted from http://www.svcl.ucsd.edu/projects/ROI\_coding/demo.htm.

Another recurring problem encountered in writing this review is the lack of cross-comparison between the different methods. For example, few authors report compression rates at an equivalent perceptual quality. The notion of "equivalent quality" itself is difficult to define, as even objective metrics are not necessarily perceptually relevant. This problem is particularly important for the methods in section "Attention modeling: what is saliency?", but it is also present in the retargeting and summarization methods from section "Image retargeting based on saliency maps".

One way to fill this gap would be to provide publicly available datasets on the internet that would serve as benchmarks.
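When authors do report rate/quality pairs, a standard way to compare two codecs at equivalent quality is the Bjøntegaard delta rate (average bitrate difference over a quality range). The sketch below implements a simplified piecewise-linear variant; the standard calculation fits a cubic polynomial to the rate-distortion curves instead.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside curve")

def bd_rate(anchor, test_curve, samples=100):
    """Average bitrate difference (in %) of test_curve vs. anchor at equal
    quality. Each curve is a list of (bitrate, psnr) pairs sorted by psnr.
    Integration is done in the log-rate domain, as in the Bjøntegaard
    metric, but with linear rather than cubic interpolation."""
    lo = max(anchor[0][1], test_curve[0][1])     # common quality range
    hi = min(anchor[-1][1], test_curve[-1][1])
    acc = 0.0
    for i in range(samples + 1):
        q = lo + (hi - lo) * i / samples
        ra = _interp(q, [p for _, p in anchor],
                     [math.log(r) for r, _ in anchor])
        rt = _interp(q, [p for _, p in test_curve],
                     [math.log(r) for r, _ in test_curve])
        acc += rt - ra
    avg = acc / (samples + 1)
    return (math.exp(avg) - 1.0) * 100.0
```

A result of -50 % means the tested codec needs half the bitrate of the anchor for the same PSNR; a benchmark dataset with shared test sequences would make such numbers directly comparable across papers.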

Boiman, O. & Irani, M. (2005). Detecting irregularities in images and in video, *International Conference on Computer Vision (ICCV)*.

Bruce, N. D. B. & Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach, *Journal of Vision* 9(3).

Bruce, N. & Tsotsos, J. (2006). Saliency based on information maximization, *in* Y. Weiss, B. Schölkopf & J. Platt (eds), *Advances in Neural Information Processing Systems 18*, MIT Press, Cambridge, MA, pp. 155–162.

Butko, N. J., Zhang, L., Cottrell, G. & Movellan, J. (2008). Visual saliency model for robot cameras, *IEEE Inter. Conf. on Robotics and Automation (ICRA)*, pp. 2398–2403.

Chamaret, C., Le Meur, O., Guillotel, P. & Chevet, J.-C. (2010). How to measure the relevance of a retargeting approach?, *Workshop Media Retargeting ECCV 2010*, Crete, Greece, pp. 1–14.

Ciocca, G., Cusano, C., Gasparini, F. & Schettini, R. (2007). Self-adaptive image cropping for small displays, *IEEE Transactions on Consumer Electronics* 53(4): 1622–1627.

Couvreur, L., Bettens, F., Hancq, J. & Mancas, M. (2007). Normalized auditory attention levels for automatic audio surveillance, *International Conference on Safety and Security Engineering (SAFE)*.

de Bruijn, O. & Spence, R. (2000). Rapid serial visual presentation: A space-time trade-off in information presentation, *Advanced Visual Interfaces*, pp. 189–192.

Deselaers, T., Dreuw, P. & Ney, H. (2008). Pan, zoom, scan – time-coherent, trained automatic video cropping, *IEEE Conference on Computer Vision and Pattern Recognition*, IEEE, Anchorage, AK, USA.

Fan, X., Xie, X., Ying Ma, W., Jiang Zhang, H. & qin Zhou, H. (2003). Visual attention based image browsing on mobile devices, *Proc. of ICME 2003*, IEEE Computer Society Press, pp. 53–56.

Farias, M. C. Q. (2010). *Video Quality Metrics (in: Digital Video)*, InTech.

Frintrop, S. (2006). Vocus: A visual attention system for object detection and goal-directed search, Vol. 3899 of *Lecture Notes in Artificial Intelligence*, Springer Berlin / Heidelberg.

Gao, D., Mahadevan, V. & Vasconcelos, N. (2008). On the plausibility of the discriminant center-surround hypothesis for visual saliency, *J Vis* 8(7): 13.1–13.18.

Geisler, W. S. & Perry, J. S. (1998). A real-time foveated multiresolution system for low-bandwidth video communication, *in Proc. SPIE*, pp. 294–305.

Goferman, S., Zelnik-Manor, L. & Tal, A. (2010). Context-aware saliency detection, *Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)*, pp. 2376–2383.

Grundmann, M., Kwatra, V., Han, M. & Essa, I. (2010). Discontinuous seam-carving for video retargeting, *Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)*, pp. 569–576.

Guo, C. & Zhang, L. (2010). A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, *IEEE Trans Image Process* 19(1): 185–198.

Gupta, R. & Chaudhury, S. (2011). A scheme for attentional video compression, *Pattern Recognition and Machine Intelligence* 6744: 458–465.

Harel, J., Koch, C. & Perona, P. (2007). Graph-based visual saliency, *Advances in Neural Information Processing Systems 19*, MIT Press, pp. 545–552.

Hou, X. & Zhang, L. (2007). Saliency detection: A spectral residual approach, *Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR '07*, pp. 1–8.

Hyvärinen, A., Karhunen, J. & Oja, E. (2001). *Independent Component Analysis*, New York: Wiley.

#### **5.4 Saliency cross-modal integration: combining audio and visual attention**

In a multimedia file, a lot of information is contained in the visual data, but supplemental or complementary information can also be found in the audio track: audio data can confirm visual information, help in being more selective, or even bring new information that is not present in the camera's field of view. Indeed, in some contexts sound might even be the only way to determine where to focus visual attention, for example if several persons are in a room but only one is talking. It thus seems that using both visual and audio saliency is a relevant idea.

Multimodal models of attention are unfortunately very few, and they are mainly used in the field of robotics, as in Ruesch et al. (2008). Another interesting idea is to localize the sound-emitting regions in a video: recent work such as Lee et al. (2010) has shown the ability to localize sounds in an image.
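A very simple fusion rule, given here only as a hypothetical sketch (it is not the scheme of Ruesch et al. (2008) or Lee et al. (2010)), is a normalized linear combination of a visual saliency map and an audio localization map defined on the same grid.

```python
def fuse_saliency(visual, audio, w_audio=0.5):
    """Linear fusion of a visual saliency map with an audio localization
    map (2D grids of the same size). Each map is first normalized to a
    peak of 1 so neither modality dominates by scale alone."""
    def norm(m):
        peak = max(max(row) for row in m) or 1.0  # guard all-zero maps
        return [[v / peak for v in row] for row in m]
    v, a = norm(visual), norm(audio)
    return [[(1 - w_audio) * vv + w_audio * aa
             for vv, aa in zip(vr, ar)] for vr, ar in zip(v, a)]
```

In the talking-person example above, `audio` would peak at the active speaker's location, pulling the fused map toward the speaker even when the visual map alone rates all faces equally; the weight `w_audio` is an arbitrary illustrative parameter.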

Given the computationally intensive nature and the real-time requirements of video compression methods, especially in the case of multimodal integration of saliency maps, some algorithms have exploited recent advances in Graphics Processing Unit (GPU) computing. In particular, a parallel implementation of a spatio-temporal visual saliency model has been proposed by Rahman et al. (2011).

#### **5.5 Saliency models and new trends in multimedia compression**

Visual compression has been a very active field of research and development for over 20 years, leading to many different compression systems and to the definition of international standards. Even though video compression has become a mature field, a lot of research is still ongoing. Indeed, as the quality of compression increases, so do users' expectations and their intolerance to artifacts. Exploiting saliency-based video compression is a challenging and exciting area of research, especially nowadays, when saliency models include more and more top-down information and manage to predict real human gaze better and better.

Multimedia applications are a continuously evolving domain, and compression algorithms must also evolve and adapt to new applications. The explosion of portable devices with less bandwidth and smaller screens, as well as the future semantic TV/web and its object-based description, will make saliency-based algorithms for multimedia data repurposing and compression increasingly important.
