**1. Introduction**

For humans, object recognition is a nearly instantaneous, precise, and extremely adaptable process. Furthermore, humans have the innate ability to learn new classes of objects from only a few examples [1, 2]. The human brain reduces the complexity of incoming data by filtering out part of the information and processing only what grabs our attention. This, combined with our biological predisposition to respond to certain shapes or colors, allows us to recognize at a glance the most important or salient regions of an image. This mechanism can be observed by analyzing which parts of an image humans pay more attention to, for example, where they fix their eyes when shown an image [3, 4]. The most accurate way to record this behavior is to track eye movements while the subject is presented with a set of images to evaluate. Computational saliency estimation aims to identify the extent to which regions or objects stand out from their surroundings or background to human observers. Saliency maps can be used in a wide range of applications, including object detection, image and video understanding, and eye tracking. On the other hand, it is known that the human visual system can effortlessly identify the number of objects in the range of 1 to 4 at a single glance [5]. Since then, this phenomenon, later coined subitizing by [6], has been studied and tested in various experimental settings [7].

Therefore, inspired by subitizing and by the results obtained in [8, 9], the main objective of this project is to incorporate the subitizing of salient objects (SOS) in order to improve our previous results. The subitizing information tells us the number of salient objects in a given image and thus guides the network toward the location and appearance of those objects, all within a weakly supervised configuration. It should be noted that when the network is trained with subitizing supervision, it learns to focus on the regions related to the salient objects. We therefore design a saliency subitizing process (SSP) module, which is responsible for extracting the attended regions as a saliency map. A second module, the saliency map updating process (SUP), is in charge of improving the quality of the saliency masks by refining the activated regions in an end-to-end manner. It merges the source images with the saliency maps to obtain masked images, which serve as new inputs for the next refinement. Finally, in this work we propose to design and build a convolutional neural network (CNN) consisting of a module in charge of the SSP and a function that supports the SUP. The SSP estimates the number of salient objects and extracts the saliency maps with their respective locations, while the SUP updates the saliency masks produced by the first module. The general model of our proposal is shown in **Figure 1**.
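The refinement loop described above can be illustrated with a minimal sketch. This is not the actual implementation: the function name `sup_refinement_step` and the use of plain NumPy arrays are illustrative assumptions; in practice both modules would be CNN components operating on batched tensors.

```python
import numpy as np

def sup_refinement_step(image, saliency_map):
    """One SUP-style refinement step (illustrative, not the paper's code):
    mask the source image with the current saliency map so that the next
    forward pass concentrates on the currently salient regions.

    image:        H x W x C array with pixel values in [0, 1]
    saliency_map: H x W array with values in [0, 1]
    """
    # Broadcast the single-channel saliency map over the color channels.
    return image * saliency_map[..., None]

# Toy example: a 2x2 RGB image and a map that keeps only the top row.
image = np.ones((2, 2, 3))
saliency = np.array([[1.0, 1.0],
                     [0.0, 0.0]])
masked = sup_refinement_step(image, saliency)
```

After this step, `masked` retains the top row of pixels and zeroes out the bottom row, so a subsequent pass would be driven only by the regions the current map considers salient.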

However, as this work is a first attempt toward the final result, it only covers the development, experimentation, and explanation associated with step 1.

Our main contributions are briefly summarized below:

• We propose an approach that generates saliency maps from the saliency subitizing process (SSP).

#### **Figure 1.**

*Overview of our proposed method with the saliency subitizing process (SSP) and the saliency updating process (SUP).*


The chapter is organized as follows. Section 2 reviews related work in saliency detection. Section 3 presents our approach. Experimental results are reported in Section 4. Finally, Section 5 contains our conclusions.
