**2. Related work**

Saliency is generally understood as local contrast [12], which typically originates from contrasts between objects and their surroundings, such as differences in color, texture, and shape. This mechanism measures stimuli that are intrinsically salient to the visual system and that primarily attract human attention in the initial stage of visual exposure to an input image [13]. To quickly extract the most relevant information from a scene, the human visual system pays more attention to highlighted regions, as seen in **Figure 1**. Research on computational saliency focuses on the design of algorithms that, like human vision, predict which regions of a scene stand out [14, 15].

Initial efforts to model saliency involved multi-scale representations of color, orientation, and intensity contrast. These were often biologically inspired, such as the well-known works [12, 16]. Building on these models, a large number of approaches hand-crafted such features to obtain an accurate saliency map [17, 18], either maximizing [19] or learning statistics from natural images [13, 20]. Saliency research was further driven by the availability of large datasets, which enabled the use of machine learning algorithms [21] trained primarily on existing human fixation data.
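To make the contrast-based models above concrete, the following sketch computes a crude center-surround saliency map from a grayscale intensity image. This is only a simplified illustration in the spirit of [12], not the implementation of any cited model; the function names, the box-filter approximation of the surround, and the scale values are assumptions chosen for brevity.

```python
import numpy as np

def box_blur(img, k):
    """Separable box filter with edge-replication padding (k must be odd)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    kernel = np.ones(k) / k
    # blur rows, then columns
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)

def center_surround_saliency(intensity, scales=(3, 7, 15)):
    """Crude center-surround contrast: |fine - coarse| summed over scales."""
    sal = np.zeros_like(intensity, dtype=float)
    for k in scales:
        surround = box_blur(intensity, k)  # local surround estimate
        sal += np.abs(intensity - surround)
    sal -= sal.min()
    if sal.max() > 0:
        sal /= sal.max()  # normalize to [0, 1]
    return sal
```

A bright blob on a dark background yields high responses around the blob and near-zero responses in the uniform background, which is the basic behavior the hand-crafted models above refine with color and orientation channels.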

The question of whether saliency is important for object recognition and tracking was raised in [22]. More recent methods [23] take advantage of end-to-end convolutional architectures by fine-tuning on fixation prediction [4, 24, 25]. However, the main goal of these works was to estimate a saliency map, not to study how saliency might contribute to object recognition.

Several works have shown that the saliency map of an image can be useful for object recognition, for example, [8, 10, 11]. The saliency map can help focus attention on the relevant parts of the image to improve recognition; additionally, it can guide training by concentrating backpropagation on relevant image regions. Previous work has shown that saliency-modulated image classification (SMIC) is especially effective for training on datasets with little labeled data [10]. The main drawback of these methods is that they require a trained saliency method. However, Refs. [8, 9] show that this restriction can be removed by hallucinating the saliency map from the RGB image: by training the network for image classification on the ImageNet dataset [26], the saliency branch can be obtained without using human reference annotations.
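The modulation idea can be sketched as gating CNN activations with a saliency map so that salient regions contribute more to both the forward pass and backpropagation. The function name and the blending parameter `alpha` below are illustrative assumptions, not details taken from [8, 10, 11].

```python
import numpy as np

def modulate_features(features, saliency, alpha=0.5):
    """Weight feature maps with a saliency map (SMIC-style gating, sketched).

    features: (C, H, W) activations; saliency: (H, W) map in [0, 1].
    alpha blends uniform attention with the saliency map so that
    non-salient regions are attenuated rather than zeroed out.
    """
    gate = alpha + (1.0 - alpha) * saliency   # (H, W), values in [alpha, 1]
    return features * gate[None, :, :]        # broadcast the gate over channels
```

Because the gate also scales the gradients flowing back through `features`, training effort is implicitly concentrated on the salient regions, which matches the motivation given above.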

Recently, progress in salient object detection has grown substantially, mainly benefiting from the development of deep convolutional neural networks (CNNs). In [27], a CNN based on superpixels was proposed for saliency detection. In contrast, Li et al. [28] used multi-scale features extracted from a deep CNN. Zhao et al. [29] proposed a multi-context deep learning framework that detects salient objects with two different CNNs, which learn local and global information, respectively. Yuan et al. [30] proposed a saliency detection framework that extracts the correlations between object contours and the RGB features of the image. Wang and Shen [31] defined a pyramid-shaped structure to expand the receptive field of visual attention. Hou and Zhang [32] introduced short connections for edge and contour detection. Zhu [33] proposed a visual attention architecture called DenseASPP to extract information. Chen [34] proposed a spatial attenuation context network, which recursively translates and aggregates context features across different layers. Tu [35] introduced an edge-guided block to embed boundary information in saliency maps. Zhou [36] proposed a multi-type self-attention network to learn more semantic details from degraded images. However, these methods rely heavily on pixel-level supervision. To overcome the scarcity of pixel-level annotations, this work focuses on weakly supervised saliency detection.

#### **2.1 Weakly supervised saliency detection**

There are many works using weak supervision for the saliency detection task. For example, Li [37] used image-level labels to train a classification network and applied its coarse activation maps as saliency maps. Wang [38] proposed a weakly supervised two-stage method, designing an inference network to predict foreground regions and global smooth pooling (GSP) to aggregate responses from the predicted objects. Zeng [39] designed a unified network capable of weak supervision from multiple sources, including image labels, captions, and pseudo-labels; furthermore, they designed an attention transfer loss to transmit supervisory signals between subnetworks with different supervision.
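The image-level-label strategy of [37] is in the spirit of class activation maps, where a classifier trained only on category labels yields a coarse spatial map of the evidence for a class. The sketch below assumes a global-average-pooling classifier; the function name and shapes are illustrative, not taken from the cited work.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Coarse saliency from image-level labels via a class activation map.

    features: (C, H, W) activations from the last conv layer, assuming
    global average pooling feeds a linear classifier with weights
    fc_weights: (num_classes, C). Returns an (H, W) map in [0, 1].
    """
    # weighted sum of feature maps, using the target class's weights
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)  # keep only positive class evidence
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

Such maps localize the most discriminative regions but are typically coarse and incomplete, which is why the methods above add refinement stages or extra supervision sources.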

Different from the previous methods, this work proposes to use subitizing information as weak supervision for the saliency detection task, first studying the problem of salient object subitizing and the relationship between subitizing and saliency detection.
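As a generic illustration of count-level weak supervision (not the loss actually used in this work), a counting head attached to the network can be trained with cross-entropy on binned salient-object counts, e.g. 0, 1, 2, 3, 4+; the function name and binning are assumptions for the sketch.

```python
import numpy as np

def subitizing_loss(count_logits, count_label):
    """Cross-entropy on a predicted salient-object count bin.

    count_logits: (K,) raw scores from a counting head;
    count_label: integer index of the ground-truth count bin.
    """
    z = count_logits - count_logits.max()       # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())     # log-softmax
    return -log_probs[count_label]
```

The appeal of such supervision is that a single integer per image is far cheaper to annotate than a pixel-level saliency mask, which is exactly the scarcity problem identified above.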
