**4.3 Evaluation benchmark of saliency estimation**

Here, we compare the saliency estimation obtained after performing only Step 1 in **Figure 1** with existing saliency models (see **Table 5**). This saliency estimation is trained without access to any ground-truth saliency data.

Saliency prediction metrics assign a score depending on how well the predicted saliency map matches the locations of human fixations (see definitions in Borji et al. [17]). We selected the area under the ROC curve (AUC), Kullback-Leibler divergence (KL), similarity (SIM), shuffled AUC (sAUC), and information gain (IG) metrics, given their consistency with human fixation maps.
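
For reference, the snippet below sketches how three of the distribution-based metrics (KL, SIM, and IG) can be computed with NumPy. This is a minimal sketch following common benchmark formulations; it assumes the predicted map, the ground-truth fixation density, the binary fixation map, and a baseline (e.g., a center prior) are 2D arrays, and it is not necessarily the exact implementation used for the tables.

```python
import numpy as np

EPS = np.finfo(np.float64).eps

def _normalize(p):
    """Normalize a saliency map so it sums to 1 (treated as a probability distribution)."""
    p = p.astype(np.float64)
    return p / (p.sum() + EPS)

def kl_divergence(pred, gt_density):
    """KL divergence between the ground-truth fixation density and the prediction (lower is better)."""
    p, q = _normalize(gt_density), _normalize(pred)
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))

def similarity(pred, gt_density):
    """SIM: histogram intersection of the two normalized maps (higher is better)."""
    return float(np.sum(np.minimum(_normalize(pred), _normalize(gt_density))))

def information_gain(pred, fixation_map, baseline):
    """IG: average log-likelihood gain (in bits) of the prediction over a baseline
    (e.g., a center prior), evaluated at human fixation locations (higher is better)."""
    p, b = _normalize(pred), _normalize(baseline)
    fix = fixation_map.astype(bool)
    return float(np.mean(np.log2(EPS + p[fix]) - np.log2(EPS + b[fix])))
```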

#### **Figure 3.**

*Qualitative results for real images (Toronto dataset). Each image is shown in a different column and each model's saliency map in a different row. The ground-truth density map of human fixations is shown in the second row.*

#### **Figure 4.**

*Qualitative results for synthetic images (SID4VAM dataset). Each image is shown in a different column and each model's saliency map in a different row. The ground-truth density map of human fixations is shown in the second row.*

We compare these scores with classical saliency models, both with handcrafted low-level features (i.e., IKN [16], AIM [19], SDLF [20], and GBVS [13]) and with state-of-the-art deep saliency models (i.e., DeepGazeII [24], SAM-ResNet [4], and SalGAN [47]) that are mainly pretrained on human fixations. The results are surprising: our method, which has not been trained on any saliency data, obtains competitive results. On *Toronto* (**Table 2**), the best model is GBVS, followed by our model, which scores in the top 3 on KL alongside SAM-ResNet and scores slightly higher on the InfoGain metric. On *SID4VAM* (**Table 3**), our approach obtains the best scores on most metrics among deep saliency models, ranking mostly in the top 2 with scores similar to GBVS on most metrics (and outperforming it on the AUC measures).

These saliency prediction results show that our model obtains robust metric scores for saliency prediction on both real and synthetic images. Again, we would like to stress that our model is not trained on fixation prediction datasets, and that our model with subitizing supervision (SUP) performs best at detecting pop-out effects (predicted by visual attention theories [16]) while performing similarly on real-image datasets (**Figure 4**). Some deep saliency models use several mechanisms to boost (and/or train for) saliency metric scores, such as smoothing/thresholding (see **Figure 4**, row 5). Note also that some of these models are already fine-tuned on synthetic images (e.g., SAM-ResNet [4]). *Our approach*, which has not been trained on these types of datasets, has proven robust in these two distinct scenarios/domains.
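
To illustrate this kind of post-processing (it is not part of our method), the sketch below applies Gaussian smoothing followed by thresholding to a raw saliency map, which often benefits distribution-based metrics such as KL and SIM. The `sigma` and `threshold` values are hypothetical and chosen only for illustration, not taken from any of the cited models.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def postprocess_saliency(sal_map, sigma=10.0, threshold=0.1):
    """Blur and threshold a raw saliency map (illustrative values only)."""
    sal = gaussian_filter(sal_map.astype(np.float64), sigma=sigma)  # smooth toward the blur of fixation densities
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)       # rescale to [0, 1]
    sal[sal < threshold] = 0.0                                       # suppress low-confidence responses
    return sal
```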
