**3. Proposed method**

This work proposes to design and implement a convolutional neural network whose core component is a saliency subitizing process (SSP). The SSP counts the salient objects in an image and, at the same time, extracts saliency maps that encode the objects' locations.

#### **3.1 Saliency subitizing process (SSP)**

It should be noted that the information provided by the subitizing process indicates only the number of salient objects in a given image [40]; it does not explicitly provide the location or appearance of those objects. However, when the network is trained on subitizing (a supervised learning task), it learns to focus on the regions associated with the most salient objects. Training images are divided into five categories according to the number of salient objects: 0, 1, 2, 3, and 4 or more. For this reason, the SSP is designed to extract these regions in the form of a saliency mask. During this process, a classification network is used for the object subitizing task, in this context ResNet-152 or ResNet-50 [41] and AlexNet [42] as the "backbone network," pre-trained on the ImageNet dataset [43].

*Saliency Detection from Subitizing Processing DOI: http://dx.doi.org/10.5772/intechopen.108552*

The classification loss is the cross-entropy (see Eq. (1)). In order to obtain denser saliency maps, the stride of the last two down-sampling layers of the backbone is set to 1, which produces feature maps at 1/8 of the original resolution before the classification layer. In order to enhance the representation power of the proposed network, two attention modules are also applied: a channel attention module and a spatial attention module, which tell the network "what" and "where" to focus on, respectively. Both are placed sequentially between the ResNet blocks and the AlexNet convolutional layers.

$$\mathcal{L} = -\sum\_{I \in \mathcal{D}} \log p\_{c(I)}(I) \tag{1}$$
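Eq. (1) is the standard negative log-likelihood of the ground-truth subitizing class over the training set. A quick numeric check (with illustrative, made-up logits) shows that PyTorch's built-in cross-entropy matches the formula:

```python
import torch
import torch.nn.functional as F

# Two images, five subitizing classes; logits are arbitrary example values.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0, 0.3],
                       [0.1, 1.2, 0.4, -0.5, 2.2]])
targets = torch.tensor([0, 4])  # ground-truth object counts c(I)

# Built-in cross-entropy, summed over the batch as in Eq. (1).
loss = F.cross_entropy(logits, targets, reduction="sum")

# Manual Eq. (1): minus the log-probability of the true class of each image.
manual = -torch.log_softmax(logits, dim=1)[torch.arange(2), targets].sum()
```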

In addition, the gradient-weighted class activation mapping (Grad-CAM) technique [44] is applied to extract salient regions as the initial saliency maps; it uses the gradient information flowing into the last convolutional layer during the backward pass. This gradient information represents the importance of each neuron during network inference. Assume that the features produced by the last convolutional layer have a channel size of *K*. For a given image, let *f<sup>k</sup>* be the activation map of channel *k*, where *k* ∈ [1, *K*]. For each class *c*, the gradients of the score *y<sup>c</sup>* with respect to the activation map *f<sup>k</sup>* are averaged to obtain the neuron importance weight *a<sub>k</sub><sup>c</sup>* of class *c*:

$$a\_k^c = \frac{1}{N} \sum\_{i=1}^{m} \sum\_{j=1}^{h} \frac{\partial y^c}{\partial f\_{i,j}^k} \tag{2}$$

where *i* and *j* are the spatial coordinates in the feature map and *N* = *m* × *h*. With the neuron importance weights *a<sub>k</sub><sup>c</sup>*, the activation map *M<sup>c</sup>* can be computed as:

$$M^c = \mathrm{ReLU}\left(\sum\_k a\_k^c f^k\right) \tag{3}$$

Finally, a ReLU (rectified linear unit) layer is applied to the weighted combination of activation maps; this function filters out negative values, since only the positive values contribute to the class of interest, while the negative values contribute to other categories. The size of the saliency map is the same as the size of the last convolutional feature maps (1/8 of the original resolution). This process is shown in **Figure 2**.
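Eqs. (2) and (3) can be demonstrated end to end. This sketch uses a tiny stand-in network (a single convolution plus a linear head) purely for illustration — the chapter's actual backbone is ResNet/AlexNet — and computes the gradient averaging, the weighted sum, and the final ReLU exactly as in the equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 8, 3, padding=1)  # stand-in "last conv layer", K = 8
head = nn.Linear(8, 5)                # 5 subitizing classes

x = torch.randn(1, 3, 32, 32)
f = conv(x)                           # activation maps f^k, shape (1, 8, 32, 32)
f.retain_grad()                       # keep gradients on this non-leaf tensor
scores = head(f.mean(dim=(2, 3)))     # class scores y^c
c = 2                                 # class of interest
scores[0, c].backward()               # populates f.grad with dy^c / df^k

# Eq. (2): average the gradients over the m x h spatial positions.
a = f.grad.mean(dim=(2, 3))           # neuron importance weights a^c_k, (1, 8)

# Eq. (3): channel-weighted sum of activation maps, then ReLU.
M = F.relu((a[:, :, None, None] * f).sum(dim=1))  # saliency map M^c
```

As stated above, `M` has the same spatial size as the last convolutional feature maps, and the ReLU guarantees it is non-negative.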

**Figure 2.**

*The pipeline of the saliency subitizing process (SSP).*
