**3. Proposed approach**

We investigate the use of a full TIP pipeline, based on prior work in the field [22–24], to generate a range of appearance- and content-based dataset variations (Section 3.1). Subsequently, CNN object detection architectures are used to evaluate the TIP-based data augmentation approach and to compare its performance against real X-ray security imagery (Section 3.2).

## **3.1 Synthetic X-ray security imagery via TIP**

Our TIP pipeline consists of three components: *threat image transformation*, *insertion position determination* and *image compositing* as illustrated in **Figure 2**.

We use threat (prohibited item) images containing clean, isolated object signatures which can be easily segmented from their plain background via simple thresholding. To diversify the resultant synthetic images, we apply *threat image transformation* by rotating the threat signature by a random angle *θ*. Although other threat image transformation strategies (e.g., noise, illumination, magnification, etc.) have been explored in [25], our work focuses purely on the combination of the segmented threat signature and a benign X-ray security image, isolating it from the effects of other data augmentation techniques. We denote this transformed threat image as $I_s$ and its *i*-th row, *j*-th column pixel as $I_s(i, j)$.
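
As an illustration only (not the authors' implementation), this segmentation and rotation step can be sketched with OpenCV as follows; the background threshold value and the function name `segment_and_rotate` are assumptions made for the example.

```python
import cv2
import numpy as np

def segment_and_rotate(threat_path, bg_thresh=240):
    """Segment a threat signature from its plain (bright) background via simple
    thresholding and rotate it by a random angle theta (illustrative sketch)."""
    img = cv2.imread(threat_path, cv2.IMREAD_GRAYSCALE)

    # Simple thresholding: pixels darker than the bright X-ray background
    # are treated as part of the threat signature.
    mask = (img < bg_thresh).astype(np.uint8)

    # Rotate both the signature and its mask by a random angle theta.
    theta = np.random.uniform(0.0, 360.0)
    h, w = img.shape
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), theta, 1.0)
    img_rot = cv2.warpAffine(img, rot, (w, h), borderValue=255)   # keep background white
    mask_rot = cv2.warpAffine(mask, rot, (w, h), borderValue=0)

    return img_rot, mask_rot
```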

A valid insertion position within the bag image is determined based on the bag region and the shape of the threat signature.

Given a bag image $I_t$, we use morphological operations to extract the bag region. Specifically, the original bag image is first binarised by thresholding (**Figure 3b**) to extract the foreground (target) region for insertion. Due to noise, a simple thresholding process cannot ideally separate background and foreground. We therefore sequentially apply a series of appropriately parameterised morphological operations, including dilation (**Figure 3c**), hole filling (**Figure 3d**) and erosion (**Figure 3e**), to identify the largest connected image region as the target for insertion (see **Figure 3f**). Clearly, a valid insertion of the threat signature has to guarantee that the threat signature is located completely inside this target region.
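
The morphological pipeline of **Figure 3** can be sketched as follows; the threshold value, structuring-element size and the use of OpenCV/SciPy are illustrative assumptions rather than the parameterisation used in this work.

```python
import cv2
import numpy as np
from scipy.ndimage import binary_fill_holes

def extract_bag_region(bag_gray, thresh=250, kernel_size=11):
    """Extract the largest connected bag region as the candidate insertion target."""
    # (b) Binarisation: foreground = pixels darker than the bright X-ray background.
    fg = (bag_gray < thresh).astype(np.uint8)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))

    # (c) Dilation to close small gaps caused by noise.
    fg = cv2.dilate(fg, kernel)

    # (d) Hole filling inside the bag silhouette.
    fg = binary_fill_holes(fg).astype(np.uint8)

    # (e) Erosion to undo the earlier dilation at the region boundary.
    fg = cv2.erode(fg, kernel)

    # (f) Keep only the largest connected component as the insertion target.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    if n_labels <= 1:
        return np.zeros_like(fg)            # no foreground found
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)
```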

#### **Figure 2.**

*Threat image projection (TIP) pipeline for synthetically composited image generation.*


**Figure 3.**

*Image segmentation using morphological operations for insertion position determination. (a) Grayscale (b) Binarisation (c) Dilation (d) Hole filling (e) Erosion (f) Biggest region*

To this end, we repeatedly sample a random insertion position until a valid one is found. The selected valid insertion position is denoted by a binary mask matrix *M* of the same size as the target baggage image, with elements of one indicating the insertion region.
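
A minimal sketch of this rejection-sampling loop is given below; the function name and the `max_tries` safeguard are illustrative assumptions, not part of the original description.

```python
import numpy as np

def sample_insertion_mask(bag_region, sig_mask, max_tries=1000, rng=None):
    """Sample a top-left corner for the threat signature until its whole footprint
    lies inside the bag region; returns the binary mask M and the chosen position."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = bag_region.shape
    h, w = sig_mask.shape
    for _ in range(max_tries):
        r = rng.integers(0, H - h + 1)
        c = rng.integers(0, W - w + 1)
        window = bag_region[r:r + h, c:c + w]
        # Valid only if every signature pixel falls on the bag region.
        if np.all(window[sig_mask > 0] > 0):
            M = np.zeros((H, W), dtype=np.uint8)
            M[r:r + h, c:c + w] = sig_mask          # ones mark the insertion region
            return M, (r, c)
    raise RuntimeError("No valid insertion position found")
```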

Finally, the threat signature $I_s$ is superimposed onto the target bag image $I_t$ at the selected valid position (denoted by *M*) to generate a synthetically composited image $I_{\rm TIP}$. To ensure the plausibility of the composited TIP image, we consider two factors in image blending. The parameter *α* controls the transparency of the source image $I_s$ (*α* = 0.9). The other parameter is the *threat threshold T*, which ensures the consistency of the source image with the target image in terms of image contrast. The use of the *threat threshold T* aims to remove the high-value pixels of the threat signature so that the inserted threat signature does not appear too bright compared with the target region onto which it is superimposed. To calculate the value of *T*, we first transform the target image $I_t$ to a greyscale image $G_t$. The threat threshold *T* is empirically calculated as:

$$T = \min\left(\exp\left(\hat{g}^{5}\right) - 0.5,\ 0.95\right) \tag{1}$$

where $\hat{g}$ is the normalised average intensity of the insertion region within $G_t$, calculated as:

$$\hat{g} = \frac{\sum_{i,j} G_t(i,j)\, M(i,j)}{\sum_{i,j} 255\, M(i,j)} \in [0, 1] \tag{2}$$
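
A short sketch of Eqs. (1) and (2), assuming 8-bit greyscale input, is given below (illustrative only, not the authors' code):

```python
import numpy as np

def threat_threshold(bag_gray, M):
    """Compute the normalised insertion-region intensity g_hat (Eq. 2)
    and the threat threshold T (Eq. 1)."""
    M = M.astype(bool)
    # Eq. (2): mean intensity of the insertion region, normalised to [0, 1].
    g_hat = bag_gray[M].sum() / (255.0 * M.sum())
    # Eq. (1): brighter insertion regions tolerate brighter threat pixels, capped at 0.95.
    T = min(np.exp(g_hat ** 5) - 0.5, 0.95)
    return g_hat, T
```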

The image compositing can be formulated as follows:

$$I_{\rm TIP}(i,j) = \begin{cases} (1-\alpha)\, I_t(i,j) + \alpha\, I_s(i',j'), & M(i,j) = 1 \text{ and } I_s(i',j') < 255\, T\\ I_t(i,j), & \text{otherwise} \end{cases} \tag{3}$$

where $I(i,j)$ denotes the value of the pixel in the *i*-th row and *j*-th column of an image $I$, and $I_s(i', j')$ denotes the pixel in the source image corresponding to the pixel $I_{\rm TIP}(i,j)$. Since the value of *T* computed by Eq. (1) lies in the range 0.5–0.95, any pixel in the source image with a value higher than $255\, T$ is ignored during the image compositing process.
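
Putting Eq. (3) together with the blending parameter *α* = 0.9, a minimal compositing sketch (assuming 8-bit greyscale images and the mask/position produced above) could look as follows:

```python
import numpy as np

def composite_tip(bag_gray, sig_gray, M, top_left, T, alpha=0.9):
    """Blend the threat signature into the bag image following Eq. (3)."""
    out = bag_gray.astype(np.float32)
    r, c = top_left
    h, w = sig_gray.shape

    src = sig_gray.astype(np.float32)
    region = M[r:r + h, c:c + w] > 0           # where M(i, j) = 1
    target = out[r:r + h, c:c + w]             # view into the output image

    # Blend only where M == 1 and the source pixel is darker than 255 * T,
    # so overly bright signature pixels are ignored (Eq. 3).
    blend = region & (src < 255.0 * T)
    target[blend] = (1.0 - alpha) * target[blend] + alpha * src[blend]

    return out.astype(np.uint8)
```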

The proposed TIP approach is able to generate a large number of diverse synthetic X-ray baggage images containing prohibited items whose locations are known by construction, providing the annotations needed to train a supervised detection model at no extra labelling cost.
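
For example, the bounding-box label of the inserted item can be read directly from the insertion mask *M* (a hypothetical helper, not part of the original pipeline description):

```python
import numpy as np

def bbox_from_mask(M):
    """Derive an (x_min, y_min, x_max, y_max) bounding-box label directly from the
    insertion mask M, so no manual annotation of the composited image is needed."""
    ys, xs = np.nonzero(M)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```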

### **3.2 Detection strategies**

We use two representative CNN object detection models, Faster R-CNN [8] and RetinaNet [26], for the purposes of our evaluation.

Faster R-CNN [8] is an object detection algorithm that combines its predecessor Fast R-CNN [27] with a Region Proposal Network (RPN). Unlike Fast R-CNN [27], which relies on an external region proposal method, this architecture has its own region proposal network, which consists of convolutional layers that generate object proposals and two fully connected layers that predict the coordinates of bounding boxes. The corresponding locations and bounding boxes are then fed into objectness classification and bounding box regression layers. Finally, the objectness classification layer classifies whether a given region proposal is an object or a background region, while the bounding box regression layer predicts the object localisation at the end of the overall detection process.
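
As a hedged illustration of how such a detector can consume the TIP-derived boxes (the chapter does not state that this particular implementation or backbone was used), torchvision's Faster R-CNN accepts image/target pairs directly; the `weights=None` / `weights_backbone=None` keywords assume torchvision ≥ 0.13:

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone (illustrative setup only).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=2)   # background + prohibited item
model.train()

# One synthetic training image (3 x H x W, values in [0, 1]) with its TIP-derived box.
images = [torch.rand(3, 512, 512)]
targets = [{
    "boxes": torch.tensor([[120.0, 80.0, 260.0, 200.0]]),  # x_min, y_min, x_max, y_max
    "labels": torch.tensor([1]),                            # 1 = prohibited item, 0 = background
}]

# In training mode the model returns the RPN and detection-head losses.
loss_dict = model(images, targets)
loss = sum(loss_dict.values())
loss.backward()
```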

RetinaNet [26] is an object detector whose key idea is to address the extreme class imbalance between foreground and background classes. To improve performance, RetinaNet employs a novel loss function called Focal Loss, which modifies the cross-entropy loss such that it down-weights the loss of easy negative samples, so that training focuses on the sparse set of hard samples. Unlike Faster R-CNN [8], which applies a two-stage approach, RetinaNet applies a one-stage approach, making it potentially faster and simpler.
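
A minimal sketch of the focal loss idea, using the standard *α* = 0.25 and *γ* = 2 values from [26] in a generic binary form (not the authors' code), is:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute little to the total loss."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```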
