**5. Evaluation**

Our evaluation compares the performance of CNN architectures for detecting prohibited items when trained on real X-ray imagery versus synthetically composited X-ray imagery. We use mean Average Precision (mAP) as our evaluation criterion, following [15].
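To make this criterion concrete, the following is a minimal sketch of per-class AP and mAP in the PASCAL VOC all-point interpolation style. It assumes detections have already been matched to ground truth at a fixed IoU threshold (the matching step is omitted), and the function names are illustrative rather than taken from our implementation.

```python
# Hypothetical sketch: per-class Average Precision (AP) and mAP.
# Assumes each detection is a (confidence, is_true_positive) pair already
# matched against ground truth at a fixed IoU threshold.

def average_precision(detections, num_gt):
    """detections: list of (confidence, is_tp); num_gt: ground-truth count."""
    # Rank detections by descending confidence, as in the VOC protocol.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) pairs, one per ranked detection
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    # All-point interpolation: integrate the precision envelope over recall.
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Envelope: best precision at this recall or any higher recall.
        p_interp = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * p_interp
        prev_recall = recall
    return ap

def mean_average_precision(per_class):
    """per_class: {class_name: (detections, num_gt)} -> mAP over classes."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)
```

A single correct top-ranked detection of the only ground-truth instance yields AP = 1.0; a false positive ranked above the true positive halves it to 0.5, reflecting the rank sensitivity of the metric.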

### **5.1 Prohibited item detection—Quantitative results**

In the first set of experiments (**Table 1**, upper), prohibited items in X-ray security imagery are detected using the CNN architectures set out in Section 3.2. We use the *Dbf3* dataset, which contains three types of prohibited items {*Firearm, Firearm Parts, Knives*}. To provide a performance benchmark, our CNN architectures are first trained and evaluated on real images of *Dbf3* (*Dbf3Real* ➔ *Dbf3Real*). The AP/mAP highlighted in **Table 1** (upper) denotes the maximal performance achieved. **Table 1** shows statistical results of prohibited item detection for the Faster R-CNN [8] and RetinaNet [26] architectures using ResNet50 and ResNet101. In line with the overall complexity of the network, we observe maximal mAP performance from ResNet101 for all three prohibited item classes. In this performance benchmark, the best performance (mAP = 0.88) is achieved on *Dbf3Real* by Faster R-CNN with the ResNet101 configuration, as presented in **Table 1** (upper).

In the second set of experiments (**Table 1**, middle), the CNN architectures trained on the synthetic X-ray imagery (*Dbf3SC*) achieve 0.78 mAP when tested on the same set of real X-ray imagery (*Dbf3Real*) as in **Table 1** (upper). Even though the performance is lower than the former results (**Table 1**, upper), this experimental setting does not require any manual image labelling (as TIP insertion positions are known) and yet achieves surprisingly good performance on a standard benchmark. The performance gap between CNN architectures trained on real and synthetically composited


#### **Table 1.**

*Detection results of varying CNN architectures trained on: upper* ➔ *Dbf3Real, middle* ➔ *Dbf3SC and lower* ➔ *Dbf3Real+SC. All models are evaluated on a set of real X-ray security imagery.*

X-ray imagery is attributable to the domain shift problem, whereby the distributions of the training and test data differ. In the first experiment (**Table 1**, upper), the training and test data are drawn from the same distribution since they are created by randomly dividing data captured under the same experimental conditions. By contrast, in the second experimental setup (**Table 1**, middle), the prohibited items used for the synthetic X-ray imagery (*Dbf3SC*) are different from those in the test X-ray imagery (*Dbf3Real*). It is also noteworthy that the prohibited item images used to generate the synthetic X-ray imagery (*Dbf3SC*) comprise a smaller set of prohibited item instances than the real training images. As a result, CNN architectures trained on synthetic data have larger generalisation errors than those trained on real data. However, when tested on the synthetic X-ray imagery (*Dbf3SC*) data (**Table 2**), CNN architectures trained on real or synthetic data achieve comparable performance. These experimental results show that it is essential to have diverse prohibited item signatures in the training data to improve generalisation. This also largely explains why the overall performance in **Table 2** (evaluation on the synthetic dataset, *Dbf3SC*) is significantly higher than the overall performance in **Table 1** (evaluation on the real dataset, *Dbf3Real*).

In the third set of experiments (**Table 1**, bottom), we evaluate the effectiveness of synthetic X-ray imagery by combining it with the real images of *Dbf3* to create the *Dbf3Real+SC* dataset, as explained in Section 4. We evaluate on the test sets of images from the real *Dbf3* (**Table 1**) and synthetically composited (**Table 2**) datasets. Surprisingly, the combination of real and synthetic imagery does not improve the results (e.g. 0.81 vs. 0.88 on *Dbf3Real* and 0.89 vs. 0.91 on *Dbf3SC* with Faster R-CNN and ResNet101). This can also be explained by the domain shift problem mentioned previously. This data combination may perform better if we explicitly apply domain adaptation techniques [30]. In addition, we may also need to further evaluate the quality of the TIP solution that underpins our work.
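The dataset combination step above can be sketched as a simple merge of two COCO-style annotation dictionaries. This is an illustrative sketch only: it assumes both sets share the same category list, and the function and field names follow the generic COCO convention rather than our actual data pipeline.

```python
# Hypothetical sketch: merging real and synthetically composited training sets
# into one COCO-style annotation dict. Synthetic image/annotation ids are
# re-indexed so they do not collide with the real set.

def merge_coco(real, synthetic):
    """Merge two COCO-style dicts; assumes identical 'categories' lists."""
    img_offset = max((img["id"] for img in real["images"]), default=0) + 1
    ann_offset = max((a["id"] for a in real["annotations"]), default=0) + 1
    merged = {
        "images": list(real["images"]),
        "annotations": list(real["annotations"]),
        "categories": real["categories"],  # shared class list assumed
    }
    for img in synthetic["images"]:
        merged["images"].append({**img, "id": img["id"] + img_offset})
    for ann in synthetic["annotations"]:
        merged["annotations"].append({
            **ann,
            "id": ann["id"] + ann_offset,
            "image_id": ann["image_id"] + img_offset,  # follow re-indexed image
        })
    return merged
```

Re-indexing by offset keeps every annotation pointing at its original image while guaranteeing globally unique ids in the combined set.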

*Evaluating Convolutional Neural Networks for Prohibited Item Detection Using Real… DOI: http://dx.doi.org/10.5772/intechopen.105162*


#### **Table 2.**

*Detection results of different CNN architectures trained on: upper* ➔ *Dbf3Real, middle* ➔ *Dbf3SC and lower* ➔ *Dbf3Real+SC. All models are evaluated on the SC dataset.*

#### **Figure 5.**

*Exemplar detections of prohibited items (red boxes) using Faster R-CNN [8] trained on (A) Dbf3Real and (B) Dbf3SC images.*

### **5.2 Prohibited item detection—Qualitative examples**

Exemplar prohibited item detection results from Faster R-CNN [8] with ResNet101 are depicted in **Figure 5**, using real (top row) and synthetic (bottom row) training imagery. These results illustrate that synthetically composited imagery generated using TIP techniques can be effective in training detection architectures for prohibited item detection in cluttered X-ray security imagery.

We also visually inspect the detection results to investigate the performance difference between models trained on real and synthetic data. Comparing the results depicted in **Figure 6A1** and **B1**, the model trained on synthetic data fails to detect the knives since this type of knife has a very different appearance from the

**Figure 6.**

*Exemplar prohibited item detection (by Faster R-CNN [8]) using Dbf3Real (A1, A2) and* Dbf3SC *(B1, B2) training datasets. The green dashed box in B1 indicates a missed detection, whereas in B2 a benign item is wrongly detected as a* knife*.*

ones used to generate the synthetic imagery. On the other hand, **Figure 6A2** and **B2** show that the model trained on synthetic imagery has mistakenly detected a benign item as a knife. These results account for the low knife detection performance observed in **Table 1**. Consequently, we need either to use more diverse threat signatures for data synthesis or to apply domain adaptation techniques to tackle the domain shift problem identified previously.
