**2. Related work**

Traditional computer vision methods relying on handcrafted features, such as Bag of Visual Words (BoVW) [3, 11, 12] and sparse representations [13], have been applied to prohibited item detection in X-ray security imagery. However, recent advances in CNN have drawn more attention to prohibited item detection due to significant performance gains within X-ray security imagery [14–16]. The works of [14, 16] compare handcrafted features and a BoVW-based sparse representation against CNN features, showing that deep CNN features achieve superior performance, with more than 95% accuracy for prohibited item detection. The study of [14] exhaustively compares various CNN architectures to evaluate the impact of network complexity on overall performance. Fine-tuning the entire network architecture for this problem domain yields a 0.99 true positive rate, 0.01 false positive rate and 0.994 accuracy for generalised prohibited item detection [14].

Further work on prohibited item detection within X-ray security imagery is undertaken by Mery et al. [11], where region-of-interest detection is performed across multiple views of the object. The candidate regions obtained from an earlier segmentation step are subsequently matched based on their similarity, achieving a 94.3% true positive and 5.6% false positive rate across multiple-view X-ray security imagery. The work of [15] examines the relative performance of a traditional sliding-window-driven CNN detection model based on [14] against contemporary region-based and single-forward-pass CNN variants such as Faster R-CNN [8], R-FCN [17], and YOLOv2 [18], achieving a maximal 0.88 and 0.97 mAP over 6-class object detection and 2-class firearm detection problems respectively.
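For intuition, the sliding-window detection paradigm evaluated in [15] can be sketched as below; the `classify` callable (a window-level CNN scorer) and all parameter values are illustrative placeholders, not the configuration used in [14, 15]:

```python
import numpy as np

def sliding_window_detect(image, classify, win=64, stride=32, thresh=0.5):
    """Sketch of sliding-window detection: slide a fixed-size window over
    the image, score each crop with a classifier, and keep high-scoring
    windows as detections (x, y, width, height, score)."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classify(image[y:y + win, x:x + win])
            if score >= thresh:
                detections.append((x, y, win, win, score))
    return detections
```

In practice the per-window classification cost is what motivates the region-based (Faster R-CNN, R-FCN) and single-pass (YOLO) alternatives compared in [15].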

To investigate the generalised applicability of CNN architectures within X-ray security imagery, large X-ray imagery datasets are required. Existing public domain datasets such as GDXray [9] contain three major categories of prohibited items, {*Guns, Shurikens, Razor blades*}. However, images in GDXray exhibit less clutter and overlap, making object detection less challenging than in typical operational conditions. By contrast, the SIXray dataset [10] contains six classes, {*Guns, Knives, Wrenches, Pliers, Scissors, Hammers*}, from cluttered operational imagery. This provides more inter-occluding imagery examples but at the same time significantly fewer prohibited item samples than benign samples, akin to an operational (real-world) scenario, where the presence of prohibited items is low within stream-of-commerce (largely benign) X-ray security imagery.

To overcome limited dataset availability, data augmentation has been used to increase overall dataset diversity. Whilst simple image data augmentation strategies such as translation, flipping and scaling do increase the geometric diversity of the imagery, they do not increase the appearance or content diversity of the dataset itself [19]. The work of [20] alternatively attempts data augmentation based on a Generative Adversarial Network (GAN) approach, but generates synthetic prohibited items in isolation rather than within a full, cluttered X-ray security image. By contrast, the work of [21] utilises an approach similar to the concept of Threat Image Projection (TIP), whereby a prohibited item is superimposed into X-ray security imagery. Therefore, in this work, we explore the feasibility of TIP as a data augmentation strategy to support the performance enhancement and evaluation of contemporary deep CNN architectures within the context of prohibited item detection in X-ray security imagery.
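As a rough illustration of TIP-style superimposition: X-ray transmission imagery composes approximately multiplicatively (Beer-Lambert attenuation), so a threat item can be projected into a baggage image by multiplying normalised per-pixel transmittances. The function below is a minimal sketch under that assumption, not the exact projection pipeline of [21] or of this work; all names are illustrative.

```python
import numpy as np

def tip_superimpose(baggage, threat, mask, x, y):
    """TIP-style composition sketch.

    Inputs are float transmittance images in [0, 1], where 1.0 means
    fully transparent (no attenuation). Placing a threat item within the
    baggage multiplies the per-pixel transmittances, which darkens the
    composite where both objects attenuate.
    """
    out = baggage.copy()
    h, w = threat.shape
    region = out[y:y + h, x:x + w]
    # Multiply transmittances only where the threat's binary mask is set;
    # elsewhere the baggage pixels pass through unchanged.
    out[y:y + h, x:x + w] = np.where(mask, region * threat, region)
    return out
```

A full TIP pipeline would additionally randomise the insertion position, orientation and scale of the threat item, which is what makes it attractive as a data augmentation strategy.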
