**4. Common challenges and possible countermeasures**

During dataset preparation, the ideal case is one in which several samples (on the order of thousands or more) are available for each class to be detected, the classes are balanced, and they are well separated from each other. In such a case, the network can be trained on a representative set of samples covering the whole input space, avoiding the confusion caused by an uneven distribution of the inputs or by similarities between the classes.

Unfortunately, industrial production lines with well-optimized processes usually yield few defective products and many more good samples. It is therefore often unfeasible to collect sets of defective samples large enough to train CNNs for classification purposes, and in the majority of cases the objective of the training shifts from defect classification to anomaly detection. The worst-case scenario is an *imbalanced dataset* with scarce availability of defect samples and classes that are not easily separable. Deep metric learning uses DNNs to directly learn a similarity metric, rather than obtaining it as a byproduct of solving a classification task [37]. Such methods are well suited for tasks where the number of object classes is potentially endless and classification is not applicable: the approach is to compute a distance metric between input samples and reference prototypes. Moreover, the training does not even require defective samples if the class features are well defined and distinct from each other. Unfortunately, textured objects present stochastic surface appearance and properties, which complicates the definition of such prototypes.
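The prototype-based decision rule described above can be sketched in a few lines. In this hypothetical example, the embeddings are assumed to be already produced by a trained metric network; the function name, the toy prototypes, and the threshold value are illustrative, not part of any cited method.

```python
import math

def nearest_prototype(embedding, prototypes, threshold):
    """Assign an embedded sample to the closest class prototype.

    embedding:  feature vector produced by a (hypothetical) metric network
    prototypes: dict mapping class name -> prototype embedding
                (e.g., the mean embedding of the good samples of that class)
    threshold:  maximum accepted distance; beyond it the sample is an anomaly
    """
    distances = {c: math.dist(embedding, p) for c, p in prototypes.items()}
    best = min(distances, key=distances.get)
    # A sample far from every prototype is flagged as anomalous,
    # which is how defect detection works without defective training samples.
    if distances[best] > threshold:
        return "anomaly", distances[best]
    return best, distances[best]

# Toy 2-D embedding space with two "good" product prototypes
protos = {"good_A": (0.0, 0.0), "good_B": (5.0, 5.0)}
label, dist = nearest_prototype((0.2, -0.1), protos, threshold=1.0)
```

A sample near a prototype is accepted as that class, while one far from all prototypes is rejected as an anomaly; no defective samples are needed at training time.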

Different sampling strategies can be implemented to deal with *imbalanced datasets*. When the minority class represents the defective pieces, it can be convenient to use only as many elements from the majority class as there are available defective ones. This approach is known in the literature as *undersampling*, and it can be applied when the number of defective samples is adequate for the training task. The alternative, which does not reduce the majority class, is to present the available defective samples to the network multiple times, so as to match the number of good ones. Such an approach is called *oversampling* in the literature. It is important to notice that this method can be risky: the network can easily overfit, since the scarce minority samples usually cannot cover all possible scenarios in the input space. Nevertheless, there are cases in which the aforementioned solutions are valuable tools to enhance the performance of the classifier, such as the work proposed by Yap et al. [38].
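The two strategies can be sketched as follows. This is a minimal illustration on toy string labels; the function names and the fixed seed are assumptions for reproducibility, and real pipelines would operate on image tensors instead.

```python
import random

def undersample(majority, minority, seed=0):
    """Draw as many majority samples as there are minority (defective) ones."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, seed=0):
    """Repeat minority samples (with replacement) until they match the
    majority count; repeated defect samples raise the overfitting risk
    mentioned above."""
    rng = random.Random(seed)
    repeated = [rng.choice(minority) for _ in range(len(majority))]
    return list(majority) + repeated

good = [f"good_{i}" for i in range(1000)]
defective = [f"defect_{i}" for i in range(20)]
balanced_small = undersample(good, defective)  # 40 samples in total
balanced_large = oversample(good, defective)   # 2000 samples in total
```

Undersampling discards information from the majority class, while oversampling keeps it at the cost of duplicated minority samples; which trade-off is acceptable depends on how many defective samples are available.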

An alternative approach that is often used to increase the robustness of the classification is *data augmentation*. Traditional techniques involve operations on the input images such as scaling, cropping, rotation, mirroring, and color shift [39]. Since the augmented samples are derived from the available data, there is a risk of strong correlation between the original samples and the augmented ones, which can lead to overfitting on a small dataset. However, if the augmentation is correctly managed, it can boost the performance of the classifier. Indeed, data augmentation has been employed with success in many defect inspection methods [40, 41].

Other ways of enlarging the dataset have been explored, such as passing the input data through an encoder-decoder network that applies different transformations combined with random noise [42]. Another approach worth mentioning is the generation of *virtual samples*. For example, the work presented by Leng et al. [43] successfully exploits virtual samples for face reconstruction.

Virtual data can also be generated by producing synthetic images with the intent of covering the whole input feature space. Generative adversarial networks (GANs) [44] or the more recent conditional GANs (cGANs) [45] can be used for this purpose. However, this is computationally expensive and requires taking into account all possible configurations and boundary conditions to generate samples as close as possible to real ones. Domain randomization techniques [46] can be applied to synthetically generated data to improve the generalization capabilities and the robustness of the network.
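The idea behind domain randomization is to vary nuisance factors (lighting, exposure, sensor noise) across synthetic samples so the network cannot latch onto rendering artifacts. The sketch below only illustrates this idea on a toy grayscale image; the parameter ranges are arbitrary assumptions, and real implementations randomize many more factors (textures, camera pose, backgrounds) inside the rendering pipeline itself.

```python
import random

def randomize_domain(img, rng):
    """Apply random lighting gain, exposure offset, and per-pixel noise
    to one synthetic sample (hypothetical, illustrative parameter ranges)."""
    gain = rng.uniform(0.7, 1.3)    # random global lighting gain
    offset = rng.randint(-20, 20)   # random exposure/background shift
    return [
        [max(0, min(255, int(p * gain) + offset + rng.randint(-5, 5)))
         for p in row]
        for row in img
    ]

rng = random.Random(42)
synthetic = [[128] * 4 for _ in range(4)]  # a flat toy synthetic image
# Each randomized copy shows the network a different nuisance configuration.
randomized_batch = [randomize_domain(synthetic, rng) for _ in range(8)]
```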

Similar to what happens with humans, if the concepts or rules to be learned are not clearly defined, the training can lead to fuzzy assumptions and possibly to wrong outcomes. Additionally, when dealing with data obtained from a sensing apparatus, it is important to check the correctness of the acquired data samples to avoid possible causes of classification errors. A cleaning process should remove outliers (samples wrongly associated with a class) and spurious samples that could confuse the learning process. Industrial processes often rely on qualitative evaluation, and unfortunately different quality experts in the same industrial process may classify the same product as belonging to different classes. If the same confusion is transferred to the DL architecture, the learning process will probably degrade the decision process. For this reason, a preprocessing stage on the data is essential, and in most cases the help of domain professionals in interpreting, filtering, and preprocessing the data is welcome.

A last and quite important aspect is the adoption of correct performance metrics and *loss functions* enabling successful training with *imbalanced datasets*. In this context, Mower [47] proposes a balanced accuracy statistic that averages the *recall* and *specificity* metrics. A more general approach is to directly scale the confusion matrix terms based on the relative support of each class, as proposed by Tripicchio et al. [48]. Other studies modify the *loss function* to account for class imbalance. In particular, *binary cross-entropy loss* is a common choice for classification tasks. In a study by Xie and Tu [49], a balanced *cross-entropy* is introduced in which, differently from the standard *binary cross-entropy loss*, the contribution of the dominant class is multiplied by the fraction of the less dominant class. However, the method does not differentiate between easy and hard examples. A different approach is proposed by Lin et al. [50], where the authors focus the training on hard negatives, down-weighting the *loss* assigned to well-classified examples. The resulting *loss* is called *focal loss*.
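The metric and the two losses discussed above can be written down compactly for the binary case. This is a scalar, per-sample sketch under stated assumptions (label 1 denotes the minority defect class, `beta` is the minority fraction, `gamma` the focusing parameter of [50]); production code would use the vectorized versions provided by DL frameworks.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Balanced accuracy as the mean of recall and specificity (Mower [47])."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (recall + specificity)

def balanced_bce(p, y, beta):
    """Class-balanced binary cross-entropy in the spirit of Xie and Tu [49].

    p:    predicted probability of the minority (defect) class
    y:    ground truth, 1 = defect (minority), 0 = good (majority)
    beta: fraction of minority samples in the dataset (e.g., 0.02)
    """
    eps = 1e-12  # numerical guard against log(0)
    # The dominant-class term is scaled down by the minority fraction beta.
    return -((1 - beta) * y * math.log(p + eps)
             + beta * (1 - y) * math.log(1 - p + eps))

def focal_loss(p, y, gamma=2.0):
    """Focal loss (Lin et al. [50]): the modulating factor (1 - p_t)^gamma
    down-weights well-classified (easy) examples."""
    eps = 1e-12
    p_t = p if y == 1 else 1 - p  # probability assigned to the true class
    return -((1 - p_t) ** gamma) * math.log(p_t + eps)
```

For an easy example (e.g., a defect predicted with probability 0.9), the factor `(1 - 0.9)**2` shrinks the focal loss to about 1% of the plain cross-entropy, so training effort concentrates on the hard examples.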
