**7. Model performance evaluation**

Measuring the performance of our models is crucial for evaluating their suitability for different tasks. Likewise, when fine-tuning, performance measures indicate which parameters yield the best results. Every time we test a model, part of the dataset (in this particular case, images) already has labels assigned to it; this labeling is done by medical professionals specialized in the pathology under study. When the model processes these samples and predicts new labels, the predictions are compared with the original labels (called the ground truth). From the result of this comparison, what is called a confusion matrix is constructed [29, 30]. This structure contains the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A TP or TN is counted when the prediction and the ground truth agree for a given sample (e.g., a sample is a TN when the model predicted negative and the image was negative). Conversely, a FP or FN is counted when the model and the ground truth disagree (e.g., if the model predicted negative and the ground truth indicated a positive sample, the sample is a FN) [29].

Almost all of the other global metrics usually reported in the literature are derived from these four counts. For example, the accuracy of a model corresponds to the number of samples correctly predicted by the model over the total number of samples. In terms of the previous counts, the correctly predicted samples are the sum of TP and TN, and the total number of samples is simply the sum of TP, TN, FP, and FN [29].
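As a minimal sketch of the idea above, the snippet below counts TP, TN, FP, and FN by comparing a list of predicted labels against the ground truth, then derives accuracy as (TP + TN) / (TP + TN + FP + FN). The label encoding (1 = positive, 0 = negative) and the example data are illustrative assumptions, not taken from the original study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification task.

    A TP/TN is counted when prediction and ground truth agree;
    a FP/FN is counted when they disagree.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) over all samples (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical ground truth and predictions for six samples
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # → 2 2 1 1
print(accuracy(tp, tn, fp, fn))  # → 4/6 ≈ 0.667
```

In practice, libraries such as scikit-learn provide equivalent functions (e.g., `sklearn.metrics.confusion_matrix`), but the hand-rolled version makes the definitions explicit.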
