**4. Evaluation measures**

For classical single-label classification, various performance measures such as accuracy and coverage have been proposed. However, performance measures for classification in multi-label datasets are more complicated than those for single-label datasets. Consequently, a number of evaluation measures have been proposed specifically for multi-label datasets. These measures are categorized into two groups: example-based measures and label-based ones.

#### **4.1 Example-based measures**

These measures evaluate each instance of the test dataset and return the mean value. They are further divided into two groups: prediction-based measures and ranking-based ones. The former require a learning system and are calculated from the average difference between the actual and the predicted sets of labels over all test instances, whereas the latter evaluate the quality of the label ranking produced by the scoring function f(.,.).

#### *4.1.1 Prediction-based measures*

Hamming Loss [9] represents the average number of misclassified labels per instance, that is, the size of the symmetric difference between the predicted and the true label sets. The lower this measure, the better the classifier's performance (Eq. (4)).

$$\text{Hamming Loss} = \frac{1}{N} \sum_{i=1}^{N} |Y_i \,\Delta\, Z_i| \quad \text{where } Y_i \,\Delta\, Z_i = \text{XOR}(Y_i, Z_i) \tag{4}$$
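
As an illustration, the following is a minimal Python sketch of Eq. (4); the function name and the representation of each $Y_i$ and $Z_i$ as a Python set are our own conventions, not part of the original formulation.

```python
# Minimal sketch of Eq. (4). Y and Z are lists of sets: Y[i] holds the
# true labels of instance i, Z[i] the predicted ones (names illustrative).
def hamming_loss(Y, Z):
    N = len(Y)
    # len(Yi ^ Zi) is |Y_i Δ Z_i|, the size of the symmetric difference (XOR)
    return sum(len(Yi ^ Zi) for Yi, Zi in zip(Y, Z)) / N
```

For example, `hamming_loss([{1, 2}, {3}], [{1}, {3, 4}])` returns 1.0: on average, one label per instance is misclassified.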

Classification Accuracy [10] represents the fraction of well-classified instances. It is very strict, as it requires the predicted set of labels to be an exact match of the true set of labels. It is also known as Subset Accuracy [11] (Eq. (5)).

$$\text{Classification Accuracy} = \frac{1}{N} \sum_{i=1}^{N} I(Z_i = Y_i) \tag{5}$$

where $I(Z_i = Y_i) = 1$ if $Z_i = Y_i$, and $0$ otherwise.
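
A corresponding sketch of Eq. (5), under the same illustrative list-of-sets conventions as above:

```python
# Minimal sketch of Eq. (5); same list-of-sets convention as above.
def subset_accuracy(Y, Z):
    # I(Z_i = Y_i) contributes 1 only when the two sets match exactly
    return sum(1 for Yi, Zi in zip(Y, Z) if Yi == Zi) / len(Y)
```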

Accuracy [12] represents the proportion of correctly predicted labels among the union of predicted and true labels, averaged over all instances (Eq. (6)).

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \tag{6}$$
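
This is the Jaccard similarity of the two label sets, averaged over instances; a sketch of Eq. (6), assuming every instance has at least one true or predicted label so the union is never empty:

```python
# Minimal sketch of Eq. (6); assumes Y_i ∪ Z_i is never empty, i.e. every
# instance has at least one true or predicted label.
def accuracy(Y, Z):
    # Jaccard similarity |Y_i ∩ Z_i| / |Y_i ∪ Z_i| of each pair of sets
    return sum(len(Yi & Zi) / len(Yi | Zi) for Yi, Zi in zip(Y, Z)) / len(Y)
```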

Precision represents the proportion of predicted labels that are true positives (Eq. (7)) [13].

$$\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Z_i|} \tag{7}$$

Recall estimates the proportion of true labels that have been predicted as positive (Eq. (8)) [13].

$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i|} \tag{8}$$
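
The two measures differ only in the denominator, as the following sketch of Eqs. (7) and (8) makes explicit; non-empty $Z_i$ and $Y_i$ are assumed, respectively:

```python
# Minimal sketches of Eqs. (7) and (8). precision assumes every predicted
# set Z_i is non-empty; recall assumes every true set Y_i is non-empty.
def precision(Y, Z):
    return sum(len(Yi & Zi) / len(Zi) for Yi, Zi in zip(Y, Z)) / len(Y)

def recall(Y, Z):
    return sum(len(Yi & Zi) / len(Yi) for Yi, Zi in zip(Y, Z)) / len(Y)
```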

#### *4.1.2 Ranking-based metrics [14]*

Coverage error evaluates how many steps are needed, on average, to move down the ranked label list so as to cover all the relevant labels of the instance (Eq. (9)).

$$\text{Coverage error} = \frac{1}{N} \sum_{i=1}^{N} \max_{y \in Y_i} \operatorname{rank}_f(X_i, y) - 1 \tag{9}$$
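
A sketch of Eq. (9); here the scoring function $f(\cdot,\cdot)$ is assumed to be given, for each instance, as a mapping from every label to its score, which is an illustrative convention of our own:

```python
# Minimal sketch of Eq. (9). scores[i] is a dict mapping every label to
# f(X_i, label); Y[i] is the true label set. All names are illustrative.
def coverage_error(scores, Y):
    total = 0
    for f_i, Yi in zip(scores, Y):
        # rank_f(X_i, y): 1-based position of y when labels are sorted
        # by decreasing score
        ranking = sorted(f_i, key=f_i.get, reverse=True)
        rank = {label: pos + 1 for pos, label in enumerate(ranking)}
        total += max(rank[y] for y in Yi) - 1
    return total / len(Y)
```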

One-error computes the fraction of instances whose top-ranked label is not in the true set of labels (Eq. (10)).

$$\text{One-error} = \frac{1}{N} \sum_{i=1}^{N} I\left(\left[\arg\max_{y} f(X_i, y)\right] \notin Y_i\right) \tag{10}$$
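
A sketch of Eq. (10), using the same scores/Y convention as the coverage error sketch above:

```python
# Minimal sketch of Eq. (10); same scores/Y convention as above.
def one_error(scores, Y):
    errors = 0
    for f_i, Yi in zip(scores, Y):
        top = max(f_i, key=f_i.get)  # arg max of f(X_i, .) over all labels
        errors += top not in Yi      # I(.) = 1 when the top label is wrong
    return errors / len(Y)
```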

#### **4.2 Label-based measures**

To present the label-based measures, we first compute the four components of the confusion matrix for each label $y_i$, namely $TP_i$, $FP_i$, $TN_i$, and $FN_i$, which denote respectively the true positives, false positives, true negatives, and false negatives (Eqs. (11)-(14)) [15].

$$TP_i = |\{X_j \text{ where } y_i \in Y_j \text{ and } y_i \in Z_j;\; 1 \le j \le N\}| \tag{11}$$

$$FP_i = |\{X_j \text{ where } y_i \notin Y_j \text{ and } y_i \in Z_j;\; 1 \le j \le N\}| \tag{12}$$

$$TN_i = |\{X_j \text{ where } y_i \notin Y_j \text{ and } y_i \notin Z_j;\; 1 \le j \le N\}| \tag{13}$$

$$FN_i = |\{X_j \text{ where } y_i \in Y_j \text{ and } y_i \notin Z_j;\; 1 \le j \le N\}| \tag{14}$$
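
The four counts of Eqs. (11)-(14) can be sketched for a single label as follows, reusing the list-of-sets convention from the earlier sketches:

```python
# Minimal sketch of Eqs. (11)-(14): the four confusion-matrix counts of a
# single label y_i over all N instances (list-of-sets convention as before).
def label_confusion(label, Y, Z):
    TP = sum(1 for Yj, Zj in zip(Y, Z) if label in Yj and label in Zj)
    FP = sum(1 for Yj, Zj in zip(Y, Z) if label not in Yj and label in Zj)
    TN = sum(1 for Yj, Zj in zip(Y, Z) if label not in Yj and label not in Zj)
    FN = sum(1 for Yj, Zj in zip(Y, Z) if label in Yj and label not in Zj)
    return TP, FP, TN, FN
```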

The label-based measures evaluate each label and return the average. The average over all labels can be computed in two ways: macro-average and micro-average [16]. In macro-average, we calculate the performance measure for each label separately (Eqs. (15) and (16)) and then take the average. In micro-average, by contrast, we first aggregate the per-label counts over all labels and then compute the measure once (Eqs. (17) and (18)).
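
To make the distinction concrete, here is a sketch of the two averaging schemes using precision as the underlying measure; the particular measures instantiated in Eqs. (15)-(18) may differ, and `label_confusion` is the illustrative helper defined above:

```python
# Sketch of the two averaging schemes, using precision as an example
# measure. labels is the full label set; label_confusion is defined above.
def macro_precision(labels, Y, Z):
    # macro: compute the measure per label, then average the results
    values = []
    for y in labels:
        TP, FP, _, _ = label_confusion(y, Y, Z)
        values.append(TP / (TP + FP) if TP + FP else 0.0)
    return sum(values) / len(labels)

def micro_precision(labels, Y, Z):
    # micro: pool the per-label counts first, then compute the measure once
    counts = [label_confusion(y, Y, Z) for y in labels]
    TP = sum(c[0] for c in counts)
    FP = sum(c[1] for c in counts)
    return TP / (TP + FP) if TP + FP else 0.0
```

Macro-average gives every label equal weight regardless of its frequency, whereas micro-average weights labels by their number of occurrences, so the two can differ markedly on imbalanced label distributions.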

