#### *5.2.1 Decision trees*

The decision tree algorithm C4.5 [26] is an efficient and robust machine learning method. It builds a tree top down, placing the most suitable attribute at each node. The suitable attribute is selected using the information gain (Eq. (15)), defined as the difference between the entropy of the remaining instances in the training dataset and the weighted sum of the entropies of the subsets obtained by partitioning on the values of that attribute.

$$\text{information gain}(D, A) = \text{entropy}(D) - \sum_{v \in V_A} \frac{|D_v|}{|D|} \cdot \text{entropy}(D_v) \tag{15}$$

Where: D is the training dataset, A is the considered attribute, V_A is the set of possible values of the attribute A, D_v is the subset of training instances in which the attribute A takes the value v, and the entropy of a set of instances is defined in Eq. (16):

$$\text{entropy}(D) = -\sum_{i=1}^{N} p(c_i) \cdot \log p(c_i) \tag{16}$$

Where: p(c_i) is the probability (relative frequency) of class c_i in the set, and N is the number of classes.
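
As an illustration, the following is a minimal sketch of Eqs. (15) and (16) for a single-label dataset; the toy dataset, attribute names, and helper functions are hypothetical:

```python
import math
from collections import Counter

def entropy(instances):
    """Entropy of a set of (features, class) pairs, Eq. (16)."""
    total = len(instances)
    counts = Counter(cls for _, cls in instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(instances, attribute):
    """Information gain obtained by partitioning `instances` on `attribute`, Eq. (15)."""
    total = len(instances)
    gain = entropy(instances)
    # Group the instances by the value v taken by `attribute` (the subsets D_v).
    subsets = {}
    for features, cls in instances:
        subsets.setdefault(features[attribute], []).append((features, cls))
    for subset in subsets.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical toy dataset: each instance is ({attribute: value}, class).
D = [({"outlook": "sunny"}, "no"),
     ({"outlook": "sunny"}, "no"),
     ({"outlook": "overcast"}, "yes"),
     ({"outlook": "rain"}, "yes")]
print(information_gain(D, "outlook"))  # 1.0: this split separates the two classes perfectly
```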

This entropy formula assumes a single-label setting, in which each leaf contains exactly one class. The C4.5 algorithm is therefore not directly applicable to multi-label datasets, and the entropy formula must be modified.

In [27], the learning process is adapted by allowing multiple labels in the leaves of the tree, and the entropy formula is modified to handle multi-label problems. The modified entropy sums the binary entropies of the individual class labels (Eq. (17)).

$$\text{entropy}(D) = -\sum_{i=1}^{N} \big( p(c_i) \cdot \log p(c_i) + q(c_i) \cdot \log q(c_i) \big) \tag{17}$$

Where: p(c_i) is the relative frequency of class label c_i, and q(c_i) = 1 − p(c_i).
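
A minimal sketch of the modified entropy of Eq. (17) follows, assuming each instance is represented simply by its set of labels; the label names are hypothetical, and terms with p = 0 or p = 1 contribute nothing by the usual convention 0 · log 0 = 0:

```python
import math

def multilabel_entropy(label_sets, labels):
    """Multi-label entropy, Eq. (17): sum of per-label binary entropies."""
    total = len(label_sets)
    ent = 0.0
    for c in labels:
        p = sum(1 for ls in label_sets if c in ls) / total  # relative frequency of label c
        q = 1.0 - p
        for prob in (p, q):
            if prob > 0.0:               # convention: 0 * log 0 = 0
                ent -= prob * math.log2(prob)
    return ent

# Hypothetical example: 4 instances, label space {A, B}.
D = [{"A"}, {"A", "B"}, {"B"}, {"A"}]
print(multilabel_entropy(D, ["A", "B"]))  # ≈ 0.811 + 1.0 = 1.811
```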

#### *5.2.2 K-nearest neighbors (KNN)*

Several methods based on the KNN algorithm exist. ML-KNN [28] is an extension of KNN to classification on multi-label datasets. It computes prior and posterior probabilities to determine the labels of a test instance. In the following, N is the size of the training dataset, K is the number of neighbors, and $\vec{y}_{x_i}(l)$ equals 1 if label l belongs to the label set of the training instance $x_i$ and 0 otherwise.


To classify the test instance T, we follow these steps:

• Compute the prior probability $P(H_1^l)$ of each label l, where $H_1^l$ is the event that an instance carries label l, using the whole training dataset (Eq. (18)):

$$P(H_1^l) = \frac{s + \sum_{i=1}^{N} \vec{y}_{x_i}(l)}{s \cdot 2 + N} \tag{18}$$

Where: N is the size of the training dataset and s is an input argument, a smoothing parameter controlling the strength of the uniform prior.


• Compute the posterior probabilities $P(E_j^l \mid H_1^l)$ and $P(E_j^l \mid H_0^l)$, where $E_j^l$ denotes the event that exactly j of the K nearest neighbors of an instance carry label l (Eqs. (19) and (20)):

$$P(E_j^l \mid H_1^l) = \frac{s + C_1[j]}{s \cdot (K+1) + \sum_{p=0}^{K} C_1[p]} \tag{19}$$

$$P(E_j^l \mid H_0^l) = \frac{s + C_2[j]}{s \cdot (K+1) + \sum_{p=0}^{K} C_2[p]} \tag{20}$$

Where: for each label l, the vector C_1 counts, for each j = 0, …, K, the training instances carrying label l whose K nearest neighbors contain exactly j instances with label l, and C_2 counts the analogous quantity for training instances not carrying label l.

• Compute the prediction for each label l by comparing $P(H_1^l) \cdot P(E_j^l \mid H_1^l)$ with $P(H_0^l) \cdot P(E_j^l \mid H_0^l)$, where j is the number of T's K nearest neighbors that carry label l (a sketch combining the three steps follows this list).
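
As an illustration, here is a minimal sketch combining the three steps under simple assumptions (brute-force Euclidean neighbor search, a binary label matrix Y, and Laplace smoothing s = 1 by default); the function names are hypothetical and this is not the reference implementation of [28]:

```python
import numpy as np

def mlknn_train(X, Y, K=10, s=1.0):
    """X: (N, d) feature matrix, Y: (N, L) binary label matrix."""
    N, L = Y.shape
    # Prior probabilities, Eq. (18).
    prior1 = (s + Y.sum(axis=0)) / (s * 2 + N)         # P(H_1^l) for each label
    prior0 = 1.0 - prior1                              # P(H_0^l)

    # For each training instance, count how many of its K nearest neighbors carry each label.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                     # exclude the instance itself
    knn = np.argsort(dist, axis=1)[:, :K]
    delta = Y[knn].sum(axis=1).astype(int)             # (N, L): neighbor label counts

    # Frequency vectors C_1 and C_2, indexed by j = 0..K, one pair per label.
    C1 = np.zeros((L, K + 1))
    C2 = np.zeros((L, K + 1))
    for i in range(N):
        for l in range(L):
            j = delta[i, l]
            if Y[i, l] == 1:
                C1[l, j] += 1
            else:
                C2[l, j] += 1

    # Posterior probabilities, Eqs. (19) and (20).
    post1 = (s + C1) / (s * (K + 1) + C1.sum(axis=1, keepdims=True))  # P(E_j^l | H_1^l)
    post0 = (s + C2) / (s * (K + 1) + C2.sum(axis=1, keepdims=True))  # P(E_j^l | H_0^l)
    return X, Y, K, prior1, prior0, post1, post0

def mlknn_predict(model, t):
    """Predict the label set of a test instance t by comparing prior * posterior."""
    X, Y, K, prior1, prior0, post1, post0 = model
    knn = np.argsort(np.linalg.norm(X - t, axis=1))[:K]
    counts = Y[knn].sum(axis=0).astype(int)            # j for each label
    L = Y.shape[1]
    return np.array([1 if prior1[l] * post1[l, counts[l]] >
                          prior0[l] * post0[l, counts[l]] else 0
                     for l in range(L)])
```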

#### *5.2.3 Support vector machine*

Support vector machines (SVMs) have been extended to handle the multi-label problem. For example, Rank-SVM [29] defines a linear model based on a ranking system combined with a label-set size predictor, with the aim of minimizing the ranking loss (Eq. (21)) while maximizing the margin.

$$\text{RLoss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i| \cdot |\overline{Y_i}|} \cdot |R(x_i)| \tag{21}$$

Where $R(x_i) = \{(l_1, l_2) \in Y_i \times \overline{Y_i} \mid f(x_i, l_1) \le f(x_i, l_2)\}$ is the set of label pairs in which an irrelevant label is scored at least as high as a relevant one, $\overline{Y_i}$ denotes the complement of $Y_i$ in the label set Y, and f is the scoring function whose value for a label l is interpreted as the probability that l is relevant.
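
As an illustration, a minimal sketch of the ranking loss of Eq. (21), assuming the scores f(x_i, l) are given as a matrix; the example values are hypothetical:

```python
import numpy as np

def ranking_loss(scores, Y):
    """Ranking loss, Eq. (21). scores: (N, L) values of f(x_i, l); Y: (N, L) binary relevance."""
    N, L = Y.shape
    loss = 0.0
    for i in range(N):
        relevant = np.where(Y[i] == 1)[0]
        irrelevant = np.where(Y[i] == 0)[0]
        if len(relevant) == 0 or len(irrelevant) == 0:
            continue  # the per-instance term is undefined when Y_i or its complement is empty
        # |R(x_i)|: pairs where an irrelevant label scores at least as high as a relevant one.
        bad_pairs = sum(1 for l1 in relevant for l2 in irrelevant
                        if scores[i, l1] <= scores[i, l2])
        loss += bad_pairs / (len(relevant) * len(irrelevant))
    return loss / N

# Hypothetical example: 2 instances, 3 labels.
Y = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.8, 0.3], [0.2, 0.7, 0.1]])
print(ranking_loss(scores, Y))  # 0.25: one mis-ordered pair out of two for the first instance
```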

#### *5.2.4 Ensemble methods*

AdaBoost.MH [30] is an extension of the AdaBoost algorithm designed to minimize the Hamming loss; for more details, see [31]. The minimization is performed by decomposing the problem into k orthogonal binary classification problems, as sketched below.
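
As an illustration, a minimal sketch of this decomposition only (the function name and data layout are hypothetical); AdaBoost.MH then runs a standard boosting loop while maintaining a weight distribution over the resulting (instance, label) pairs, which is not shown here:

```python
def expand_to_binary(X, Y, labels):
    """Expand each multi-label example (x, Y_x) into one binary example per label l:
    ((x, l), +1) if l belongs to Y_x, otherwise ((x, l), -1)."""
    expanded = []
    for x, y in zip(X, Y):
        for l in labels:
            expanded.append(((x, l), +1 if l in y else -1))
    return expanded

# Hypothetical example with label space {A, B, C}: 2 instances become 6 binary examples.
X = [[0.1, 0.2], [0.4, 0.5]]
Y = [{"A", "C"}, {"B"}]
for pair, sign in expand_to_binary(X, Y, ["A", "B", "C"]):
    print(pair, sign)
```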

AdaBoost.MR [30] is designed to find a hypothesis that ranks the correct labels at the top.
