#### *2.4.4 Extreme gradient boosting*

Gradient boosting is another ML algorithm: an ensemble of simple, weak, and individually unreliable predictors, mainly decision trees [40]. When multiple trees are grouped, they create a robust and reliable algorithm [44]. Extreme gradient boosting (XGB) starts by creating a first simple tree [45] and then builds on the weak learners: each iteration adds a tree that corrects the errors of the previous ones until an optimal point is reached [46].
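To make this concrete, below is a minimal sketch of fitting a boosted tree ensemble with the *xgboost* scikit-learn wrapper; the synthetic dataset and hyperparameter values are illustrative placeholders, not the chapter's actual configuration.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Illustrative synthetic data standing in for the analytical dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each boosting round adds a shallow ("weak") tree that corrects the
# residual errors of the trees built so far.
model = XGBClassifier(
    n_estimators=100,   # number of boosting rounds (trees)
    max_depth=3,        # keep individual trees simple
    learning_rate=0.1,  # shrink each tree's contribution
)
model.fit(X, y)
```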

Feature importance is a value generated by tree-based models, including *decision trees, random forests, XGB*, etc. [40]. The measure signifies how important a feature is to the model, i.e., how effective the feature is at reducing node impurity. Feature importance is also known as *'gini importance'* or *'mean decrease impurity,'* and is defined as the total decrease in node impurity averaged over the trees in the ensemble [44]. XGB offers three ways to calculate it: *weight, gain, and cover*, where *'weight'* represents the number of times a feature appears in a tree, *'gain'* denotes the average gain of the splits that use the feature, and *'cover'* is the average coverage of those splits, i.e., the number of samples impacted by them [46].
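Assuming a fitted model like the sketch above, the three XGB importance types can be read off the underlying booster via `get_score`; the data and feature names here are again illustrative.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Fit a small illustrative model, as in the previous sketch.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

booster = model.get_booster()
for importance_type in ("weight", "gain", "cover"):
    # Maps feature names ('f0', 'f1', ...) to scores; features never used
    # in any split are omitted from the mapping.
    scores = booster.get_score(importance_type=importance_type)
    top5 = sorted(scores.items(), key=lambda kv: -kv[1])[:5]
    print(importance_type, top5)
```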

#### *2.4.5 Chi-Square test*

The Chi-Square test is a nonparametric test [33], often employed to test the independence between the observed and expected frequencies of one or more data elements; it is also known as the *'goodness of fit'* test [47]. In this chapter, the Chi-Square test was utilized to select the most significant features [48].
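A minimal sketch of Chi-Square feature selection with *scikit-learn* follows; the feature and sample counts, and the choice of *k*, are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# chi2 requires non-negative feature values, so scale to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the k features with the highest Chi-Square statistics.
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)
print(X_selected.shape)  # (500, 10)
```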

#### *2.4.6 p-value*

The *p*-value is the probability of obtaining an observed result, assuming that the null hypothesis is correct. The *p*-value is used to test whether the null hypothesis can be rejected in favor of the alternative hypothesis; a lower *p*-value implies stronger evidence in support of the alternative [23]. In this analysis, the significance level was set at 5% to aid the feature importance evaluation and the identification of statistically significant results.
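Continuing the feature-selection sketch above, the *scikit-learn* `chi2` scorer also returns per-feature *p*-values, which can be thresholded at the 5% significance level; the data are again synthetic placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative inputs

# chi2 returns one (statistic, p-value) pair per feature.
chi2_stats, p_values = chi2(X_scaled, y)
significant = np.flatnonzero(p_values < 0.05)  # 5% significance level
print(f"{significant.size} features significant at p < 0.05")
```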

#### *2.4.7 Classification metrics*

The following classification metrics are often leveraged to validate the ML models' performance. A confusion matrix is generated from the predicted probability values with *0.5* as the classification threshold: patients with probability values greater than or equal to *0.5* are classified as *1*, and those below *0.5* are classified as *0*. Below is the list of metrics used in evaluating model performance [32, 43, 46, 49]:

Confusion matrix:

|                       | Predicted positive (*1*) | Predicted negative (*0*) |
|-----------------------|--------------------------|--------------------------|
| Actual positive (*1*) | True positive (TP)       | False negative (FN)      |
| Actual negative (*0*) | False positive (FP)      | True negative (TN)       |
Model performance metrics:
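As a minimal sketch, the metrics commonly derived from such a *0.5*-threshold confusion matrix (accuracy, precision/PPV, recall/sensitivity, specificity, F1-score, and AUC) are shown below; this is an assumed, typical set rather than a verbatim reproduction of the list in [32, 43, 46, 49], and the labels and probabilities are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                  # illustrative labels
y_prob = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3])  # predicted probabilities

y_pred = (y_prob >= 0.5).astype(int)  # 0.5 classification threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)                  # positive predictive value
recall      = tp / (tp + fn)                  # sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
auc         = roc_auc_score(y_true, y_prob)   # threshold-independent
```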


In this chapter, the LR model, being the simplest of the ML algorithms considered, was chosen as the base model. Both the LR and XGB models were trained on the analytical dataset defined in the earlier sections of this chapter, and the top 1000 features from each algorithm were selected to reduce the dataset dimension. Next, the Chi-Square test from the *scikit-learn* Python package was utilized to identify the most significant features among the data elements employed in both models. Finally, the algorithms were re-trained on these significant features to identify the key data elements for predicting endometriosis onset. All ML algorithms were trained in Python 3.5 using the *'scikit-learn'* and *'xgboost'* libraries.
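A hedged end-to-end sketch of this pipeline follows, with synthetic data, placeholder hyperparameters, and the top-feature count scaled down from 1000 to 100 to suit the smaller example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=200, random_state=0)
TOP_K = 100  # stand-in for the chapter's top 1000 features

# Step 1: train the base LR model and the XGB model on the full feature set.
lr = LogisticRegression(max_iter=1000).fit(X, y)
xgb = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# Step 2: take each model's top features (|coefficient| for LR, importance
# for XGB) to reduce the dataset dimension.
lr_top = np.argsort(-np.abs(lr.coef_[0]))[:TOP_K]
xgb_top = np.argsort(-xgb.feature_importances_)[:TOP_K]
candidates = np.union1d(lr_top, xgb_top)

# Step 3: Chi-Square test on the candidate features, keeping those
# significant at the 5% level (chi2 needs non-negative inputs).
X_cand = MinMaxScaler().fit_transform(X[:, candidates])
_, p_values = chi2(X_cand, y)
selected = candidates[p_values < 0.05]

# Step 4: re-train both models on the significant features only.
lr_final = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
xgb_final = XGBClassifier(n_estimators=100, max_depth=3).fit(X[:, selected], y)
```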
