**3.1 Important features selection**

*Applying Machine Learning Algorithms to Predict Endometriosis Onset DOI: http://dx.doi.org/10.5772/intechopen.101391*

#### **Table 3.**

*Classification metrics of train and test sets for LR and XGB models.*

**Table 3** presents the ML model performance metrics of the initial run, where the objective was to select the top features and to assess whether the captured data were reasonably useful for disease prediction. Algorithms were trained on 70% of the analytical dataset and tested on the remaining 30%. The captured metrics indicated that both the LR and XGB models performed relatively well in predicting the condition onset: model accuracy ranged between 88% and 96%. **Figure 2** presents the ROC curves on the test set for the LR and XGB models. The area under the ROC curve (AUC) values were 0.88 and 0.96 for the LR and XGB models, respectively.
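The 70/30 evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's actual pipeline: the data are synthetic, the split is stratified (the text does not say whether it was), and scikit-learn's `GradientBoostingClassifier` is swapped in for XGB so that the sketch depends on a single library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the analytical dataset (hypothetical, not the study data).
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10, random_state=0
)

# 70% train / 30% test; stratification is an assumption made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    # Gradient boosting from scikit-learn, used here in place of XGB.
    "GB": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC is computed from predicted probabilities of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, AUC={metrics['auc']:.3f}")
```

The printed numbers depend entirely on the synthetic data and are not comparable to the values reported in Table 3.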

From the outputs of the initial model run, features from the LR model were ranked in descending order of absolute regression coefficient, and the top 1000 with a coefficient magnitude greater than zero were selected. Similarly, another set of the top 1000 features with feature importance greater than zero was identified from XGB. Both sets were combined to establish a unique list of top features. As the next step, the Chi-Square test for feature selection from the Python *scikit-learn* package was applied to select the top 1000 most significant features for the final model run. The top features were selected at a standard significance level of 5% (α = 0.05). Most of the top significant features were associated with a series of medical and surgical procedures, as well as various diagnostic and comorbid conditions.
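The selection steps above could look roughly like the sketch below. Several details are assumptions for illustration: the data are synthetic and binarized (the Chi-Square test in scikit-learn requires non-negative features), `GradientBoostingClassifier` stands in for XGB, the models are fitted on all of the toy data, `TOP_K` is reduced from 1000 to 10, and the Chi-Square test is applied to the combined list with the α = 0.05 filter applied before the top-k cut. The helper `top_k_nonzero` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression

TOP_K = 10    # the chapter uses 1000; reduced here for the toy data
ALPHA = 0.05  # significance level stated in the text

X_cont, y = make_classification(
    n_samples=1000, n_features=40, n_informative=8, random_state=0
)
# Binarize so features are non-negative, mimicking presence/absence of a code.
X = (X_cont > 0).astype(int)


def top_k_nonzero(scores, k):
    """Indices of the k largest scores, keeping only scores greater than zero."""
    order = np.argsort(scores)[::-1][:k]
    return [int(i) for i in order if scores[i] > 0]


lr = LogisticRegression(max_iter=1000).fit(X, y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)  # stand-in for XGB

lr_top = top_k_nonzero(np.abs(lr.coef_[0]), TOP_K)      # by |coefficient|
gb_top = top_k_nonzero(gb.feature_importances_, TOP_K)  # by importance

# Combine both sets into one list of unique feature indices.
candidates = sorted(set(lr_top) | set(gb_top))

# Chi-Square test of each candidate feature against the label.
chi2_stat, p_values = chi2(X[:, candidates], y)

ranked = sorted(
    zip(candidates, chi2_stat, p_values), key=lambda t: t[1], reverse=True
)
selected = [(f, s, p) for f, s, p in ranked if p < ALPHA][:TOP_K]

for feature, stat, p in selected:
    print(f"feature {feature}: chi2={stat:.2f}, p={p:.4f}")
```

Taking the union before the Chi-Square step keeps a feature that only one of the two models ranked highly, which matches the text's description of a combined unique list.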

**Table 4** presents the list of the most significant features identified by the Chi-Square test that were associated with the endometriosis diagnosis.

**Figure 2.** *XGB & LR ROC curves on test set.*


#### **Table 4.**

*Most significant features from LR, XGB, and Chi-Square test.*

The table also presents the LR coefficients to indicate the direction of the relationship between endometriosis onset and the selected top regressors. As noted in the earlier version of the chapter available on Research Square, diagnosis codes such as 'non-inflammatory disorder of uterus' and 'pelvic and perineal pain' indicated a positive relationship with symptoms of endometriosis [21, 50]. Procedure codes such as 'anesthesia of lower abdomen for laparoscopy' and 'vaginal hysterectomy including biopsy' were also identified as procedures often correlated with the diagnosis as well as the treatment of endometriosis [50]. Furthermore, the Chi-Square test suggested that patients often consulted a variety of healthcare specialists, including 'emergency medicine (SPCLT\_EM),' 'family medicine (SPCLT\_FM),' and 'obstetrics and gynecology (SPCLT\_OBG),' when experiencing gynecological symptoms and concerns; however, a larger number of office visits might lower the likelihood of the condition being diagnosed, as indicated by the negative regressor coefficients.
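To make the reading of coefficient signs concrete, the short sketch below converts logistic regression coefficients into a direction and an odds ratio (the exponential of the coefficient). The feature names and coefficient values are hypothetical placeholders, not values from Table 4, and `describe` is an illustrative helper.

```python
import math

# Hypothetical coefficients for illustration only; not values from Table 4.
coefficients = {"feature_a": 0.80, "feature_b": -0.35, "feature_c": 0.0}


def describe(coef):
    """Return (direction, odds ratio) for one logistic regression coefficient."""
    if coef > 0:
        direction = "positive"
    elif coef < 0:
        direction = "negative"
    else:
        direction = "none"
    # exp(coef) is the multiplicative change in odds per unit increase.
    return direction, math.exp(coef)


summary = {name: describe(c) for name, c in coefficients.items()}
for name, (direction, odds_ratio) in summary.items():
    print(f"{name}: {direction} association, odds ratio {odds_ratio:.2f}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient; the sign describes an association in the fitted model, not a causal effect.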
