**2.4 Overview of machine learning algorithms**

In this section of the chapter, a summary of the classical statistical modeling and ML approaches is presented to review the available methods for healthcare research, and also to summarize the selected methodology applied in this study. Statistical modeling has evolved in the last few decades and shaped the future of business analytics and data science, including the current use and applications of ML algorithms [31]. It represents a branch of applied mathematics, in which statistical methods are leveraged to analyze a dataset. Statistical models are the mathematical representation of real-world scenarios with certain assumptions undertaken. They play a fundamental role in making statistical inferences while studying the characteristics of a population, upon which hypotheses were framed [8]. These models

are not only useful in finding relationships between variables and the significance of those relationships, but they are also useful in the prediction and forecasting of future events.

ML is a subfield of the AI area, which includes statistics, mathematics, computer algorithms, etc., focused on building applications that learn and improve their predictive capabilities automatically over time without being specifically programmed to do so. ML models are built upon a statistical framework since they involve a large amount of data elements often described using statistical distributions. In the last two decades, ML algorithms have received a significant amount of attention in the fields of computer vision, natural language processing, autonomous driving vehicles, healthcare and drug development, e-commerce, to list a few due to the increased amounts of data availability and significant advancements in the computing power. ML algorithms can be broadly categorized as supervised, unsupervised, and semi-supervised algorithms [5, 7, 32, 33].

#### *2.4.1 Supervised learning algorithms*

Supervised learning is a set of algorithms that learn from the input space (*X*) to the output space (*Y*), i.e. *Y = f(X)* [34]. The major objective is to estimate the mapping function (*f*) to ensure that with an addition of a new data point (*x*), the outcome, (*y*), could be predicted [35]. Supervised learning algorithms are often applied to classification and prediction problems [32]. The following are the selected examples of supervised algorithms often employed in research studies: *logistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting, support vector machines (SVMs), Naïve Bayes*, *adaptive boosting (AdaBoost), artificial neural network (ANN), etc*. [36].

#### *2.4.2 Unsupervised learning algorithms*

Different from the supervised learning algorithms, the unsupervised learning algorithms try to understand the hidden patterns within the input dataset (*X*) [37]. The algorithms learn and uncover the patterns without the researcher's assistance [38]. These algorithms are often leveraged to find the naturally occurring clusters, reduce data dimensions, detect anomalies, etc*. k-means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), apriori algorithm (association rule)* represent a few examples of these types of algorithms [36]. In some cases, a semi-supervised approach is used to enhance the model performance with the help of a small amount of labeled data [36].

Depending on the study objectives and the availability and granularity of data, algorithms are reviewed for analytical relevance, tested for performance, data type fit, and selected as optimal algorithms accordingly. For this chapter, LR and XGB models were chosen to develop a predictive algorithm for the endometriosis onset. LR estimated the odds of the condition occurrence for a given medical event [39], while XGB provided more flexibility in fine-tuning the hyper-parameters when compared to other tree-based algorithms [40].

#### *2.4.3 Logistic regression*

An LR is a statistical model as well as the simplest version of ML algorithms that uses a logistic function to model a binary dependent variable with two possible outcomes: '*0*' and '*1*' [39, 41, 42]. A *multinomial logistic regression* is also often considered for research studies with multiple outcomes. LR is applied in a variety of fields, including healthcare research and social sciences [43].

#### *Applying Machine Learning Algorithms to Predict Endometriosis Onset DOI: http://dx.doi.org/10.5772/intechopen.101391*

In regression modeling, analysis often involves interpreting the independent variables' coefficients. Regression coefficients describe the size and direction of the relationship between regressors (*x*) and the outcome variable (*y*). They explain the behavior of the dependent variable given a unit change in an independent variable while holding all other data elements constant. The magnitude and sign of the coefficients signify the resulting relationship with the dependent variable. Interpreting the LR's coefficients also includetheir interpretation, as well as the odds and odds ratios [41].

*Odds* exemplify the ratio of probabilities of two mutually exclusive events [41], at the same time the *odds ratio* represents the ratio of two different odds. The simplest way to calculate the *odds ratio* in the LR is to exponentiate the coefficient of a predictor [39]. As a result, if the *odds ratio* for the age variable in years is 1.25, then for each additional year, the probability of event/success increases by 25%. For categorical features, the interpretation of the *odds ratio* can be more meaningful than the interpretation of *odds* [41].
