**3.1 Background and significance**

Machine learning (ML) is often described as a subarea of artificial intelligence that seeks to recognize subtle patterns within data. The field has roots dating back to early work by Arthur Samuel in 1959 [80]. Despite this long history, it is the technological advances of the past two and a half decades that have considerably revived interest in ML. With increases in both data collection and computational power, the applications of machine learning algorithms have become vast and are now integral parts of our everyday lives (e.g., facial recognition, spam detection, etc.).

One explanation for the utility of ML approaches is that they are able to model complex structures in data and leverage this detailed information to accurately predict or classify unobserved outcomes. Unique to these algorithms is their ability to adaptively update themselves (learning) through repeated exposure to new observations (a process formally known as "training") [81, 82]. Intuitively, an algorithm should achieve higher predictive accuracy after training on larger data sets: the more possibilities an algorithm is exposed to, the better it becomes at correctly identifying similar complex patterns in heterogeneous populations [82, 85]. This represents a common tenet of ML theory: the more data, the better. However, simply having data is not always enough. A second tenet relates to the strength of the signal linking the observed data to the scientific question of interest. The greater the signal-to-noise ratio, the more amenable the task is to ML methodology. In practice, there is a general relationship between these two tenets: the more data one has, the weaker the signal-to-noise ratio can be while still achieving an acceptable prediction or classification; conversely, a high signal-to-noise ratio can compensate for less data. Note that this is not a strict relationship, as many studies have demonstrated that ML algorithms can perform well on noisy data sets with few observations.
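To make the interplay between these two tenets concrete, the short simulation below offers a minimal sketch of the idea: it generates synthetic data with a known signal-to-noise ratio, fits a simple classifier, and reports held-out accuracy as the sample size and the signal-to-noise ratio vary. The feature count, sample sizes, signal-to-noise grid, and the choice of a logistic regression learner are hypothetical illustration choices and are not drawn from the works cited above.

```python
# Hypothetical simulation of the two tenets: held-out accuracy as a
# function of sample size (n) and signal-to-noise ratio (SNR).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
p = 10  # number of features (illustrative choice)

def heldout_accuracy(n, snr):
    """Simulate a binary outcome driven by a linear signal plus Gaussian noise."""
    X = rng.normal(size=(n, p))                 # n x p design matrix
    beta = rng.normal(size=p)                   # true effect sizes
    signal = X @ beta
    noise_sd = signal.std() / np.sqrt(snr)      # scale noise to hit the target SNR
    y = (signal + rng.normal(scale=noise_sd, size=n) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

for n in (100, 1000, 10000):
    for snr in (0.5, 2.0, 8.0):
        print(f"n={n:>6}, SNR={snr:>4}: accuracy={heldout_accuracy(n, snr):.2f}")
```

In this toy setting, accuracy generally improves along both axes, echoing the tradeoff described above: abundant data can partly offset a weak signal, and a strong signal can partly offset scarce data.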

With its growing popularity in the biological literature, the formal connection between machine learning and more traditional statistical sciences cannot be overlooked. Indeed, many current approaches in ML are motivated by prediction; however, there are opportunities to pair these tools with fundamental probabilistic concepts to improve power for inference-based tasks as well. This is particularly relevant for biological problems where it is also important to understand the processes that are contributing to better predictions. To this end, recent works have used (interpretable) ML algorithms for live risk stratification in cancer patients [6], novel biomarker identification in liquid biopsies [87], hypoxemia prevention during surgery [88], point-of-care diagnosis of lymphoma [89], as well as many other uses in genetics and genomics [82, 83].


**Figure 1.**
*Basic "decision tree" schematic for deciding between different statistical and machine learning methods. Here, approaches are grouped by their designed purpose for applications.*



With an increasing literature on both statistical and machine learning methods, it can be difficult to decide which algorithm to use. **Figure 1** provides a general approach for determining the proper choice [90]. In this section, however, we will focus on detailing an increasingly popular machine learning method known as a neural network (NN). Although NNs excel at classification tasks (see **Figure 1**), many recent works have focused on applying neural networks to a wider range of applications [83, 91, 92].




**3.2 Probabilistic formulation**

For simplicity, we will consider an arbitrary data analysis problem. Let *y* be an *n*-dimensional response/outcome vector for *n* individuals. Assume that for each individual we measure *p* features and tabulate their collection via an *n* × *p* design matrix *X*. Statistically, these features are variables that we believe will help accurately predict the outcome. In the case of our research on vesicle biology, features may be biophysical (e.g., vesicle diameter and volume), genomic (e.g., sequence data), proteomic, or lipidomic measurements. Following previous work, we may specify a (Bayesian) NN by assuming some hierarchical architecture to "learn" the predicted response for each observation in the data [96].
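As a purely illustrative companion to this notation, the sketch below builds a random *n* × *p* design matrix *X*, an *n*-dimensional outcome *y*, and a single hidden-layer network whose weights are drawn from Gaussian priors, so that a forward pass yields one predicted response per observation. The layer width, prior scales, and tanh activation are assumptions made for the example only; this is not the specific hierarchical architecture developed in the previous work cited as [96], and no posterior inference (i.e., actual Bayesian training) is performed.

```python
# Illustrative sketch of the notation: an n x p design matrix X, an
# n-dimensional outcome y, and a one-hidden-layer network whose weights
# are drawn from N(0, 1) priors (a single prior draw; no training).
import numpy as np

rng = np.random.default_rng(1)
n, p, hidden = 200, 5, 16            # observations, features, hidden units (assumed)

X = rng.normal(size=(n, p))          # p features measured on each of n individuals
y = rng.normal(size=n)               # observed response/outcome vector

W1 = rng.normal(size=(p, hidden))    # input-to-hidden weights ~ N(0, 1) prior
b1 = rng.normal(size=hidden)         # hidden-layer biases
W2 = rng.normal(size=(hidden, 1))    # hidden-to-output weights
b2 = rng.normal()                    # output bias

def predict(X):
    """Forward pass through the hierarchical architecture."""
    h = np.tanh(X @ W1 + b1)         # nonlinear hidden representation
    return (h @ W2).ravel() + b2     # predicted response, one value per observation

y_hat = predict(X)
print(y_hat.shape)                   # (n,) -- a prediction for every individual
```

In a full Bayesian treatment, these prior draws would be replaced by (approximate) posterior samples of the weights conditioned on the observed outcomes *y*.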

