**3. Machine learning**


In this section, we discuss machine learning (ML) as an emerging scientific field of sophisticated algorithms that aid in understanding how nonlinear interactions between molecular features contribute to disease etiology. Here, we give relevant background on how machine learning is used in biology, provide a formal and probabilistic specification of the hierarchical architectures implemented by common ML methods (Bayesian deep neural networks), and demonstrate their power via real data applications. Indeed, a myriad of well-established algorithms could be surveyed in detail, but our main goal is to develop a conceptual pipeline for applying machine learning techniques to individualized biological problems.

In the context of our own research interests, we have found vesicle biology to be amenable to ML because of (i) the ability to observe millions of vesicles during a single study and (ii) the nonlinear nature of downstream vesicle effects. As we will show, large sample sizes and the presence of variable interactions are often leveraged by ML algorithms to provide high predictive accuracies. We hypothesize that these performance gains will contribute to a more complete picture of how vesicle behavior impacts the overall cellular environment.

**3.1 Background and significance**

Machine learning is often described as a subarea of artificial intelligence that seeks to recognize subtle patterns found within data. It has been noted that the field has roots dating back to early work done by Arthur Samuel in 1959 [80]. However, despite this long history, it is only the technological advances of the past two and a half decades that have considerably revived interest in ML. With increases in both data collection and computational power, the applications for machine learning algorithms have become vast and integral parts of our everyday lives (e.g., facial recognition, spam detection, etc.).

One explanation for the utility of ML approaches is that they are able to model complex structures in data and leverage this detailed information to accurately predict or classify unobserved outcomes. Unique to these algorithms is their ability to adaptively update themselves (learning) through repeated exposure to new observations (a process formally known as "training") [81, 82]. Intuitively, an algorithm should achieve a higher predictive accuracy after training on larger data sets: the more possibilities an algorithm is exposed to, the better it becomes at correctly identifying similar complex patterns in heterogeneous populations [82, 85]. This represents a common tenet of ML theory: the more data, the better. However, just having data is not always enough. A second tenet relates to the strength of the signal between the observed data and the scientific question of interest. The greater the signal-to-noise ratio, the more amenable the task is to ML methodology. In practice, there is a general relationship between these two tenets: the more data one has, the weaker the signal-to-noise ratio can be while still achieving an acceptable prediction/classification; conversely, a high signal-to-noise ratio can compensate for less data. Note that this is not a strict relationship, as many have demonstrated ML algorithms to perform well on noisy data sets with few observations.

With its growing popularity in the biological literature, the formal connection between machine learning and the more traditional statistical sciences cannot be overlooked. Indeed, many current approaches in ML are motivated by prediction; however, there are opportunities to pair these tools with fundamental probabilistic concepts to improve power for inference-based tasks as well. This is particularly relevant in biological applications, where understanding underlying mechanisms is often as important as prediction itself.


With an increasing literature on both statistical and machine learning methods, it can be difficult to decide which algorithm to use. **Figure 1** provides a general approach for determining the proper choice [90]. In this section, however, we will focus on detailing an increasingly popular machine learning method known as a neural network (NN). Although NNs excel at classification tasks (see **Figure 1**), many recent works have focused on applying neural networks to a wider range of applications [83, 91, 92].

**3.2 Probabilistic formulation**

For simplicity, we will consider an arbitrary data analysis problem. Let *y* be an *n*-dimensional response/outcome vector for *n* individuals. Assume that for each individual, we measure *p* features and tabulate their collection via an *n* × *p* design matrix *X*. Statistically, these features are variables that we believe will help accurately predict the outcome. In the case of our research on vesicle biology, features may be biophysical (e.g., vesicle diameter and volume), genomic (e.g., sequence data), proteomic, or lipidomic measurements. Following previous work, we may specify a (Bayesian) NN by assuming some hierarchical architecture to "learn" the predicted response for each observation in the data [96].
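
To make this setup concrete, here is a minimal sketch (assuming NumPy; the feature names and values are hypothetical placeholders, not measurements from our study) of a small response vector *y* and design matrix *X* in this form.

```python
import numpy as np

# Hypothetical example: n = 4 vesicle samples, p = 3 biophysical features.
# Columns: diameter (nm), volume (nm^3), diffusion coefficient (um^2/s).
X = np.array([
    [110.0, 6.97e5, 4.1],
    [ 95.0, 4.49e5, 4.8],
    [140.0, 1.44e6, 3.2],
    [102.0, 5.56e5, 4.5],
])

# Binary response: y_i = 1 if the i-th sample is disease-derived, 0 otherwise.
y = np.array([0, 1, 1, 0])

n, p = X.shape  # n individuals (rows) by p features (columns)
print(f"Design matrix X is {n} x {p}; response y has length {len(y)}")
```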

### **Figure 1.**

*Basic "decision tree" schematic for deciding between different statistical and machine learning methods. Here, approaches are grouped by the applications for which they were designed.*

$$
\hat{y} = \sigma[f] \tag{1}
$$

$$
f = H(\Theta)w + b \tag{2}
$$

$$
w \sim \pi \tag{3}
$$

These sets of equations reformulate a general NN as a probabilistic hierarchical statistical model. In Eq. (1), ŷ is an *n*-dimensional vector of predicted values, *f* is an *n*-dimensional vector of continuous unbounded values that need to be estimated, and σ[∙] is a link function that relates *f* to the mean of the (assumed) distribution of *y*. Note that the link function can be flexibly changed depending on the goals of the research. For example, in the case of regression problems with continuous outcomes, the link function is set to the identity, while for classification-based applications with binary data, we may use a sigmoid function, which transforms the systematic part of the model to lie between 0 and 1. If one is faced with a multiclass problem, then σ[∙] can be redefined as a softmax function.
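
As a small illustration of how the link function σ[∙] changes with the modeling goal, the sketch below (assuming NumPy) implements the identity, sigmoid, and softmax maps described above.

```python
import numpy as np

def identity(f):
    """Regression: the mean of a continuous outcome is modeled directly."""
    return f

def sigmoid(f):
    """Binary classification: squashes each entry of f into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

def softmax(F):
    """Multiclass problems: each row of F (one row per observation) is
    mapped to a vector of class probabilities that sums to 1."""
    F = F - F.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    expF = np.exp(F)
    return expF / expF.sum(axis=1, keepdims=True)

f = np.array([-1.2, 0.3, 2.0])
print(identity(f))                            # predicted means (regression)
print(sigmoid(f))                             # predicted probabilities (binary)
print(softmax(np.array([[1.0, 2.0, 0.5]])))   # class probabilities (multiclass)
```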

In Eq. (2), we use *H*(Θ) to denote an *n* × *k* matrix of activations from the penultimate layer (which are fixed given a set of inputs and point estimates Θ from previous layers), *w* is a *k*-dimensional vector of weights at the output layer that is assumed to follow some prior distribution *π* (see Eq. (3)), and *b* is an *n*-dimensional vector of biases that is produced during the training phase.

Under this formulation, notice that we may divide an arbitrary neural network into three components (see the middle panel in **Figure 2**): (i) an input layer of the *p* features in the design matrix *X* (red nodes), (ii) a set of hidden layers where parameters are deterministically computed based on a series of activations and point estimates (blue nodes), and (iii) a penultimate layer where the weights are treated as random variables (green nodes). This structure is also highly generalizable: hidden layers can take on any form, provided that the additional structure can be represented via some linear combination of activations, weights, and biases.
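
The following toy sketch (NumPy only; the dimensions, the ReLU hidden layer, and the standard normal prior π are illustrative assumptions rather than details from the chapter) walks through Eqs. (1)-(3): fixed activations *H*(Θ), random output-layer weights *w* ~ π, a bias vector *b*, and a sigmoid link producing ŷ.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 100, 5, 8            # observations, input features, penultimate units
X = rng.normal(size=(n, p))    # toy design matrix (input layer, "red nodes")

# Hidden layer with fixed point estimates Theta ("blue nodes"): H = relu(X @ Theta).
Theta = rng.normal(size=(p, k))
H = np.maximum(X @ Theta, 0.0)              # n x k matrix of activations H(Theta)

# Output-layer weights treated as random variables ("green nodes"): w ~ pi.
# Here pi is an illustrative standard normal prior.
w = rng.normal(loc=0.0, scale=1.0, size=k)  # Eq. (3)

b = np.zeros(n)                             # bias vector learned during training
f = H @ w + b                               # Eq. (2): f = H(Theta) w + b

y_hat = 1.0 / (1.0 + np.exp(-f))            # Eq. (1): sigmoid link for binary outcomes
print(y_hat[:5])                            # predicted class probabilities
```

The prior on *w* is what distinguishes this probabilistic formulation from a standard NN forward pass: repeated draws from π propagate uncertainty in the output-layer weights through to the predictions ŷ.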

### **Figure 2.**

*Our general workflow when using neural networks for prediction in biological datasets. Here, we show how diverse feature types can be transformed/quantified and used for various applications.*


**3.3 Real data applications**


We now demonstrate how machine learning and, more specifically, neural networks can be adopted to positively impact data analysis. Our group looks to characterize the vesicle phenotype of patients at various stages of treatment in various leukemias, such as AML. Here, we utilize a common NN architecture known as a multilayer perceptron [84], where we first train the algorithm on patients with known disease statuses (i.e., *y<sub>i</sub>* = 1 if the *i*th patient has cancer) and then test its ability to accurately classify a set of undiagnosed individuals. We define accuracy here simply as the percentage of correctly classified samples in a testing dataset. For each validation run, a receiver operating characteristic (ROC) curve is drawn and the area under the curve (AUC) is calculated. The AUC is a standard performance metric for classification problems in statistics and may be interpreted as an assessment of how effective an algorithm is at discriminating between two classes (i.e., a healthy versus a disease phenotype) [93]. Higher AUC values (on a scale from 0 to 100%) indicate better model performance. An overall summary of our workflow may be found in **Figure 2**, where we illustrate how different biological features are quantified and fed through a NN to make predictions.
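
A minimal sketch of this train-then-test workflow is given below using scikit-learn's MLPClassifier and roc_auc_score; the simulated features stand in for real vesicle measurements, and the hidden-layer sizes are arbitrary choices rather than the architecture used in our study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulated stand-in for vesicle features; y_i = 1 marks a cancer-derived sample.
n, p = 500, 6
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Multilayer perceptron; hidden-layer sizes here are illustrative only.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(X_train, y_train)

# AUC on held-out samples: how well the model separates the two classes.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {100 * test_auc:.2f}%")
```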

We first trained the algorithm on data collected from a NanoSight Tracking instrument, the NS5000. This allowed us to collect a wide selection of vesicle features including size, area, volume, diffusion coefficients, and total vesicles secreted. These data were collected from two cell populations: (i) a primary hMSC cell line and (ii) a Kasumi AML cell line. We did this in order to first assess whether there is a discernible difference between vesicles derived from "normal" hMSCs and vesicles from the cancerous Kasumi cell line. Within the training set, we were able to classify vesicles with relatively high accuracy: the mean AUC (plus or minus standard deviation) after 10-fold cross-validation was 90.16 ± 9.26%. This translated into a high accuracy in the testing population, with a mean AUC (after 10-fold cross-validation) of 95.97 ± 5.38%. We next tested the algorithm on real patient samples, achieving perfect accuracy in reliably characterizing and classifying healthy tissue. We believe the reason for this high accuracy is that the primary hMSC cell line accurately represents the vesicular phenotype of normal, healthy bone marrow. There is still some work to be done in accurately classifying malignant samples. We believe that the heterogeneity of the leukemic vesicle phenotype cannot trivially be captured through cell line data [94, 95].
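
The mean ± standard deviation AUC values quoted above are obtained from 10-fold cross-validation; a generic sketch of that computation (again with scikit-learn and simulated placeholder data rather than our NanoSight measurements) is shown below.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)

# Placeholder features/labels; in practice X holds vesicle measurements
# (size, area, volume, diffusion coefficients, counts) and y the cell type.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

# Report the cross-validated AUC as mean +/- standard deviation, in percent.
print(f"10-fold CV AUC: {100 * aucs.mean():.2f} +/- {100 * aucs.std():.2f}%")
```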

To address this heterogeneity problem, we then elected to train and test our machine learning algorithm solely on patient samples, in hopes of increasing the predictive performance. We fed the model 35 samples from patients with various hematologic conditions. We trained and tested the model on these 35 samples and were able to achieve a mean training accuracy of 93.76 ± 4.77% and an out-of-sample AUC of 97.33 ± 3.46%. The high testing performance suggests that the algorithm is capable of accurate classification and serves as a general proof of concept of the potential utility of machine learning in this space. Here, this technology has the power to identify complex, heterogeneous patterns that distinguish normal healthy vesicle phenotypes from leukemic vesicle phenotypes.

**4. Conclusion**

EVs are a ubiquitous and dynamic population of cell-specific information. Functionally, they act as a class of membrane-bound cellular communication

