*Innovations in Cell Research and Therapy*

ŷ = σ[*f*] (1)

*f* = *H*(θ)*w* + *b* (2)

*w* ∼ *π* (3)

In Eq. (1), ŷ is an *n*-dimensional vector of predicted values, *f* is an *n*-dimensional vector of continuous unbounded values that need to be estimated, and σ[∙] is a link function that relates *f* to the mean of the (assumed) distribution of *y*. Note that the link function can be flexibly changed depending on the goals of the research. For example, in the case of regression problems with continuous outcomes, the link function is set to the identity; while for classification-based applications with binary data, we may use a sigmoid function, which transforms the systematic part of the model to lie between 0 and 1. If one is faced with a multiclass problem, then σ[∙] can be redefined as a softmax function.

In Eq. (2), we use *H*(θ) to denote an *n* × *k* matrix of activations from the penultimate layer (which are fixed given a set of inputs and point estimates θ from previous layers), *w* is a *k*-dimensional vector of weights at the output layer that is assumed to follow some prior distribution *π* (see Eq. (3)), and *b* is an *n*-dimensional vector of biases that is produced during the training phase. Together, these equations reformulate a general NN as a probabilistic hierarchical statistical model.

Under this formulation, notice that we may divide arbitrary neural networks into three components (see the middle panel in **Figure 2**): (i) an input layer of the *p* features in the design matrix *X* (red nodes), (ii) a set of hidden layers where parameters are deterministically computed based on a series of activations and point estimates (blue nodes), and (iii) a penultimate layer where the weights are treated as random variables (green nodes). This structure is also highly generalizable: hidden layers can take on any form, provided that the additional structure can be represented via some linear combination of activations, weights, and biases.

**3.3 Real data applications**

**Figure 2.**
*Our general workflow when using neural networks for prediction purposes in biological datasets. Here, we show how diverse feature types can be transformed/quantified and used for various applications.*
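To make the hierarchical formulation in Eqs. (1)–(3) concrete, the probabilistic output layer can be sketched in a few lines of NumPy. The dimensions and the standard normal choice for the prior *π* below are illustrative assumptions only, not values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 5, 3                      # n samples, k penultimate-layer units (illustrative)
H = rng.normal(size=(n, k))      # H(theta): fixed activations from the penultimate layer
b = np.zeros(n)                  # bias vector b in Eq. (2)

# Eq. (3): the output-layer weights w follow a prior pi;
# a standard normal prior is assumed here purely for illustration.
w = rng.normal(size=k)

# Eq. (2): systematic component f = H(theta) w + b
f = H @ w + b

# Eq. (1): y_hat = sigma[f]; for binary outcomes sigma is the sigmoid link,
# which maps each entry of f to a value between 0 and 1.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
y_hat = sigmoid(f)
```

Swapping the sigmoid for an identity function recovers the regression case, and a softmax over multiple output columns recovers the multiclass case.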

We now demonstrate how machine learning and, more specifically, neural networks can be adopted to positively impact data analysis. Our group looks to characterize the vesicle phenotype of patients at various stages of treatment in various leukemias, such as AML. Here, we utilize a common NN architecture known as a Multilayer Perceptron [84], where we first train the algorithm on patients with known disease statuses (i.e., *y_i* = 1 if the *i*th patient has cancer) and then test its ability to accurately classify a set of undiagnosed individuals. We define accuracy here simply as the percentage of correctly classified samples in a testing dataset. For each validation run, a Receiver Operating Characteristic (ROC) curve is drawn and the area under the curve (AUC) is calculated. The AUC is a standard performance metric for classification problems in statistics and may be interpreted as an assessment of how effective an algorithm is at discriminating between two classes (i.e., a healthy versus a disease phenotype) [93]. Higher AUC values (on a scale from 0 to 100%) indicate better model performance. An overall summary of our workflow may be found in **Figure 2**, where we illustrate how different biological features are quantified and fed through a NN to make predictions.
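The ROC/AUC evaluation described above can be sketched with scikit-learn. The labels and classifier scores below are entirely made up for illustration; they are not the patient data discussed in this section:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary disease labels (y_i = 1 for cancer) and the
# corresponding classifier scores -- illustrative values only.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.65])

# Points of the ROC curve (false positive rate vs. true positive rate)
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under that curve; 1.0 corresponds to perfect class separation
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.2%}")
```

Because every diseased sample here happens to score higher than every healthy one, this toy example yields an AUC of 100%; real data will of course fall below that.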

We first trained the algorithm on data collected from a NanoSight Tracking instrument, the NS5000. This allowed us to collect a wide selection of vesicle features including size, area, volume, diffusion coefficients, and total vesicles secreted. These data were collected from two cell type populations: (i) a primary hMSC cell line and (ii) a Kasumi AML cell line. Our aim was first to assess whether there is a discernible difference between vesicles derived from "normal" hMSCs and vesicles from the cancerous Kasumi cell line. Within the training set, we were able to classify vesicles with relatively high accuracy: the mean AUC (plus or minus standard deviation) after 10-fold cross validation was 90.16 ± 9.26%. This translated into high accuracy in the testing population, with a mean AUC (after 10-fold cross validation) of 95.97 ± 5.38%. We next tested the algorithm on real patient samples, achieving perfect accuracy in characterizing and classifying healthy tissue. We believe the reason for this high accuracy is that the primary hMSC cell line accurately represents the vesicular phenotype of normal, healthy bone marrow. There is still some work to be done in accurately classifying malignant samples; we believe that the heterogeneity of the leukemic vesicle phenotype cannot trivially be captured through cell line data [94, 95].
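A 10-fold cross-validation run of the kind reported above (mean ± standard deviation of the AUC) can be sketched as follows. The synthetic features stand in for the vesicle measurements (size, area, volume, etc.); the model size, scaling step, and dataset are assumptions for illustration, not our actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for vesicle feature data from two cell populations
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A small Multilayer Perceptron; standardizing features first helps training
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)

# 10-fold stratified cross validation, scored by AUC on each held-out fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC = {aucs.mean():.2%} +/- {aucs.std():.2%}")
```

Reporting the fold-to-fold standard deviation alongside the mean, as we do in the text, gives a sense of how stable the classifier's discrimination is across resamplings.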

To address this heterogeneity problem, we then elected to train and test our machine learning algorithm solely on patient samples, in hopes of increasing the predictive performance. We fed the model 35 samples from patients with various hematologic conditions. Training and testing the model on these 35 samples, we were able to achieve a mean training accuracy of 93.76 ± 4.77% and an out-of-sample AUC of 97.33 ± 3.46%. The high testing performance suggests that the algorithm is capable of accurate classification and serves as a general proof-of-concept of the potential utility of machine learning in this space. Here, this technology has the power to identify complex, heterogeneous patterns that distinguish normal healthy vesicle phenotypes from leukemic vesicle phenotypes.
