## 4. Cancer detection and monitoring using neural network-based methods

### 4.1. Using artificial neural networks for lung cancer detection and diagnosis

The Goryński team describes [36] a class of artificial neural network (ANN)-based models for early detection and diagnosis of lung cancer. Their dataset comprised a wide range of biochemical parameters obtained from blood samples, together with results from medical interviews (48 values in total), from 193 patients of mixed age and sex. It was used to train a family of 10 multilayer perceptron (MLP) [3, 4] architectures that varied in the activation functions (linear, logistic, and tanh) of the hidden and output layers, in the number of processing units ("neurons") in the hidden layer, and in the training algorithm (gradient descent, Broyden-Fletcher-Goldfarb-Shanno (BFGS) [37], and scaled conjugate gradient (SCG) [38]).
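To make the architecture family concrete, the following minimal sketch runs a forward pass through a single-hidden-layer perceptron with selectable hidden and output activations. It is an illustration only: the weights are random placeholders, the helper names (`make_mlp`, `dense`, `forward`, `count_parameters`) are hypothetical, the hidden-layer size of 9 is an illustrative choice, and no training algorithm (gradient descent, BFGS, or SCG) is implemented.

```python
import math
import random

# Activation functions named in the text (linear, logistic, tanh).
ACTIVATIONS = {
    "linear": lambda z: z,
    "logistic": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
}


def make_mlp(n_in, n_hidden, n_out, seed=0):
    """Create random placeholder weights for an n_in-n_hidden-n_out MLP."""
    rng = random.Random(seed)

    def layer(rows, cols):
        # One weight row per unit, plus one bias per unit.
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        biases = [0.0] * rows
        return weights, biases

    return {"hidden": layer(n_hidden, n_in), "output": layer(n_out, n_hidden)}


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer followed by an activation."""
    act = ACTIVATIONS[activation]
    return [
        act(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(mlp, inputs, hidden_act, output_act):
    """Propagate one input vector through the hidden and output layers."""
    hidden = dense(inputs, *mlp["hidden"], hidden_act)
    return dense(hidden, *mlp["output"], output_act)


def count_parameters(n_in, n_hidden, n_out):
    """Number of weights plus biases in a single-hidden-layer MLP."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)


# Hypothetical patient vector with 48 input values, matching the dataset width.
rng = random.Random(1)
patient = [rng.random() for _ in range(48)]

mlp = make_mlp(48, 9, 2)
scores = forward(mlp, patient, hidden_act="linear", output_act="tanh")
print(scores, count_parameters(48, 9, 2))
```

Swapping the `hidden_act` and `output_act` arguments, or the hidden-layer size, yields other members of the same family; comparing such variants after training is the kind of search the study describes.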

The Goryński team found that two of the trained models, named MLP 48-9-2<sup>2</sup> (trained using the BFGS algorithm, with linear and tanh activation functions for the hidden and output layers, respectively) and MLP 48-15-2 (SCG algorithm, logistic and tanh activation functions), gave highly accurate results in inferring the presence or absence of lung cancer from the given set of variables, with the ROC value reaching 99.83%.

<sup>2</sup> The naming scheme represents the number of "neurons" in the input, hidden, and output layers of the MLP model, respectively.
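Assuming the reported "ROC value" refers to the area under the ROC curve (AUC), the sketch below shows one way such a figure can be computed from labels and model scores. The function name `roc_auc` and the example data are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    The AUC equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = cancer, 0 = no cancer) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

An AUC of 1.0 means every positive case is ranked above every negative case, while 0.5 corresponds to random ranking.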


A Review on Machine Learning and Deep Learning Techniques Applied to Liquid Biopsy

http://dx.doi.org/10.5772/intechopen.79404


The Goryński team concluded that these relatively simple ANN solutions, while not viable as a full substitute for expert opinion, are nonetheless efficient in early diagnosis and risk prognosis of lung cancer, and are therefore promising as improvements over and additions to the existing inventory of diagnostic and prognostic methods.

### 4.2. Mutation prediction and early lung cancer detection in liquid biopsy using convolutional neural networks

The proliferation of cancer cells is driven by specific somatic mutations in the cancer genome [39]. To fulfil the high expectations associated with liquid biopsy, such as comprehensive characterisation of the whole tumour (in contrast to the limited sampling of traditional tissue biopsy) or dynamic assessment during treatment, somatic mutations must be detected with high sensitivity and accuracy; limited coverage depth is not sufficient. The Kothen-Hill team has demonstrated a CNN-based classifier system named "Kittyhawk" [40] that enables the detection of cancer-related mutations even at extremely low variant allele frequencies (VAFs), more than 2 orders of magnitude lower than is possible with currently available methods.

For the training dataset, whole genome sequencing (WGS) data from 4 non-small cell lung cancer (NSCLC) patients and 3 melanoma patients were used, with $> 1.2 \times 10^{7}$ reads in total. To ensure adequate genetic context even when a variant appears at the end of a read, additional bases were added to both ends of each read. Additional bases were also added to ensure equal read length in cases where a read was shorter than 150 bp.
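The read preparation described above can be sketched as follows. The source does not state which bases were added or how many, so the placeholder symbol `N`, the flank width of 2, and the function name `pad_read` are all assumptions made for illustration.

```python
READ_LENGTH = 150  # target read length in bp, as stated in the text
FLANK = 2          # hypothetical number of extra bases added per side
PAD = "N"          # assumed placeholder symbol for an added base


def pad_read(read, read_length=READ_LENGTH, flank=FLANK):
    """Pad a short read to read_length, then add flanking bases on both ends."""
    if len(read) > read_length:
        raise ValueError("read is longer than the target read length")
    # Equalise the read length first, then add context on both sides.
    body = read.upper() + PAD * (read_length - len(read))
    return PAD * flank + body + PAD * flank


short_read = "ACGT" * 30  # a 120 bp read, shorter than 150 bp
padded = pad_read(short_read)
print(len(padded))
```

With these assumptions every read reaches the classifier at the same length (150 bp plus two flanks), so a variant in the first or last position still has neighbouring positions on both sides.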

The Kothen-Hill team chose an 8-layer CNN with a single fully connected output layer, similar to the VGG<sup>3</sup> architecture [41]. A receptive field of size 3 was used to convolve the features, based on the results of [42], which showed that the tri-nucleotide context contains distinct mutagenesis-related signatures. After 2 successive convolutional layers, downsampling by max-pooling with a receptive field of 2 and a stride of 2 was applied, forcing the model to retain only the highest-importance features, as per [43]. The output of the last convolutional layer was connected directly to a fully connected sigmoid output layer for final classification. A logistic regression layer was used to retain the features associated with the position of the read.
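The building blocks named above (two successive convolutions with a receptive field of 3, followed by max-pooling with size 2 and stride 2) can be sketched in plain Python. This is a single-channel toy with "valid" padding and hand-picked kernels, all of which are assumptions; it is not the Kittyhawk implementation, which operates on encoded reads with learned, multi-channel filters.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as commonly used in CNNs)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2, stride=2):
    """Keep the maximum of each window, halving the length for size=stride=2."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


# Hypothetical single-channel input and two hand-picked kernels of width 3.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, 1.0, 4.0, 1.0, 0.0, 2.0]
kernel_a = [1.0, 0.0, -1.0]
kernel_b = [0.5, 0.5, 0.5]

x = relu(conv1d(signal, kernel_a))  # first convolutional layer
x = relu(conv1d(x, kernel_b))       # second convolutional layer
pooled = max_pool1d(x)              # downsampling after two convolutions
print(pooled)
```

Each width-3 convolution shortens the sequence by 2 positions under valid padding, and the pooling step then halves it, so only the strongest local responses are passed on to deeper layers.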

The model was trained using minibatch stochastic gradient descent (SGD) with a batch size of 256, an initial learning rate of 0.1, and a momentum of 0.9, with batch normalisation [23] and a rectified linear unit (ReLU) [44] applied after each convolutional layer.
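The update rule behind these hyperparameters can be illustrated on a much smaller problem. The sketch below fits a one-dimensional linear model with minibatch SGD and classical momentum, reusing the batch size, learning rate, and momentum quoted above; the linear model, the synthetic data, and the function name `sgd_momentum_fit` are assumptions, and batch normalisation and ReLU are omitted.

```python
import random


def sgd_momentum_fit(xs, ys, batch_size=256, lr=0.1, momentum=0.9,
                     epochs=200, seed=0):
    """Fit y = w * x + b by minibatch SGD with classical momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0  # velocity terms, one per parameter
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new minibatch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the minibatch.
            gw = gb = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                gw += 2.0 * err * xs[i]
                gb += 2.0 * err
            gw /= len(batch)
            gb /= len(batch)
            # Momentum update: the velocity accumulates past gradients.
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b


# Synthetic, noise-free data generated from y = 2x + 1.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1024)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_momentum_fit(xs, ys)
print(round(w, 4), round(b, 4))
```

The momentum term smooths the noisy per-batch gradients, which is one reason a comparatively large initial learning rate such as 0.1 can remain stable.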

The Kothen-Hill team presents the Kittyhawk architecture as the first of its specific kind, able to avoid the information loss associated with similar earlier architectures. To evaluate the performance of the model, a test dataset of $> 2 \times 10^{5}$ reads, split off from the training set of reads from the 4 NSCLC patients, was used. The model achieved an F1 score of 0.961 on this test dataset and 0.92 on data from an additional independent NSCLC case. When further tested against data from a melanoma case, it achieved an F1 score of 0.71, indicating that the model had learned specific mutation patterns associated with NSCLC, as well as a more general pattern associated with both NSCLC and melanoma.

<sup>3</sup> A CNN architecture developed by the Visual Geometry Group at the University of Oxford.
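For reference, the F1 score used in this evaluation is the harmonic mean of precision and recall. A minimal sketch follows, with hypothetical labels (1 = true mutation, 0 = artefact) that are not taken from the study.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical read-level labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
f1 = f1_score(y_true, y_pred)
print(f1)
```

Because the F1 score ignores true negatives, it is well suited to settings such as low-VAF mutation calling, where genuine variants are vastly outnumbered by sequencing artefacts.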

The Kothen-Hill team presents the Kittyhawk CNN model as the first ML architecture designed specifically for detecting cancer-related mutations in a low allele frequency environment such as liquid biopsy, and suggests that it might serve as the foundation for novel early-stage cancer detection techniques that could be used for both screening and prognosis.

### 4.3. Machine learning and nanofluidics in pancreatic cancer diagnosis


56 Liquid Biopsy


The Issadore team has developed an ML-based platform [45] for isolating exosomes from liquid biopsy samples and using the RNA inside these exosomes to diagnose pancreatic cancer in human and murine cohorts.

Using the Exosome Track-Etched Magnetic Nanopore (ExoTENPO) nanofluidics chip developed as part of the study, the Issadore team successfully isolated exosomes from cell cultures, as well as from human and mouse liquid biopsy (blood plasma) samples. Exosomal mRNA was subsequently extracted and used to develop a predictive panel for pancreatic cancer biomarkers.

Training datasets of 15 mouse and 10 patient profiles, respectively, were created. Linear discriminant analysis (LDA) [46] was used to identify combinations of mRNA profiles that discriminated between healthy and tumour-bearing samples. The prediction algorithm was generated by running LDA on the training set, which produced a vector of weights; the weighted sum computed with this vector maximally separates the control group from the tumour-bearing group. Two independent blinded test sets, mouse ($N = 18$) and patient ($N = 34$), were used to evaluate the performance of the LDA classifier. Fisher's exact test was used to quantify the predictive value of the classifier, yielding $P < 0.001$.
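The two steps described above, deriving an LDA weight vector from a training set and then applying Fisher's exact test to the predictions on a held-out set, can be sketched for a two-marker case. All data below are hypothetical toy values, the function names are illustrative, and the restriction to two features (which allows a closed-form 2 x 2 matrix inverse) is an assumption made for brevity.

```python
from math import comb


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def lda_direction(class0, class1):
    """Two-feature Fisher LDA: w = Sw^-1 (mu1 - mu0), threshold at the midpoint."""
    mu0, mu1 = mean(class0), mean(class1)
    # Pooled within-class scatter matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group, mu in ((class0, mu0), (class1, mu1)):
        for v in group:
            d = [v[0] - mu[0], v[1] - mu[1]]
            for r in range(2):
                for c in range(2):
                    s[r][c] += d[r] * d[c]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [
        (s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
        (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det,
    ]
    midpoint = [(mu0[k] + mu1[k]) / 2.0 for k in range(2)]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def predict(sample, w, threshold):
    """Weighted sum of the markers; 1 = tumour-bearing, 0 = healthy."""
    return 1 if w[0] * sample[0] + w[1] * sample[1] > threshold else 0


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= observed * (1 + 1e-9))


# Hypothetical training profiles: two mRNA marker levels per sample.
healthy = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
tumour = [(4.0, 5.0), (5.0, 4.0), (4.0, 4.0), (5.0, 5.0)]
w, threshold = lda_direction(healthy, tumour)

# Hypothetical blinded test set: (profile, true label).
test_set = [
    ((4.0, 5.0), 1), ((5.0, 5.0), 1), ((4.5, 4.0), 1), ((2.0, 3.0), 1),
    ((1.0, 1.0), 0), ((2.0, 1.0), 0), ((1.0, 2.0), 0), ((4.0, 3.0), 0),
]
# Rows: true tumour / true healthy; columns: predicted tumour / predicted healthy.
table = [[0, 0], [0, 0]]
for sample, label in test_set:
    table[1 - label][1 - predict(sample, w, threshold)] += 1

p_value = fisher_exact_two_sided(table[0][0], table[0][1], table[1][0], table[1][1])
print(w, threshold, table, round(p_value, 4))
```

The toy test set is deliberately small and contains two misclassifications, so its p-value is far from significant; it only shows how the contingency table of true versus predicted labels feeds the test.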

Although the Issadore team focused primarily on the development and evaluation of the ExoTENPO nanofluidics platform, they conclude that even very simple ML algorithms such as LDA can produce good-quality predictive models for classifying biochemical and genetic markers, and note that more advanced ML solutions could be used in future research to further improve performance.

### 4.4. Machine learning-based RNA sequencing for multi-class cancer diagnostics

The Wurdinger team demonstrated an ML-based approach to sequencing and analysis of mRNAs obtained from tumour-educated platelets (TEPs) [47] as a tool for accurate tumour diagnosis, both within a single class and across six different tumour classes.

The initial dataset consisted of blood platelet samples from healthy donors ($N = 55$) and both treated and untreated patients with six different tumour types (NSCLC, colorectal cancer, glioblastoma, pancreatic cancer, hepatobiliary cancer, and breast cancer) in various stages of advancement and metastasis ($N = 228$). After the mRNA extraction, amplification, and sequencing, a set of approximately 5000 different mRNAs was selected for further analysis.

The accuracy of TEP-based multi-class cancer classification in the training dataset ($N = 175$) was estimated using an SVM algorithm. To cross-validate the SVM on the entire sample set, the leave-one-out cross-validation (LOOCV) method was applied: the algorithm was run 175 times, each time holding out one sample, so that every sample in the dataset was classified once, and the percentage of correct predictions was reported as the accuracy score. To determine the specific input gene lists for the algorithm, the Wurdinger team performed ANOVA testing. They selected a set of 1072 mRNAs to use with the training dataset, yielding a final accuracy of 96% and a ROC value of 0.986. From the patient cohort, all 39 patients with localised tumours and 33 of the 39 patients with primary tumours in the CNS were classified as cancer patients.
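The LOOCV procedure can be sketched independently of the classifier. To keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM used in the study; this substitution, the toy data, and all function names are assumptions made for illustration.

```python
def nearest_centroid_fit(samples, labels):
    """Stand-in classifier (not an SVM): store the mean vector of each class."""
    centroids = {}
    for label in set(labels):
        members = [s for s, y in zip(samples, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def nearest_centroid_predict(centroids, sample):
    """Assign the label of the closest class centroid (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loocv_accuracy(samples, labels, fit, predict):
    """Train on all samples but one, test on the held-out one, repeat for each sample."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical two-marker profiles; label 0 = healthy, 1 = cancer.
samples = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),              # healthy
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (1.5, 1.5),  # cancer
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = loocv_accuracy(samples, labels, nearest_centroid_fit, nearest_centroid_predict)
print(f"LOOCV accuracy: {accuracy:.3f}")
```

With 9 samples the model is refitted 9 times, just as the 175-sample dataset in the study required 175 runs. The last cancer sample lies among the healthy profiles and is misclassified whenever it is held out, so the toy accuracy is 8/9.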

The Wurdinger team concluded that using the SVM classifier with TEP-based data produces high-accuracy, high-specificity models for liquid biopsy-based diagnostics for several common cancer types. They expect that using more advanced ML algorithms capable of self-learning could further improve the performance of these diagnostic models. They also suggest evaluating systemic factors such as inflammatory diseases and other non-cancerous diseases as potential factors that can influence the mRNA profile.

$$S = \frac{T_{\mathrm{host}}}{T_{\mathrm{device}}}, \tag{2}$$

where $T_{\mathrm{host}}$ and $T_{\mathrm{device}}$ are the execution times for the host CPU and the co-processor, respectively, and using the partitioning scheme

$$I_{\mathrm{host}} = I - I_{\mathrm{device}} \tag{3}$$

$$I_{\mathrm{device}} = \frac{I}{S + 1}, \tag{4}$$

where $I$ is the original DNA sequence, $I_{\mathrm{host}}$ is the part of $I$ analysed by the host CPU, and $I_{\mathrm{device}}$ is the part of $I$ analysed by the co-processor.

Memeti & Pllana used the "single instruction, multiple data" (SIMD) parallelism [52] of both the host CPU and the Xeon Phi co-processor to achieve teraFLOP ($10^{12}$ floating point operations per second) performance. For experimental evaluation of their deterministic finite automata (DFA) algorithm, Memeti & Pllana used reference genomes of human and 11 different animals from the GenBank sequence database of the National Center for Biotechnology Information, with an average dataset size of 2043 MB. In total, data from approximately 4000 experiments were used to train the performance predictor and to evaluate the DFA performance. The DFA performance was evaluated using different thread affinity modes (compact, balanced, and scatter) and numbers of threads for each of the DNA sequences. The balanced thread affinity mode evenly distributes the threads among the computing cores, the compact mode completely fills a single core with threads before assigning the remaining threads to the next core, while the scatter mode distributes threads among the cores in a round-robin sequence.

Memeti & Pllana found that the balanced thread affinity mode was the fastest overall for all of the tested DNA sequences, with the scatter mode second. The evaluation of DFA with varying thread counts showed that the algorithm scales well up to approximately 120 threads, whereas in the 180–240 thread range the performance improvement becomes modest owing to overhead from thread management operations. In terms of performance, Memeti & Pllana found that the parallel version of DFA running on a heterogeneous platform achieves a speed-up of $35.6\times$ to $206.6\times$ compared to a sequential (single-thread) version running on the host CPU, with the exact speed-up depending on the given host CPU. Memeti & Pllana intend to use this work to study and develop highly parallel DNA analysis solutions on more powerful hardware in the future.

## 6. Future prospects

While the ML models currently used in liquid biopsy analysis in particular, and in biological and medical research in general (typically different classes of neural networks and linear classifiers), appear both to produce accurate results and to show generally high performance, they represent only a narrow subset of machine learning and artificial intelligence solutions [5]. For instance,
