5. Using machine learning to accelerate DNA sequencing and biomarker development

#### 5.1. A supervised machine learning-based approach to DNA sequence analysis

DNA sequencing and sequence analysis are important tasks in many scientific and medical fields and are well known for being both data-rich and computationally intensive. Memeti & Pllana describe an ML-based solution for optimised DNA sequence analysis [48, 49]. Their algorithm leverages the increased performance and parallelisation capabilities of a heterogeneous multi-core computing platform (a host central processing unit (CPU) combined with a 61-core Intel Xeon Phi co-processor).

Memeti & Pllana used the widely known Aho-Corasick (AC) algorithm [50] as the basis for their work, since DNA analysis is a special case of the string matching problem, where the input text is the given DNA sequence and the alphabet consists of characters corresponding to the four nucleotide bases. AC uses a finite automaton (FA), a simple type of formal machine, in the form of a prefix tree with additional links between internal nodes. These links provide fast failure transitions (which, like ε-transitions, consume no input) between branches of the tree that share a common prefix, thus avoiding backtracking. A known drawback of the AC algorithm is that its automaton is non-deterministic: a single input character can trigger a chain of failure transitions, so the amount of work per character varies. Memeti & Pllana solved this by modifying the AC finite automaton to precompute the correct transition for each state and input character, thus eliminating failure transitions and guaranteeing that every character incurs the same number of operations.
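The conversion from failure links to a fully deterministic transition table can be made concrete with a short sketch. The following Python code is an illustrative reconstruction, not the authors' implementation: it builds an Aho-Corasick trie over the four-letter DNA alphabet and then folds the failure links into the transition table, so that matching performs exactly one table lookup per input character.

```python
from collections import deque

ALPHABET = "ACGT"

def build_dfa(patterns):
    """Build a deterministic Aho-Corasick automaton over the DNA alphabet.

    Returns (goto, output): goto[state][ch] is the next state for every
    state and character (no failure transitions remain), and output[state]
    is the set of patterns that end at that state.
    """
    # 1. Build the prefix tree (trie).
    goto = [{}]          # goto[state] maps character -> next state
    output = [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # 2. Breadth-first traversal: compute failure links and immediately
    #    fold them into the transition table, making it fully deterministic.
    fail = [0] * len(goto)
    queue = deque()
    for ch in ALPHABET:
        if ch in goto[0]:
            fail[goto[0][ch]] = 0
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0   # missing transitions at the root loop back
    while queue:
        state = queue.popleft()
        output[state] |= output[fail[state]]
        for ch in ALPHABET:
            if ch in goto[state]:
                nxt = goto[state][ch]
                fail[nxt] = goto[fail[state]][ch]
                queue.append(nxt)
            else:
                # Precompute the transition instead of following failure
                # links at match time: constant work per input character.
                goto[state][ch] = goto[fail[state]][ch]
    return goto, output

def search(text, patterns):
    """Return (position, pattern) pairs for every match in the text."""
    goto, output = build_dfa(patterns)
    hits, state = [], 0
    for i, ch in enumerate(text):
        state = goto[state][ch]   # exactly one transition per character
        for pat in output[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For example, searching the text `GATTACA` for the patterns `ATTA`, `TAC`, and `CA` reports matches at positions 1, 3, and 5, with each character of the text consuming exactly one transition.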

A boosted decision tree regression-based predictor [51] was used to estimate the execution time of DNA sequence analysis for both the host CPU and the Intel Xeon Phi co-processor. The predictor's output was used to partition the DNA sequence based on the S-factor,

$$S = \frac{T_{host}}{T_{device}}, \tag{2}$$

where T<sub>host</sub> and T<sub>device</sub> are the execution times for the host CPU and the co-processor, respectively, and using the partitioning scheme

$$I_{host} = I - I_{device}, \tag{3}$$

$$I_{device} = \frac{I}{S + 1}, \tag{4}$$

where I is the original DNA sequence, I<sub>host</sub> is the part of I analysed by the host CPU, and I<sub>device</sub> is the part of I analysed by the co-processor.
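As a minimal illustration of Eqs. (2)–(4), the partitioning reduces to a few lines of Python (the execution times below are made up for the example, not the authors' measurements):

```python
def partition(total_len, t_host, t_device):
    """Split a sequence of total_len characters between host and device
    using the S-factor scheme: S = t_host / t_device (Eq. 2), with the
    device receiving total_len / (S + 1) characters (Eq. 4) and the host
    the remainder (Eq. 3)."""
    s = t_host / t_device                    # Eq. (2)
    device_len = round(total_len / (s + 1))  # Eq. (4)
    host_len = total_len - device_len        # Eq. (3)
    return host_len, device_len

# Equal predicted times (S = 1) split the sequence evenly:
# partition(1000, 2.0, 2.0) -> (500, 500)
```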

Memeti & Pllana used the "single instruction, multiple data" (SIMD) parallelism [52] of both the host CPU and the Xeon Phi co-processor to achieve teraFLOP (10<sup>12</sup> floating point operations per second) performance. For the experimental evaluation of their deterministic finite automaton (DFA) algorithm, Memeti & Pllana used the reference genomes of the human and 11 different animals from the GenBank sequence database of the National Center for Biotechnology Information (NCBI), with an average dataset size of 2043 MB. In total, data from approximately 4000 experiments was used to train the performance predictor and to evaluate the DFA's performance. The DFA was evaluated using different thread affinity modes (compact, balanced, and scatter) and thread counts for each of the DNA sequences. The compact mode completely fills a single core with threads before assigning the remaining threads to the next core, the balanced mode evenly distributes the threads among the computing cores, and the scatter mode distributes threads among the cores in a round-robin sequence.
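The three affinity modes can be modelled as simple thread-to-core mappings. The sketch below is a simplified illustration of the placement policies only (it ignores the hardware details of the Xeon Phi and the actual runtime's affinity interface):

```python
def assign_threads(n_threads, n_cores, hw_threads_per_core, mode):
    """Return a list core[i] giving the core assigned to thread i under a
    simplified model of the three affinity modes described above."""
    if mode == "compact":
        # Fill each core's hardware thread slots before moving to the next.
        return [i // hw_threads_per_core for i in range(n_threads)]
    if mode == "scatter":
        # Round-robin across cores.
        return [i % n_cores for i in range(n_threads)]
    if mode == "balanced":
        # Spread threads evenly while keeping consecutive thread ids
        # on the same or adjacent cores.
        per_core = n_threads / n_cores
        return [min(int(i / per_core), n_cores - 1) for i in range(n_threads)]
    raise ValueError(f"unknown affinity mode: {mode}")

# 8 threads on 4 cores (4 hardware threads each):
#   compact  -> [0, 0, 0, 0, 1, 1, 1, 1]
#   scatter  -> [0, 1, 2, 3, 0, 1, 2, 3]
#   balanced -> [0, 0, 1, 1, 2, 2, 3, 3]
```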

Memeti & Pllana discovered that the balanced thread affinity mode is overall the fastest for all of the tested DNA sequences, with the scatter mode second best. The evaluation of the DFA with varying thread counts showed that the algorithm scales well up to approximately 120 threads, whereas in the 180–240 thread range the performance improvement becomes modest due to overhead from thread management operations. Performance-wise, Memeti & Pllana found that the parallel version of the DFA running on a heterogeneous platform achieves a speed-up of 35.6× to 206.6× compared to a sequential (single-thread) version running on the host CPU, with the exact speed-up depending on the given host CPU. Memeti & Pllana intend to use this work to study and develop highly parallel DNA analysis solutions on more powerful hardware in the future.

The accuracy of TEP-based multi-class cancer classification on the training dataset (N = 175) was estimated using an SVM algorithm. To cross-validate the SVM over the entire sample set, the leave-one-out cross-validation (LOOCV) method was applied: the algorithm was run 175 times in order to classify and cross-validate the entire dataset, and the percentage of correct predictions was reported as the accuracy score. To determine specific input gene lists for the algorithm, Wurdinger's team performed ANOVA testing. They selected a set of 1072 mRNAs to use with the training dataset, yielding a final accuracy of 96% and an ROC value of 0.986. From the patient cohort, all 39 patients with localised tumours and 33 of the 39 patients with primary tumours in the CNS were classified as cancer patients.

Wurdinger's team concluded that using the SVM classifier with TEP-based data produces high-accuracy, high-specificity models for liquid biopsy-based diagnostics for several common cancer types. They expect that using more advanced ML algorithms capable of self-learning could further improve the performance of these diagnostic models. They also suggest evaluating systemic factors such as inflammatory diseases and other non-cancerous diseases as potential factors that can influence the mRNA profile.

#### 6. Future prospects

While the ML models currently used in liquid biopsy analysis in particular, and in biological and medical research in general (typically various classes of neural networks and linear classifiers), appear both to produce accurate results and to show generally high performance, they represent only a narrow subset of machine learning and artificial intelligence solutions [5]. For instance, a potentially valuable research direction might be highly advanced probabilistic graphical models [53] augmented with functionality such as one-shot learning [54] and probabilistic program synthesis [55], which could allow researchers to reduce the size of the commonly massive training datasets required for creating ANN- or DL-based models.

Furthermore, with a single exception, all of the studies reviewed here have focused on the performance and accuracy of software ML models, which are currently the predominant class of machine learning solutions. However, recent advances in general-purpose computation using both graphics processing units (GPUs) and specialised application-specific integrated circuits (ASICs) tailor-made for machine learning [56] provide a strong case for the exploration and exploitation of hardware or hybrid ML solutions, as evidenced by, e.g., the results from the AlphaGo experiments and public performance [57].

Author details

Arets Paeglis<sup>1</sup>\*, Boriss Strumfs<sup>2</sup>, Dzeina Mezale<sup>1</sup> and Ilze Fridrihsone<sup>1</sup>

\*Address all correspondence to: arets.paeglis@protonmail.com

1 Department of Pathology, Riga Stradins University, Riga, Latvia

2 Latvian Institute of Organic Synthesis, Riga, Latvia

A Review on Machine Learning and Deep Learning Techniques Applied to Liquid Biopsy http://dx.doi.org/10.5772/intechopen.79404

References

[1] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review. 1958;65(6):386-408. DOI: 10.1037/h0042519

[2] Stormo GD, Schneider TD, Gold L, et al. Use of the 'perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research. 1982;10(9):2997-3011. DOI: 10.1093/nar/10.9.2997

[3] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. May 2015;521(7553):436-444. DOI: 10.1038/nature14539

[4] Goodfellow I, Bengio Y, Courville A. Deep Learning. USA: MIT Press; 2016. URL: http://www.deeplearningbook.org

[5] Russell S, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press; 2009. ISBN: 0136042597; 9780136042594

[6] Estivill-Castro V. Why so many clustering algorithms. ACM SIGKDD Explorations Newsletter. 2002;4(1):65-75. DOI: 10.1145/568574.568575

[7] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Berkeley, California: University of California Press; 1967. pp. 281-297

[8] Kriegel HP, Kröger P, Sander J, et al. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011;1(3):231-240. DOI: 10.1002/widm.30

[9] Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks. 2015;61:85-117. DOI: 10.1016/j.neunet.2014.09.003

[10] Deng L. Deep learning: Methods and applications. Foundations and Trends® in Signal Processing. 2014;7(3–4):197-387. DOI: 10.1561/2000000039

[11] Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Networks. 1991;4(2):251-257. DOI: 10.1016/0893-6080(91)90009-t

[12] Ghasemi F, Mehridehnavi A, Fassihi A, et al. Deep neural network in QSAR studies using deep belief network. Applied Soft Computing. 2018;62:251-258. DOI: 10.1016/j.asoc.2017.09.040
