**2.2 Drug design for malaria using deep learning and other in silico techniques**

Over the last few years, several reports have been published on drug design for malaria using machine learning or other computational methods. Arash et al. [9] devised a deep learning-based technique (DeepMalaria) that uses a graph-based model and SMILES to predict anti-malarial inhibitory compounds. A few studies describe in silico approaches to analysing anti-plasmodium compounds. Monika Samant et al. [10] developed a protein–protein interaction network of *Plasmodium falciparum* and its human host by integrating experimental data with computational predictions of interactions obtained using the interolog method. Manila et al. [11] studied inhibitors of plasmepsin II, an aspartic protease encoded by the malarial parasite that is essential for host haemoglobin degradation. They studied the structure of the target protein and searched for a suitable molecule with a high binding affinity towards it. Although there are a few studies on in silico drug target identification or drug design for malaria, drugs designed by computational methods have not yet been successfully implemented in practice. One of the main reasons is that these computational methods optimize only physical or chemical properties and cannot predict whether the designed drug is biologically relevant for malaria. In the next sections, we provide a road map for harnessing recent developments in genomics and AI techniques that can be used in drug design programs (**Table 1**).

*Note: These are the most significant papers for in silico malaria research, and DeepMalaria is the only paper in which the authors used a deep neural network for malaria drug design.*

#### **Table 1.**

*Different approaches for in silico antimalarial drug discovery.*

#### **3. Drug target selection using genomics data**

The first step in a drug discovery project is to identify a potential target. Transcriptomics data can be used effectively for target identification. Differential gene expression analysis provides information about differences in gene expression between normal and diseased states. For malaria parasites, the PlasmoDB database [13] provides a large variety of transcriptomics datasets at different stages of the life cycle or at different times after the parasite infects the RBCs. However, bulk RNA-sequencing data cannot resolve cell-to-cell variability within a population, so some essential features may remain undetected. For example, a recent single-cell RNA-seq experiment was able to differentiate between sexually and asexually committed schizonts within a population of parasites [14]. With the advent of single-cell RNA sequencing (scRNA-seq), it is now possible to measure RNA levels on the entire genome scale to gain insights into cellular processes and illuminate the specifics of many important molecular events such as alternative splicing, gene fusion, single-nucleotide variation, and differential gene expression. ScRNA-seq enables analysis of individual cell transcriptomes and is generally used to examine transcriptional similarities and variations within a cell population. RNA sequencing technologies continue to advance and provide new ideas for understanding biological processes. Early findings revealed previously unrecognised levels of heterogeneity in embryonic and immune cell populations [15, 16]. Thus, the analysis of heterogeneity remains a core reason to embark on scRNA-seq studies. Similarly, assessments of transcriptional variations between individual cells have been used to distinguish unusual cell populations that would otherwise go undetected in pooled cell analysis [17], such as malignant tumor cells within a tumor mass [18], or hyper-responsive immune cells within an otherwise homogeneous group [19].
The scRNA-seq technique is also ideal for the examination of individual cells where each cell is essentially unique, such as individual T lymphocytes expressing highly diverse T-cell receptors [20], brain neurons [21], or early-stage embryo cells [22]. In scenarios such as embryonic development, cancer, myoblast and lung epithelium differentiation, and lymphocyte fate diversification, scRNA-seq is also increasingly used to trace the lineage and developmental relationships between heterogeneous yet related cellular states [22–27].

*Drug Design for Malaria with Artificial Intelligence (AI) DOI: http://dx.doi.org/10.5772/intechopen.98695*

In addition to resolving cellular heterogeneity, scRNA-seq can also provide important information on essential gene expression characteristics, including the expression of monoallelic genes [15, 28, 29], splicing patterns [30], and noise during transcriptional responses [30–32]. Importantly, the study of gene coexpression patterns at the single-cell level could enable the identification of coregulated gene modules and even inference of the gene-regulatory networks underlying functional heterogeneity and cell-type specification [32, 33]. Depending on the procedure used to generate the mRNA data, much more information can be extracted, such as how many genes can be detected, whether a particular gene of interest is being expressed, or whether there has been differential splicing. Several single-cell RNA-seq experiments have been performed on *Plasmodium* parasites in the last couple of years, paving a much more sophisticated way to characterise gene expression at different stages of the life cycle [14, 34–36]. Both supervised and unsupervised learning methods can be used to identify differentially expressed genes at different life cycle stages [37, 38]. This information can be harnessed in a gene ontology analysis or a genome-scale metabolic model to identify the function of the genes and, subsequently, potential targets for a drug.
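As a rough illustration of what a differential expression comparison computes, the following sketch ranks genes by log2 fold change between two groups of cells. It is a minimal example under stated assumptions: the gene names, the counts, the pseudocount of 1, and the threshold of 1 are all hypothetical, and a real analysis would add normalization, a statistical test, and multiple-testing correction.

```python
import math
from statistics import mean


def log2_fold_changes(cond_a, cond_b, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression in cond_b versus cond_a."""
    changes = {}
    for gene, values_a in cond_a.items():
        # The pseudocount (an assumption) avoids division by zero for silent genes.
        mean_a = mean(values_a) + pseudocount
        mean_b = mean(cond_b[gene]) + pseudocount
        changes[gene] = math.log2(mean_b / mean_a)
    return changes


def differentially_expressed(changes, min_abs_lfc=1.0):
    """Genes whose absolute log2 fold change reaches the chosen threshold."""
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= min_abs_lfc)


# Hypothetical counts for three genes in three cells per group.
asexual = {"gene_A": [10, 12, 11], "gene_B": [50, 55, 45], "gene_C": [3, 3, 3]}
sexual = {"gene_A": [11, 10, 12], "gene_B": [7, 7, 7], "gene_C": [31, 31, 31]}

lfc = log2_fold_changes(asexual, sexual)
hits = differentially_expressed(lfc)
print(hits)
```

Genes returned by such a filter would then be candidates for the gene ontology or metabolic-model analysis described above.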

#### **4. Drug discovery with AI**

The classical approach to drug discovery is a time-consuming and complex process. It takes almost 12 years to discover a drug, with the cost soaring to billions of dollars. Several pharmaceutical companies are working on drug discovery, but 90 percent of all drug discovery programs fail due to limitations at both the computational and the clinical phases. Drug discovery can be divided into four major steps: (1) target selection and validation; (2) compound screening and lead optimization; (3) preclinical studies; and (4) clinical trials. Biopharmaceutical industries are focusing on computational approaches in order to enhance the drug discovery process, to reduce research and development expenses by lowering failure rates in clinical trials, and ultimately to generate superior medicines. Different machine learning approaches help to identify drug targets, find suitable molecules in data libraries, suggest chemical modifications, etc. We discuss below how computational approaches help in each step of the drug discovery process.

#### **4.1 Primary drug screening with AI**

#### *4.1.1 Image processing and usage of AI to sort and classify cells*

AI technology performs well at classifying images that contain various objects or features [39, 40]. Dimension-reduction techniques such as principal component analysis (PCA) can be used to reduce the number of image features, after which AI-based techniques can classify the cells [41]. The least-squares support vector machine (LS-SVM), which supports both classification and regression, shows the highest classification accuracy (95.34%). Modern devices such as image-activated cell sorting (IACS) measure the optical, electrical, and mechanical properties of cells for highly versatile and scalable automated cell sorting. These instruments use neural network algorithms for decision-making together with high-speed digital image processing. AI has also recently been used to interpret computerized electrocardiography (ECG), where it plays a significant role in the diagnostic and clinical treatment workflow.
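The reduce-then-classify idea can be sketched in a few lines. The example below is only an illustration, not the method of the cited studies: it extracts the first principal component by power iteration and then uses a nearest-centroid rule as a simple stand-in for LS-SVM. The three image features per cell and all values are hypothetical.

```python
import math
import random
from collections import defaultdict


def center(data):
    """Subtract the per-feature mean; return centered rows and the means."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data], means


def first_component(centered, iterations=200, seed=0):
    """First principal axis via power iteration on X^T X."""
    rng = random.Random(seed)
    d = len(centered[0])
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(row, means, component):
    """Coordinate of one sample along the principal axis."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


def fit(data, labels):
    """Learn the axis and the mean projected score of each class."""
    centered, means = center(data)
    component = first_component(centered)
    scores = defaultdict(list)
    for row, label in zip(data, labels):
        scores[label].append(project(row, means, component))
    centroids = {lab: sum(s) / len(s) for lab, s in scores.items()}
    return means, component, centroids


def predict(row, model):
    """Assign the class whose centroid is closest in the reduced space."""
    means, component, centroids = model
    score = project(row, means, component)
    return min(centroids, key=lambda lab: abs(centroids[lab] - score))


# Hypothetical features per cell: area, mean intensity, eccentricity.
cells = [[1.0, 2.0, 0.5], [1.2, 2.1, 0.4], [0.9, 1.9, 0.6],
         [3.0, 4.0, 0.5], [3.2, 4.2, 0.6], [2.9, 3.9, 0.4]]
labels = ["healthy"] * 3 + ["infected"] * 3

model = fit(cells, labels)
print(predict([3.1, 4.1, 0.5], model))
```

In practice, many components would be retained and a stronger classifier used; the single axis here only keeps the sketch short.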

#### **4.2 Secondary drug screening with AI**

#### *4.2.1 Physical properties predictions*

In drug design, bioavailability, bioactivity, and toxicity are very important defining characteristics of a compound. The partition coefficient (logP) and the melting point affect a drug molecule's bioavailability. The melting point of a drug is an indicator of how easily it dissolves in water, whereas logP quantifies its relative solubility between an oily phase (typically octanol) and water. logP is used to estimate cellular drug absorption. Molecular representations used in AI drug design algorithms that take these properties into account include molecular fingerprints, SMILES (simplified molecular-input line-entry system) strings, potential energy measurements (e.g., from ab initio calculations), molecular graphs with varying weights for atoms or bonds, Coulomb matrices, molecular fragments or bonds, and atomic coordinates in 3D [42]. These inputs can be used in the DNN training phase and can be processed by different DNNs in different stages, including generative and predictive stages. In a typical example, the generative stage is trained on SMILES inputs to generate chemically feasible SMILES strings, while the predictive stage is trained to predict molecular properties. Although the two stages are initially trained separately using supervised learning algorithms, they can subsequently be trained jointly, for example with reinforcement learning (RL), and biases toward desired characteristics can then be introduced by rewarding or penalising specific properties.
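To make the SMILES input format concrete, the sketch below tokenizes a SMILES string and one-hot encodes it, which is one common way to feed such strings to a neural network. The regular expression is a simplifying assumption (it handles bracket atoms, Cl, Br, and two-digit ring closures, but is not a full SMILES parser), and the function names are illustrative only. The example string is a SMILES for chloroquine.

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens,
# %NN ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the data set to an integer index."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def one_hot(smiles, vocab):
    """Encode a SMILES string as one row per token, one column per vocab entry."""
    rows = []
    for tok in tokenize(smiles):
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        rows.append(row)
    return rows


chloroquine = "CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12"
vocab = build_vocab([chloroquine])
encoded = one_hot(chloroquine, vocab)
print(len(encoded), "tokens,", len(vocab), "distinct symbols")
```

Treating `Cl` as a single token, rather than as `C` followed by `l`, is the main reason to tokenize before encoding.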

#### *4.2.2 Predictions of bioactivity and toxicity*

The toxicity and bioactivity profiles are significant properties of a compound. Matched molecular pair (MMP) analysis can be used to explore how a local change in a drug molecule affects its molecular properties and bioactivity [43]. MMP analysis is used in quantitative structure–activity relationship (QSAR) studies [43]. Machine learning techniques such as random forest (RF), gradient boosting machines (GBMs), and DNNs, which were previously applied without MMP, are combined with it to extrapolate to new transformations, new fragments, and modifications of the static core. When it comes to predicting compound activity, DNNs outperform RF and GBMs. Owing to the rapid growth of public databases (such as ChEMBL and PubChem) containing a significant amount of structure–activity relationship (SAR) data, MMP with ML has been used to predict many bioactivity-related properties, such as oral exposure, the distribution coefficient (log D) [44, 45], intrinsic clearance, absorption, distribution, metabolism, and excretion (ADME), and mode of action. A few methods for predicting the bioactivity of a drug candidate have been developed recently. For example, some researchers used a graph convolutional encoder network to extract the signatures of drug target sites and to convert discrete chemical structures into a continuous latent vector space (LVS). The LVS enables gradient-based optimization in molecular space, allowing predictions to be made with differentiable models of binding affinity and other properties. The DeepTox algorithm is an important tool for toxicity prediction [46].
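The core bookkeeping of an MMP analysis can be illustrated as follows. This is a minimal sketch assuming the molecules have already been fragmented into a shared core and a single varying substituent (in practice a cheminformatics toolkit performs that step); the core labels, substituents, and activity values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def mmp_transform_effects(compounds):
    """Average activity change for each substituent swap on a shared core.

    compounds: iterable of (core, substituent, activity) tuples.
    Returns {(substituent_from, substituent_to): mean activity difference},
    with each pair ordered alphabetically by substituent.
    """
    by_core = defaultdict(list)
    for core, substituent, activity in compounds:
        by_core[core].append((substituent, activity))

    deltas = defaultdict(list)
    for members in by_core.values():
        # Every two compounds sharing a core form a matched pair.
        for (sub_a, act_a), (sub_b, act_b) in combinations(sorted(members), 2):
            if sub_a != sub_b:
                deltas[(sub_a, sub_b)].append(act_b - act_a)
    return {swap: mean(values) for swap, values in deltas.items()}


# Hypothetical activities (for example pIC50) for compounds on three cores.
compounds = [
    ("core_1", "H", 5.0), ("core_1", "OMe", 6.0),
    ("core_2", "H", 4.5), ("core_2", "OMe", 5.0),
    ("core_3", "H", 6.0),  # no partner, so it contributes no pair
]

effects = mmp_transform_effects(compounds)
print(effects)
```

The per-transformation averages are the kind of quantity that an ML model would then be trained to extrapolate to unseen cores or fragments.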

#### **4.3 AI in drug design**

#### *4.3.1 Prediction of the target protein's 3D structure*

The 3D structure of a target protein's ligand-binding site is usually used to design new drug molecules [47, 48]. For this purpose, researchers have used homology modelling and de novo protein design in the past [49–51]. With the development of AI-based approaches, the 3D structure of a target protein can be predicted more accurately. The AI tool AlphaFold was used to predict 3D protein structures in a recent Critical Assessment of protein Structure Prediction (CASP) contest and performed remarkably well. AlphaFold correctly predicted 25 of 43 structures using only primary protein sequences, outperforming the second-place finisher, who correctly predicted just three of the 43 test sequences. AlphaFold is based on deep neural networks (DNNs) trained to predict properties of proteins from their primary sequences. It predicts the distances between pairs of amino acids as well as the angles between neighbouring peptide bonds. These two predictions are then combined into a score that estimates the accuracy of a proposed 3D protein structure model. AlphaFold uses these scoring functions to explore the protein structure landscape and find structures that match the predictions.
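The idea of scoring a candidate structure against predicted inter-residue distances can be illustrated with a toy example. This is not AlphaFold's actual potential: it is a simplified sketch that uses only distances (no angles), a negative mean squared deviation as the score, and hypothetical coordinates for three residues.

```python
import math


def pairwise_distances(coords):
    """Distances between all residue pairs (i < j) of a candidate structure."""
    n = len(coords)
    return {(i, j): math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)}


def distance_score(coords, predicted):
    """Negative mean squared deviation from the predicted distances.

    Higher is better; 0 means the candidate matches every prediction.
    """
    actual = pairwise_distances(coords)
    errors = [(actual[pair] - dist) ** 2 for pair, dist in predicted.items()]
    return -sum(errors) / len(errors)


# Hypothetical predicted distances for a three-residue fragment.
predicted = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): math.sqrt(2)}

candidates = {
    "model_a": [(0, 0, 0), (1, 0, 0), (0, 1, 0)],  # agrees with predictions
    "model_b": [(0, 0, 0), (2, 0, 0), (0, 1, 0)],  # residue 1 misplaced
}

scores = {name: distance_score(xyz, predicted) for name, xyz in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A structure search would repeatedly propose or refine candidates and keep those that raise this kind of score.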
