viral infections in a national and international context, assuming an important role in solving issues relevant to public health [35]. As a result, studies involving more in-depth molecular and dispersion analysis of circulating pathogens may help the World Health Organization adopt appropriate measures to control epidemics and to monitor the dynamics and spread of new viral strains. However, even though NGS has advantages over routine diagnostics, the strategies and technologies developed by Illumina, Thermo Scientific, Oxford Nanopore and others are not yet a panacea. Remaining challenges include high data throughput, which requires sophisticated computational processing and the annotation of large amounts of sequencing data, and high DNA or RNA input requirements (in some cases hundreds of nanograms), which often call for prior PCR-based amplification. On top of all this, relatively few researchers in the area have sufficient bioinformatics expertise and are able to engage in near-patient or disease surveillance activities [35].

*Vector-Borne Diseases - Recent Developments in Epidemiology and Control*

**3. Bioinformatics tools and phylogenetic tools**

The advent of next-generation sequencing (NGS) and advancements in bioinformatics present an opportunity to tap into new insights that are crucial to the establishment of an open, global digital surveillance system. NGS technologies have enabled the production and deposit of vast numbers of whole genomes into public repositories [36–38], ushering the field of genomics into the era of big data. This has in turn increased the scale of genomic studies from the analysis of a single or a few genomes to an ever-increasing number of genomes [39, 40].

Toward the development of a global surveillance system, bioinformatics provides the tools to answer pertinent questions, including the identification of the organisms responsible for an outbreak, the source of an outbreak, and the evolutionary information on pathogens that is crucial for understanding unique phenotypes such as drug resistance, virulence and disease outcome.

Several bioinformatics tools and pipelines have been developed to facilitate the processing, analysis and visualization of these data in order to derive useful information from them [41]. The major fields of interest addressed by these tools include comparative genomics, which involves comparing the genetic content of one organism against that of another; prediction of the function of genes and of the sequences of coding regions; identification of evolutionary events; and inference of phylogenetic relationships. These fields of study play a critical role in elucidating pathogen evolution, niche adaptation, population structure and host-pathogen interaction. Furthermore, these findings inform vaccine and drug design, as well as the identification of virulence genes.

**4. Bioinformatics pipelines and workflows**

Bioinformatics pipelines and workflows comprise a series of third-party executable command-line programs assembled to perform a specific task or analysis. A complete pipeline should therefore support the end-to-end analysis of a given field of study, such as phylogenetics or variant detection. Pipelines can thus be broken down into two major components: the data processing component, and the analytical component that performs the core analysis of the pipeline. Below, we review some of the prominent bioinformatics pipelines and workflows that support the processing and analysis of NGS data to provide insights relevant to the global surveillance of arboviral outbreaks.

**5. Virus discovery and identification tools**

Viral discovery and identification from isolates and metagenomic samples present major challenges to bioinformatics. Viral genomes are highly variable and often deviate substantially from reference genomes [42], new viruses with no available references emerge continuously, intrapopulation diversity is high, and viral DNA fragments are relatively rare in metagenomic samples [43]. These challenges have largely been addressed through the following pipelines.

## **5.1 Genome Detective**

Genome Detective (http://www.genomedetective.com/app/) is an easy-to-use web-based application designed to generate and analyze whole or partial viral genomes directly from NGS reads within minutes [44]. The application gains accuracy from a novel alignment method that combines amino acid and nucleotide scores to construct genomes by reference-based linking of de novo contigs. Speed and accuracy are also gained by using DIAMOND [45] with the UniRef90 reference dataset to sort reads into viral taxonomic units. The use of DIAMOND and UniRef90 allowed Genome Detective to identify viral short reads at least 1000 times faster than BLASTn with the NCBI viral nucleotide database. The software was optimized using synthetic datasets that represent the great diversity of virus genomes, and was then validated with NGS data from hundreds of viruses.
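To make the read-sorting step more concrete, the following is a minimal Python sketch of how protein-level hits could be used to place reads into taxonomic buckets. It assumes BLAST-style 12-column tabular hits (the layout DIAMOND emits with `--outfmt 6`) and a hypothetical lookup from reference protein accession to taxon; the function names, toy hits and lookup table are invented for illustration and do not reproduce Genome Detective's own implementation.

```python
import csv
import io
from collections import defaultdict

# Toy BLAST-style tabular hits (12 tab-separated columns); all values are invented.
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
EXAMPLE_HITS = "\n".join([
    "read1\tprotA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t98.0",
    "read1\tprotB\t80.0\t50\t10\t0\t1\t150\t5\t54\t1e-10\t60.0",
    "read2\tprotB\t90.0\t48\t5\t0\t3\t146\t20\t67\t1e-18\t91.5",
    "read3\tprotC\t70.0\t40\t12\t0\t1\t120\t1\t40\t1e-05\t45.0",
])

# Hypothetical lookup from reference protein accession to a taxonomic label.
SUBJECT_TO_TAXON = {"protA": "Flavivirus", "protB": "Alphavirus"}


def best_hits(tabular_text):
    """Keep, for every read, the hit with the highest bit score."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id, subject_id, bitscore = row[0], row[1], float(row[11])
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject_id, bitscore)
    return best


def bucket_reads(best, subject_to_taxon):
    """Group read identifiers by the taxon of their best-scoring subject."""
    buckets = defaultdict(list)
    for read_id, (subject_id, _score) in best.items():
        taxon = subject_to_taxon.get(subject_id, "unclassified")
        buckets[taxon].append(read_id)
    return {taxon: sorted(reads) for taxon, reads in buckets.items()}


buckets = bucket_reads(best_hits(EXAMPLE_HITS), SUBJECT_TO_TAXON)
```

Each bucket could then be assembled separately, which is the reason for sorting reads before assembly; reads whose best hit has no known taxon are kept apart rather than discarded.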

## **5.2 VirusTAP: viral genome-targeted assembly pipeline**

One of the major difficulties in viral metagenomic analysis is the correct de novo assembly of viral genomes from crude metagenomic deep sequencing reads, which include large numbers of bacterial and human-related reads. Such contaminating reads often overload the server during de novo assembly and can cause misassembly of the resulting contigs. Pre-filtering by host-mapping subtraction can make de novo assembly more efficient, allowing a complete viral genome sequence to be obtained rapidly and accurately. Besides improving the accuracy of de novo assembly, excluding human-related sequences avoids ethical issues, because the personal genetic information of patients is not analyzed [46, 47].

VirusTAP is a web-based, integrated NGS analysis tool designed to facilitate rapid and accurate viral genome assembly from raw reads with just a few clicks. Like Genome Detective, it eliminates non-viral reads prior to de novo assembly so that performance is not compromised.
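The idea of subtracting host reads before assembly can be illustrated with a toy Python sketch. Real pipelines map reads against a host genome with a dedicated read mapper; here, as a deliberately simplified stand-in, a read is discarded when a large fraction of its k-mers occurs in the host sequences. The function names, the value of `k` and the `max_host_fraction` threshold are assumptions made for illustration only and are not taken from VirusTAP.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host_reads(reads, host_sequences, k=8, max_host_fraction=0.5):
    """Keep only reads whose share of host-derived k-mers is below the threshold.

    reads: list of (name, sequence) tuples.
    host_sequences: list of host (e.g. human or bacterial) sequences.
    """
    host_index = set()
    for host_seq in host_sequences:
        host_index |= kmers(host_seq, k)

    kept = []
    for name, seq in reads:
        read_kmers = kmers(seq, k)
        if not read_kmers:
            continue  # too short to evaluate; dropped in this sketch
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction < max_host_fraction:
            kept.append((name, seq))
    return kept


# Toy data: one read copied from the "host", one unrelated read.
TOY_HOST = "ACGTACGTTAGCTAGCTAGGATCCGATCGATTACGGCTA"
TOY_READS = [
    ("host_like", TOY_HOST[5:30]),
    ("candidate_viral", "TTTTGGGGCCCCAAAATTTTGGGG"),
]
filtered = subtract_host_reads(TOY_READS, [TOY_HOST])
```

Only the reads that survive this filter would be passed on to the assembler, which reduces both the computational load and the chance of chimeric contigs.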

## **5.3 Virus identification pipeline (VIP)**

VIP (https://github.com/keylabivdc/VIP) is a web-based virus discovery and identification tool [46]. With a single click, it filters out background-related reads, classifies reads on the basis of nucleotide and remote amino acid homology, and performs phylogenetic analysis to provide evolutionary insights.

## **5.4 TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data**

TAR-VIR is a non-reference-based NGS analysis tool for the reconstruction of viral strains from metagenomic samples [46, 47]. It was developed to classify RNA viral reads from viral metagenomic data and to produce assembled viral strains (i.e., haplotypes) from the classified reads. It has two main components: (1) viral read classification using partial or remotely related reference genomes and (2) de novo assembly of viral haplotypes from the recruited reads with PEHaplo [47, 48], a haplotype reconstruction tool. Because TAR-VIR has a modular structure, users can substitute other assembly tools after the read classification in step (1).
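The notion of recruiting reads from only a partial reference can be sketched in a few lines of Python. The classification algorithm actually used by TAR-VIR is not described here, so the approach below (seeding with reads that share a k-mer with the partial reference, then iteratively adding reads that share a k-mer with anything already recruited) is an assumption chosen purely for illustration, as are all names and the toy sequences.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(reads, partial_reference, k=6):
    """Iteratively recruit reads that share a k-mer with the reference or with recruited reads.

    reads: dict mapping read name to sequence.
    Returns the set of recruited read names.
    """
    known = kmers(partial_reference, k)
    recruited = set()
    changed = True
    while changed:  # repeat until a full pass adds no new read
        changed = False
        for name, seq in reads.items():
            if name in recruited:
                continue
            read_kmers = kmers(seq, k)
            if read_kmers & known:
                recruited.add(name)
                known |= read_kmers  # recruited reads extend the known region
                changed = True
    return recruited


# Toy data: the reference covers only the first 15 bases of the genome.
TOY_GENOME = "ATGGCGTACCTTGAGCATCGGATTCAGTCCAAGGTCTGAC"
PARTIAL_REFERENCE = TOY_GENOME[:15]
TOY_READS = {
    "far_read": TOY_GENOME[18:38],    # no overlap with the reference itself
    "near_read": TOY_GENOME[5:25],    # overlaps the reference
    "unrelated_read": "GGGGGGAAAAAACCCCCC",
}
recruited = recruit_reads(TOY_READS, PARTIAL_REFERENCE)
```

In this toy case the read that lies beyond the reference is recruited only because it overlaps a read that was recruited first, which is what allows reconstruction to reach regions the partial reference does not cover. The recruited reads would then be handed to a haplotype assembler.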

**6. Genotyping tools**

While virus discovery and identification tools play a critical role in determining the pathogen responsible for an infection, they cannot determine the subtype or quasispecies responsible for an outbreak. Arboviruses exist as mixed populations of genomic variants owing to rapid replication and the error-prone nature of the viral RNA-dependent RNA polymerase (RdRp) [47]. Monitoring virus genotype diversity is therefore crucial to understanding the emergence and spread of outbreaks. Genotyping tools provide an efficient workflow that enables researchers and public health practitioners to determine the strain responsible for an outbreak.

Most free-access bioinformatics programs used to classify viruses into subtypes, genotypes, subgroups or groups rely on similarity search tools to determine the genotype of a new sequence. These genotyping tools use a set of reference genome sequences carefully selected to represent each individual genotype. Using several reference sequences per genotype increases the consistency and reproducibility of the results, speeds up the search, and ensures that classification is not limited by an inadequate reference set that fails to represent the diversity needed to identify the virus.

Similarity-based methods are useful for identifying recombination patterns in viral sequences, but their results lack statistical support and need further confirmation by phylogenetic methods.
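As a rough illustration of similarity-based genotyping against a curated reference panel, the Python sketch below assigns a query sequence to the genotype whose reference is most similar. Real tools use alignment-based similarity searches; the k-mer Jaccard score, the `min_similarity` cut-off and the toy reference panel used here are simplifying assumptions, not the method of any specific tool.

```python
def kmer_set(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_genotype(query, references, k=4, min_similarity=0.3):
    """Return (genotype or None, per-genotype scores).

    references: dict mapping genotype label to a non-empty list of reference sequences.
    A genotype's score is the best similarity over all of its references.
    """
    query_kmers = kmer_set(query, k)
    scores = {
        genotype: max(jaccard(query_kmers, kmer_set(ref, k)) for ref in refs)
        for genotype, refs in references.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < min_similarity:
        return None, scores  # too dissimilar to every reference: leave untyped
    return best, scores


# Hypothetical reference panel with one invented sequence per genotype.
TOY_REFERENCES = {
    "genotype_I": ["ATGGCGTACCTTGAGCATCGGATTC"],
    "genotype_II": ["ATGTTTAAACCCGGGTTTAAACCCG"],
}
# Query: the genotype_I reference with a single substitution.
call, scores = assign_genotype("ATGGCGTACCTTAAGCATCGGATTC", TOY_REFERENCES)
```

The sketch also shows the limitation noted above: the call is a best match with a score, not a statistically supported placement, so a confirmatory phylogenetic analysis remains necessary.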

Recently [49], four viral genotyping tools for yellow fever virus (YFV) (https://www.genomedetective.com/app/typingtool/yellowfevervirus/), dengue virus (DENV) (https://www.genomedetective.com/app/typingtool/dengue/), chikungunya virus (CHIKV) (https://www.genomedetective.com/app/typingtool/chikungunya/) and Zika virus (ZIKV) (https://www.genomedetective.com/app/typingtool/zika/) were developed and linked to Genome Detective to enable phylogenetic classification below the species level [50, 51].

## **6.1 CASTOR**

The classification and annotation of virus genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific well-studied families of viruses [43]. Viral comparative genomic studies could therefore benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.

CASTOR is a virus classification platform based on machine learning methods and inspired by a well-known technique in molecular biology: restriction fragment length polymorphism [52]. It simulates, in silico, the restriction digestion of genomic material into fragments by different enzymes, and it uses two metrics to construct the feature vectors for the machine learning algorithms in the classification step. The performance, genericity and robustness of CASTOR could permit novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
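The in silico digestion idea can be sketched in Python as follows. The recognition sites and cut positions for EcoRI, HindIII and BamHI are standard, but everything else is an assumption for illustration: the genome is treated as linear, and because the two metrics CASTOR uses are not named here, fragment count and mean fragment length serve as placeholder features. This is not CASTOR's implementation.

```python
# Recognition site and cut offset within the site (e.g. EcoRI cuts G^AATTC).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
}


def digest(seq, site, cut_offset):
    """Return the fragment lengths produced by cutting a linear sequence at every site."""
    seq = seq.upper()
    cut_positions = []
    start = seq.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = seq.find(site, start + 1)

    fragment_lengths = []
    previous = 0
    for position in cut_positions:
        fragment_lengths.append(position - previous)
        previous = position
    fragment_lengths.append(len(seq) - previous)  # last (or only) fragment
    return fragment_lengths


def feature_vector(seq, enzymes=ENZYMES):
    """Concatenate placeholder features (fragment count, mean length) per enzyme."""
    features = []
    for name in sorted(enzymes):  # fixed enzyme order keeps vectors comparable
        site, cut_offset = enzymes[name]
        lengths = digest(seq, site, cut_offset)
        features.extend([len(lengths), sum(lengths) / len(lengths)])
    return features


# Toy sequence containing one EcoRI site and one BamHI site.
TOY_SEQUENCE = "AAAGAATTCTTTGGATCCAAA"
toy_features = feature_vector(TOY_SEQUENCE)
```

Vectors of this kind, computed for genomes of known class, could be passed to any standard classifier; sequences that differ in their restriction patterns end up with different feature vectors.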

*Mosquito-Borne Viral Diseases: Control and Prevention in the Genomics Era*

*DOI: http://dx.doi.org/10.5772/intechopen.88769*

**7. Phylogenetic and phylodynamic tools**

Phylogenetic tools are an extremely important resource in virology: they are used to study viral evolution, trace the origin of epidemics, establish the mode of transmission, investigate the occurrence of drug resistance and determine the origin of the virus in different body compartments. The tools developed by bioinformatics are thus fundamental to monitoring the evolution of viral diversity, and they support the genomic sequence analyses that are crucial for the surveillance of viral polymorphism, the development of new therapeutic strategies, the development of vaccine products and the appropriate choice of products. Toward the development of a global outbreak surveillance system, the advances below have been made.

## **7.1 Nextstrain (https://nextstrain.org/)**

Nextstrain is a real-time pathogen evolution tracking platform that implements cutting-edge analysis and visualization of pathogen genome data [53]. It provides evolutionary information in the form of interactive visualizations to virologists, epidemiologists, public health officials and citizen scientists. It has been used to track various arboviral epidemics globally, including West Nile virus (WNV) in the Americas, Zika virus in 33 countries and dengue virus outbreaks in 64 countries. The platform is continually updated with publicly available datasets to provide new insights into viral epidemic outbreaks globally in an intuitive and visually esthetic manner.

**8. Functional prediction tools**

In disease surveillance, understanding the effect of the mutations detected in viral genomes through the methods described above is invaluable for the development of relevant controls and interventions [47]. Many of these mutations serve as drug targets and provide insights into how pathogens respond to existing interventions. A global surveillance system would therefore be incomplete without the capability to provide insights into the function of discovered mutations. Below we explore some of the tools that have been applied to understand the functional relevance of mutations found in arboviruses.

## **8.1 SIFT (Sorting Intolerant From Tolerant)**

The SIFT algorithm predicts the effect of coding variants on protein function [54, 55]. Since its introduction in 2001, SIFT has become one of the standard tools for characterizing missense variation. It has a corresponding website that provides users with predictions on their variants.

**9. Conclusion**

Augmenting epidemiological data with insights from genomic data provides a powerful tool for surveillance and control of disease outbreaks. Advances in

