
*Mosquito-Borne Viral Diseases: Control and Prevention in the Genomics Era*


*DOI: http://dx.doi.org/10.5772/intechopen.88769*


*Vector-Borne Diseases - Recent Developments in Epidemiology and Control*

is underlined by the increasing number of countries reporting transmission of mosquito-borne viruses. Transmission of arboviruses such as Zika, dengue, chikungunya, yellow fever, and Rift Valley fever has been reported in 85, 111, 106, 43, and 39 countries, respectively [5]. Projections indicate that 3.83 billion people are living in areas prone to transmission of dengue, and it is predicted that by 2050 large increases in dengue suitability will be seen in southern Africa and in the Sahel in West Africa [14]. Bhatt et al. estimated the global burden of dengue at 96 million infections per year worldwide, a figure that covers infections manifesting at any level of disease severity [6]. The Americas, comprising North and South America, registered more than 2 million dengue cases in 2016 and more than 1.4 million cases in 2019 [7]. For chikungunya fever, the Americas registered more than 94,000 cases in 2018, and in that same region Zika fever accounted for more than 650,000 cases in 2016 [8, 9]. High numbers of cases of arboviral diseases were also registered in other regions in recent years, such as the Western Pacific region, where more than 375,000 suspected dengue cases were reported in 2016 [10]. In Africa, the government of Congo reported 6149 suspected cases of chikungunya up to April 2019, and more than 13,000 chikungunya cases were reported in Sudan up to October 2018 [11, 12]. The increase in the frequency and distribution of arboviral diseases in recent years represents a worrying burden not only for the public health system, but also for the economic sector [3]. Some estimates of the economic costs of arboviral infections have been made. For dengue, the median cost of all reported dengue hospital admissions registered in a municipality in Brazil was estimated at US\$ 259.9 per hospitalization [13, 17]. In the Maldives, in the Indian Ocean, dengue fever represented a total cost of \$3 million in 2015 [14]. Another estimate indicated that hospitalized West Nile fever cases in the US represented a total cumulative cost of \$778 million between 1999 and 2012 [15].

Dengue and chikungunya are two arboviral diseases on the World Health Organization list of neglected tropical diseases. Neglected tropical diseases are a group of diseases that have received insufficient public attention, thrive in tropical and subtropical areas, and strongly affect populations living in poverty [12]. It is argued that arboviruses can be considered a group of neglected tropical diseases, since they can have a long-lasting impact on the health and economic life of affected populations [16]. Some studies have argued that socioeconomic factors and land-use changes, together with the effects of climate change, global travel and trade, modulate the dynamics of expansion of emerging and re-emerging mosquito-borne diseases [17–20]. Movement of people between neighboring countries has been considered a good predictor of chikungunya spread in the Caribbean and Indian Ocean [14]. The expansion of the geographic distribution of arboviruses has a significant negative impact on public health in many regions of the world. To reduce such impacts, it has been argued that public health would benefit from a surveillance system that monitors virus diffusion and the appearance of new genetic variants [21]. In this sense, genomic sequencing data and bioinformatics have been employed in the study of virus evolution, aiming to elucidate phylogenetic relationships and patterns of virus spread during an epidemic [22].

## **2. Genomic surveillance**

Infectious diseases continue to be one of the leading causes of death worldwide [23], and pathogens such as viruses can evolve and spread rapidly, leading to the emergence of newly mutated human pathogens, more virulent strains, and antibiotic- and drug-resistant organisms [24, 25]. In this context, genomic surveillance aims (i) to perform global surveillance of pathogens using whole genome sequencing and (ii) to understand drug resistance and the emergence and spread of viral pathogens. Several approaches have been developed and are widely used for the quick detection and identification of viral pathogens (i.e., diagnostics). Some of them are based on serological and molecular strategies including, for example, assays based on real-time polymerase chain reaction [26]. Even though these approaches offer high sensitivity and specificity for their purpose, they are suitable for diagnostics only and cannot provide detailed genomic information [27].

Bearing these limitations in mind, the main purpose of developing new genomic surveillance tools is to answer questions that are important for genomic surveillance but cannot be addressed by conventional RT-qPCR or serology: (i) RT-qPCR assays do not allow genotype classification, nor do they help identify particular or characteristic transmission routes; (ii) RT-qPCR assays do not make it possible to determine how fast a viral pathogen is being transmitted or in what direction it is spreading; (iii) serological and molecular assays cannot help identify epidemiologically linked individuals or predict future outbreaks; and (iv) serological and some molecular approaches cannot identify novel pathogenic agents and are therefore unsuitable for pathogen discovery [27].

Next-generation sequencing (NGS) technologies produce significantly more raw data than other molecular diagnostic assays, including Sanger sequencing, and can inform not just pathogen diagnostics but also epidemiology [28]. This is why whole genome sequencing of viruses using new technologies plays an important role in the fight against emerging and re-emerging epidemics [29, 30]. The availability of high-throughput sequencing has also provided immense insights into the ecology of health care-associated pathogens [31]. Real-time sequencing of entire pathogen genomes has therefore become a standard and indispensable research tool for genomic surveillance in the prevention and control of emerging infectious diseases [32], and NGS can be considered a powerful strategy that also allows the discovery of potentially novel viral pathogens [33, 34].

With pathogen surveillance in mind, bioinformatics tools and the combination of genomic and epidemiological data from viral infections can provide essential information for understanding the past and the future of an epidemic. Genomic data generated by real-time sequencing can show how and when viruses were introduced into a particular site, the pattern and determinants of their dissemination to neighboring locations, and the extent of their genetic diversity, i.e., their dynamics. This makes it possible to establish an effective surveillance framework for tracking the spread of infections to other geographic regions [21, 22, 34]. In this context, recently established international networks for real-time, portable genomic sequencing, genomic surveillance and data analysis have made it possible to monitor the evolution of viral genomes, to understand the origins of outbreaks and epidemics, to predict future outbreaks and to help keep diagnostic methods up to date [33–35]. Additionally, a genomic surveillance framework makes it possible to determine, through genome sequencing, the real-time molecular epidemiology of viruses circulating and co-circulating in different regions of a specific area, and to detect and characterize the early emergence of new pathogens in large urban centers, generating data that can inform outbreak control responses [27, 34]. Data on the molecular, epidemiological, phylogenetic and geographical aspects of viral pathogens circulating in a specific setting contribute to a better understanding of those viral infections in a national and international context, and play an important role in solving issues relevant to public health [35]. As a result, studies involving more in-depth molecular and dispersion analysis of circulating pathogens may help the World Health Organization adopt appropriate measures to control epidemics and to monitor the dynamics and spread of new viral strains.

However, even though NGS has advantages over routine diagnostics, the different strategies and technologies developed by Illumina, Thermo Scientific, Oxford Nanopore and others are not yet a panacea. Remaining challenges include high data throughput, which requires sophisticated computational processing and the annotation of large amounts of sequencing data, and high DNA or RNA input requirements (in some cases hundreds of nanograms), which often call for prior PCR-based amplification. In addition, relatively few researchers in the area have sufficient bioinformatics expertise and are able to engage in near-patient or disease surveillance activities [35].

## **3. Bioinformatics tools and phylogenetic tools**

The advent of next-generation sequencing (NGS) and advancements in bioinformatics present an opportunity to tap into new insights that are crucial to the establishment of an open, global digital surveillance system. NGS technologies have enabled the production and deposit of vast numbers of whole genomes in public repositories [36–38], ushering the field of genomics into the era of big data. This has in turn increased the scale of genomic studies from the analysis of a single genome or a few genomes to an ever-increasing number of genomes [39, 40].

Toward the development of a global surveillance system, bioinformatics provides the tools to answer pertinent questions, including the identification of the organisms responsible for an outbreak, the source of an outbreak, and evolutionary information on pathogens that is crucial for understanding unique phenotypes such as drug resistance, virulence and disease outcome.

Several bioinformatics tools and pipelines have been developed to facilitate the processing, analysis and visualization of these data in order to derive useful information from them [41]. The major fields of interest addressed by these tools include comparative genomics, which involves comparing the genetic content of one organism against that of another; prediction of the function of genes and of the sequences of coding regions; identification of evolutionary events; and inference of phylogenetic relationships. These fields of study play a critical role in elucidating pathogen evolution, niche adaptation, population structure and host-pathogen interaction. Furthermore, these findings inform vaccine and drug design, as well as the identification of virulence genes.
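As a toy illustration of what comparing the genetic content of one organism against another can look like in practice, the sketch below computes the proportion of differing sites (the p-distance) between pairs of pre-aligned sequences. Pairwise distances of this kind are one common input to distance-based phylogenetic inference. The function names and the sample sequences are hypothetical, and the sketch assumes the sequences are already aligned to equal length; it is not the method of any tool reviewed in this chapter.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def distance_matrix(aligned: dict) -> dict:
    """Pairwise p-distances for every pair of named, aligned sequences."""
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for x, y in combinations(sorted(aligned), 2)
    }


# Hypothetical aligned fragments, for illustration only.
example = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "ACGAAC-T",
}
matrix = distance_matrix(example)
```

Skipping gapped and ambiguous sites is one simple convention among several; real analyses usually apply an explicit substitution model instead of raw proportions.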

## **4. Bioinformatics pipelines and workflows**

Bioinformatics pipelines and workflows comprise a series of third-party executable command-line programs assembled to perform a specific task or analysis. A complete pipeline will, therefore, be able to support the end-to-end analysis of a given field of study such as phylogenetics or variant detection. Pipelines can thus be broken down into two major components, i.e., the data processing component and the analytical component that performs the core analysis of the pipeline. Below, we review some of the prominent bioinformatics pipelines and workflows that support the processing and analysis of NGS data to provide insights relevant to the global surveillance of arboviral outbreaks.
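The two-component structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any pipeline reviewed here: real pipelines typically wrap third-party command-line tools, whereas this sketch uses plain Python functions as stand-ins. The `Read` class, the function names and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Read:
    """A sequencing read with per-base Phred quality scores (hypothetical type)."""
    name: str
    sequence: str
    qualities: List[int]


def process_reads(reads: Iterable[Read],
                  min_mean_quality: float = 20.0,
                  min_length: int = 30) -> List[Read]:
    """Data processing component: drop short or low-quality reads."""
    kept = []
    for read in reads:
        if len(read.sequence) < min_length or not read.qualities:
            continue
        mean_quality = sum(read.qualities) / len(read.qualities)
        if mean_quality >= min_mean_quality:
            kept.append(read)
    return kept


def analyse_reads(reads: List[Read]) -> Dict[str, float]:
    """Analytical component: summarise the reads that passed processing."""
    total_bases = sum(len(r.sequence) for r in reads)
    gc_bases = sum(r.sequence.count("G") + r.sequence.count("C") for r in reads)
    return {
        "n_reads": len(reads),
        "total_bases": total_bases,
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }


def run_pipeline(reads: Iterable[Read]) -> Dict[str, float]:
    """Chain the processing component into the analytical component."""
    return analyse_reads(process_reads(reads))


# Hypothetical input: one good read, one low-quality read, one short read.
sample_reads = [
    Read("read_1", "ACGT" * 10, [30] * 40),
    Read("read_2", "ACGT" * 10, [10] * 40),
    Read("read_3", "ACG", [30] * 3),
]
summary = run_pipeline(sample_reads)
```

Keeping the two components as separate functions means the analytical step can be swapped (for example, for variant detection or phylogenetics) without touching the read filtering.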

## **5. Virus discovery and identification tools**

Viral discovery and identification from isolates and metagenomic samples present major challenges to bioinformatics in general. This is because of the very high variability of viral genomes and their deviation from reference genomes [42], the continuous emergence of new viruses with no available references, high intrapopulation diversity, and the relative rareness of viral DNA fragments in metagenomic samples [43]. These challenges have largely been addressed through the following pipelines.

### **5.1 Genome Detective**

Genome Detective (http://www.genomedetective.com/app/) is an easy-to-use web-based software application that assembles the genomes of viruses quickly and accurately, designed to generate and analyze whole or partial viral genomes directly from NGS reads within minutes [44]. The application gains accuracy by using a novel alignment method that combines amino acid and nucleotide scores to construct genomes by reference-based linking of de novo contigs. Speed and accuracy are also gained by using DIAMOND [45] with a UniRef90 reference dataset to sort viral taxonomy units. The use of DIAMOND and UniRef90 allowed Genome Detective to identify viral short reads at least 1000 times faster than when BLASTN and the viral nucleotide database of NCBI were used. The software was optimized using synthetic datasets to represent the great diversity of virus genomes. The application was then validated with next-generation sequencing data of hundreds of viruses.

### **5.2 VirusTAP: viral genome-targeted assembly pipeline**

VirusTAP is a web-based, integrated NGS analysis tool designed to facilitate rapid and accurate viral genome assembly from raw reads by just clicking on several selections. Like Genome Detective, it ensures that non-viral reads are eliminated prior to de novo assembly in order to ensure performance is not compromised.

One of the major difficulties in this process is the correct de novo assembly of viral genomes from crude metagenomic deep sequencing reads, which include large amounts of bacterial and human-related sequencing reads. Such read contamination often overloads the server during de novo assembly and might cause misassembly of the resultant contigs. Pre-filtering by host-mapping subtraction could lead to efficient de novo assembly, allowing the rapid and accurate procurement of a complete viral genome sequence. In addition to improving the accuracy of de novo assembly, the exclusion of human-related sequences can circumvent conflicting ethical issues by avoiding the analysis of patients' personal genetic information [46, 47].

### **5.3 Virus identification pipeline (VIP)**

VIP (https://github.com/keylabivdc/VIP) is a web-based virus discovery and identification tool [46]. With a single click, it will filter out background-related reads, classify reads on the basis of nucleotide and remote amino acid homology, and perform phylogenetic analysis to provide evolutionary insights.

### **5.4 TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data**

TAR-VIR is a non-reference based NGS analysis tool for the reconstruction of viral strains from metagenomic samples [46, 47]. It was developed to classify RNA
