**2. SARS-CoV-2 viral genome – development of molecular tests for diagnostic and surveillance of the emergent variants**

One of the first steps toward diagnosing coronavirus disease 2019 (COVID-19), which follows *SARS-CoV-2* infection, was identifying the infectious agent behind a new disease of unknown origin by characterizing its nucleic acid signature.

The first patient was hospitalized on 12 December 2019, and a viral genome sequence was released as early as 10 January 2020. The first metagenomic RNA sequencing report, on a sample of bronchoalveolar lavage fluid from a patient admitted to the Central Hospital of Wuhan on 26 December 2019 with a severe respiratory syndrome, identified a new RNA virus strain from the family *Coronaviridae*, which was later named *SARS-CoV-2* (Wuhan-Hu-1, GenBank accession number MN908947) [5]. The genome sequence obtained by deep meta-transcriptomic sequencing, including its termini, was confirmed by real-time reverse-transcription PCR (rRT-PCR). This marked the beginning of the "COVID-19" era: at that time rRT-PCR was routinely used to detect causative viruses in respiratory secretions, but it was not considered a gold standard diagnostic technique. What followed turned it into the gold standard for diagnosing COVID-19 and SARS-CoV-2 infection [6–9].

The first three determined genomes of the novel coronavirus (SARS-CoV-2), namely Wuhan/IVDC-HB-01/2019 (GISAID accession ID: EPI\_ISL\_402119) (HB01), Wuhan/IVDC-HB-04/2019 (EPI\_ISL\_402120) (HB04), and Wuhan/IVDC-HB-05/2019 (EPI\_ISL\_402121) (HB05), were compared [10]. The three genomes were almost identical, and the findings showed that the SARS-CoV-2 genome, which is approximately 30 kb in size, is a positive-sense, single-stranded RNA with a 5′-cap and a 3′-poly-A tail that contains 14 open reading frames (ORFs) encoding 27 proteins (**Figure 1**). The 5′-terminus contains the orf1ab and orf1a genes, which encode the polyproteins pp1ab and pp1a. These two polyproteins are further processed by the viral proteinases Nsp3 and Nsp5, resulting in 16 nonstructural proteins (Nsps), Nsp1 to Nsp16, responsible for viral replication. Together, the 16 nonstructural proteins form a replicase/transcriptase complex (RTC). The activity of this complex depends on the viral enzymes: the Nsp7-Nsp8 primase, the Nsp12 RNA-dependent RNA polymerase (RdRp), the Nsp13 helicase/triphosphatase, the Nsp14 exoribonuclease (the first identified proofreading enzyme encoded by an RNA virus), the Nsp15 endonuclease, and the Nsp10-Nsp16 N7- and 2′O-methyltransferases. The 3′-terminus encodes the four structural proteins, spike (S), envelope (E), membrane (M), and nucleocapsid (N), and eight accessory proteins (3a, 3b, p6, 7a, 7b, 8b, 9b, and orf14) [8, 10–13].

*SARS-CoV-2 Variant Surveillance in Genomic Medicine Era DOI: http://dx.doi.org/10.5772/intechopen.107137*

**Figure 1.**

*Schematic diagram of SARS-CoV-2 virus genome and most important encoded proteins.*

After SARS-CoV-2 genome sequences were obtained, the similarities and differences between SARS-CoV-2 and other SARS viruses made it possible to establish key genome sequences for use in diagnosis and surveillance. The release of the first SARS-CoV-2 sequence allowed rapid evaluation of rRT-PCR techniques for detecting specific sequences of the SARS-CoV-2 genome, and a diagnostic workflow was established immediately [6]. Sequences were selected to give the assay both sensitivity and specificity: detection of a sequence in the E gene provided sensitivity but not specificity, given its high similarity to other coronaviruses. Specificity came from primers targeting sequences in genes with less homology to other coronaviruses, such as N, S, Orf1ab, and RdRp (located in the ORF1ab gene). To further increase sensitivity, simultaneous detection of several targets has been employed [6, 8].

In addition to rRT-PCR as the standard method for diagnosing SARS-CoV-2 infection, other nucleic acid amplification tests (NAATs) have been used to detect viral RNA, including digital PCR (dPCR), reverse transcription loop-mediated isothermal amplification (RT-LAMP), and clustered regularly interspaced short palindromic repeats (CRISPR)-based assays. All of these could be useful tools for surveillance and timely identification of emerging strains. Moreover, NGS has been used since the beginning of the COVID-19 pandemic for the characterization and analysis of viral genetic material and for mutation surveillance [8].

Many studies have evaluated different NAAT strategies for the detection of SARS-CoV-2 and compared their sensitivity and specificity, and their conclusion was that rRT-PCR assays were significantly more sensitive than other methods [14]. However, for population surveillance, there is a need for detection methods that offer increased specificity, are less expensive, and are faster than NGS and rRT-PCR.

**dPCR** has several advantages over rRT-PCR: higher precision through absolute nucleic acid quantification, higher sensitivity, and lower susceptibility to PCR inhibitors and primer/template mismatches. However, this technique has a complicated workflow and depends on expensive instruments and consumables, which results in a higher cost per test [15, 16]. A further advantage is that some studies propose using dPCR to measure SARS-CoV-2 viral load directly from crude lysate, without nucleic acid purification [17].
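The absolute quantification mentioned above is usually obtained by counting positive partitions and applying a Poisson correction. The following is a minimal sketch of that calculation, not the procedure of any cited study; the function name, the partition counts, and the 0.85 nL partition volume are illustrative assumptions.

```python
import math


def dpcr_copies_per_microliter(positive, total, partition_volume_nl):
    """Estimate target copies per microliter of reaction from partition counts.

    Assumes targets are distributed randomly (Poisson) across partitions.
    """
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total partitions")
    # Poisson correction: a positive partition may contain more than one copy,
    # so the mean copies per partition is -ln(fraction of negative partitions).
    copies_per_partition = -math.log(1 - positive / total)
    # Convert the partition volume from nanoliters to microliters.
    return copies_per_partition / (partition_volume_nl / 1000.0)


# Hypothetical run: 5,000 positive partitions out of 20,000 (values assumed).
concentration = dpcr_copies_per_microliter(5000, 20000, partition_volume_nl=0.85)
print(f"{concentration:.1f} copies/uL")
```

Because the estimate depends only on the fraction of negative partitions, no standard curve is needed, which is what distinguishes it from the relative quantification of rRT-PCR.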

**RT-LAMP** was previously used to detect the Middle East respiratory syndrome coronavirus (MERS-CoV) and the severe acute respiratory syndrome coronavirus (SARS-CoV) during their global outbreaks. RT-LAMP is a reliable and rapid screening test that can also be used under non-laboratory conditions, but its sensitivity is poor, with a substantial percentage of positive patients remaining undetected [18].

**CRISPR-based assays** rely on CRISPR-associated (Cas) endonucleases, such as Cas12a and Cas13a, that recognize and cleave nucleic acids in a sequence-specific way. Recently, a CRISPR-based diagnostic platform that combines nucleic acid pre-amplification with CRISPR-Cas enzymology was established for the detection of SARS-CoV-2 RNA. Its great advantage is that detection via fluorescent and colorimetric readouts provides results in less than 1 hour. However, although the method is highly sensitive and specific, the multistep nucleic acid amplification process may affect precise target quantification. Additionally, the preparation and testing of reaction components need optimization [19, 20].

Although NAAT techniques have high sensitivity and specificity for the detection of SARS-CoV-2, the management of the COVID-19 pandemic required a faster detection method that would involve lower costs and could be performed without laboratory conditions or expertise. These needs led to the development of rapid tests that detect SARS-CoV-2 viral proteins. Such tests are used intensively to detect other viral and bacterial infections, but their limitation is lower specificity and sensitivity than NAAT-type tests. During SARS-CoV-2 infection, high concentrations of S and N proteins were detected in the nasopharynx and oropharynx of infected people, which made these proteins ideal diagnostic targets for the **detection of viral protein** by antigen–antibody (Ag-Ab) reaction. Thus, monoclonal antibodies against the viral N and S proteins react with the N and/or S proteins present in patients' specimens, and this interaction can be easily visualized [21–23]. The major limitations of this technique are that it can generate false negative results for patients with low viral loads and that it has lower sensitivity for samples with quantification cycle (Cq) values >30. Negative results need to be confirmed using molecular tests, particularly when the clinical context is suggestive of SARS-CoV-2 infection.

SARS-CoV-2 has proofreading mechanisms, which make its mutation rate lower than that of other RNA viruses such as HIV and influenza. However, selection pressure and immune evasion mechanisms have led to mutations that can affect the properties of the virus, so surveillance of viral evolution is essential.

Genomic surveillance involves the analysis of similarities and differences between sequences obtained by **viral genome sequencing**.

The development of **NGS** techniques has led to a huge amount of genomic sequence data [3]. As shown for emerging infectious diseases such as SARS, MERS, Zika, and Ebola, the **whole-genome sequencing (WGS) metagenomics technique** makes it possible to rapidly obtain the full sequence of pathogen genomes, trace the origins, spread, and transmission chains of outbreaks, and monitor pathogen evolution [24–28].

Metagenomics applications were used for the rapid identification and characterization of SARS-CoV-2 and brought critical novel information [5, 29]. The approach is simple, cost-effective, and does not require a reference sequence for analysis.

To obtain complete or nearly complete assemblies of the SARS-CoV-2 genome from clinical samples, shotgun metatranscriptomics (saturation RNA sequencing) has been used successfully. The method derives from host gene expression monitoring and consists of either enrichment of the poly(A)+ RNA fraction or depletion of host rRNA [30]. Depending on the manufacturer and NGS technology, the workflow consists of RNA fragmentation, first- and second-strand cDNA synthesis, and library preparation. Most studies were carried out on Illumina and Oxford Nanopore Technologies (ONT) platforms [31].

The **amplicon-based sequencing** approach was developed later, once more was known about the SARS-CoV-2 genome, because the method is highly specific. The typical workflow consists of first-strand cDNA synthesis followed by genome amplification with multiplex PCRs. The primers used in the multiplex PCRs produce a pool of amplicons that cover almost the entire viral genome. Amplicon sequencing is highly specific and robust, but it has limitations: because primer efficiencies differ, amplification across the genome can be biased, with decreased coverage in specific genomic regions, and the 3′ and 5′ UTRs may not be targeted, leading to an incomplete assembly [30]. For library preparation, several commercial and noncommercial protocols are available, and libraries can be sequenced on benchtop platforms (e.g., Illumina NextSeq and MiSeq, Ion Torrent platforms) [30].

**Hybrid capture-enrichment sequencing** is similar to amplicon-based sequencing in that it targets specific regions of a genome, but it enriches them through hybridization to specific biotinylated probes. This approach was initially developed for exome sequencing [32]. The resulting libraries can be sequenced on benchtop platforms (Illumina NextSeq and MiSeq, Ion Torrent, etc.). The hybrid capture-enrichment method uses a larger number of fragments/probes, which provides more complete profiling of the target sequences and makes it more robust to genomic variability [33].

**Direct RNA sequencing** is a relatively recent approach that does not require reverse transcription of RNA and allows the sequence of single nucleic acid molecules to be determined directly, without amplification [34]. This technology provides longer reads than regular NGS methods, but with higher error rates [35]. However, it can provide the sequence of single mature and precursor transcripts, as well as information about the complex transcriptional patterns that accompany coronavirus infection (recombination, alternative transcript maturation, rare transcriptional isoforms, etc.) [12].

The global NGS effort for SARS-CoV-2 during the COVID-19 pandemic generated a massive number of reads that had to be analyzed, organized, and stored in international databases with global access. NGS data analysis involves several essential steps: quality control of the NGS data, removal of host/rRNA data, read assembly, taxonomic classification, and virus genome verification [36].

#### **2.1 SARS-CoV-2 genome data analysis**

The assembly of the SARS-CoV-2 genome is a fairly straightforward process, as the viral genome is small and does not contain any large repetitive sequences. The main methods for assembling NGS data into a complete and accurate representation of the genome (highly contiguous and accurate assemblies) are based on Overlap Layout Consensus, de Bruijn graphs, or, more generally, reference-based assembly [37]. SARS-CoV-2 sequencing with 30x coverage is generally considered sufficient to generate a high-quality assembly [30]. Coverage depends on the sequencing platform and on the sequencing strategy; however, data obtained from targeted-enrichment-based library preparation methods (hybrid capture and amplicon sequencing) provide sufficient viral genomic reads.
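To make the 30x figure concrete, mean coverage can be approximated as the number of viral reads multiplied by the read length and divided by the genome length. The sketch below assumes the 29,903-nt length of the Wuhan-Hu-1 reference, uniform coverage, and a 150-nt read length; the function names are hypothetical.

```python
import math

# Length of the Wuhan-Hu-1 reference genome in nucleotides (assumed reference).
SARS_COV_2_GENOME_LENGTH = 29_903


def mean_coverage(n_viral_reads, read_length, genome_length=SARS_COV_2_GENOME_LENGTH):
    """Approximate mean depth, assuming reads are spread evenly over the genome."""
    return n_viral_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length=SARS_COV_2_GENOME_LENGTH):
    """Smallest number of viral reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


print(reads_needed(30, 150))  # 5981 viral reads of 150 nt for 30x mean coverage
```

This counts viral reads only. In a metagenomic library most reads derive from the host, so the total sequencing depth required is much higher, whereas amplicon data are rarely uniform, so per-position depth should still be checked.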

The first step of the bioinformatics workflow is to establish the quality of the reads. FASTQ files are processed for subsequent analyses by removing adapter sequences, filtering low-quality and low-complexity reads, correcting errors, etc. [36].
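As an illustration of the quality-filtering step, the sketch below parses four-line FASTQ records and keeps reads whose mean Phred score and length pass a threshold. It assumes Phred+33 encoding; the thresholds, read names, and helper functions are hypothetical, and production pipelines rely on dedicated trimming tools instead.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20, min_length=5):
    """Keep reads that are long enough and have a sufficient mean quality."""
    return [
        record
        for record in records
        if len(record[1]) >= min_length and mean_phred(record[2]) >= min_mean_quality
    ]


# Toy input: read1 is high quality, read2 is low quality, read3 is too short.
demo = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGT\n+\n########\n@read3\nACG\n+\nIII\n"
kept = filter_reads(read_fastq(StringIO(demo)))
print([name for name, _, _ in kept])
```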

Metagenomics sequencing protocols provide uniform coverage, but the number of viral reads depends on the viral load of the sample, and the data may contain reads derived from viral sub-genomic RNAs and replication intermediates [38]. For metagenomic read assembly, efficient software tools are currently available [39].

To obtain an accurate representation of the genomic sequence of a SARS-CoV-2 strain, a de novo assembly method can be used, although less sensitive methods, such as reference-guided assembly algorithms, are also available [40, 41].
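The idea behind de Bruijn graph-based de novo assembly can be shown on a toy example: reads are split into k-mers, each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by following the links. The sketch below is a deliberately simplified illustration that assumes error-free reads and no repeats; the reads are invented, and real assemblers handle errors, branching, and coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph):
    """Extend a contig from the single start node while the path is unambiguous."""
    successors = {node for targets in graph.values() for node in targets}
    starts = [node for node in graph if node not in successors]
    if len(starts) != 1:
        raise ValueError("toy walker expects exactly one start node")
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping on a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Invented overlapping reads drawn from the toy sequence ATGGCGTACCTGA.
toy_reads = ["ATGGCGTA", "GCGTACCT", "TACCTGA"]
contig = walk_unambiguous_path(de_bruijn_graph(toy_reads, k=4))
print(contig)
```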

#### **2.2 SARS-CoV-2 genome verification and classification**

Taxonomic classification is the step that follows once the reads are assembled into contigs. The quality of the contigs can be evaluated by read mapping. Reliable contigs with unassembled overlaps are fused to form longer viral contigs using contig assembly tools (e.g., SEQMAN and Geneious).

#### **2.3 SARS-CoV-2 phylogenetic analyses**

The best-known portals for real-time monitoring of the evolution of SARS-CoV-2 strains are Nextstrain [42] and HyPhy COVID-19 [43]. These systems provide real-time information on the worldwide distribution of the different clades and lineages of SARS-CoV-2 (Nextstrain) and detailed phylogenetic analyses of SARS-CoV-2 protein-coding genes (HyPhy).

#### **2.4 SARS-CoV-2 genomic data deposition and exploratory access**

At present, GISAID [44], with its EpiCoV portal, is the most widely used repository of SARS-CoV-2 genomic data. Along with the sequencing data, metadata are provided, including the type of sample, the sequencing technology and protocol, patient status (e.g., hospitalized or released), vaccination, etc.

Exploratory access is available from the three most popular portals for SARS-CoV-2 genome data: COG-UK [45], GISAID EpiCoV [44], and the NCBI [46].

Technical advances in NGS and bioinformatics have permitted fast identification of the causative agent of COVID-19, tracking of its global spread, and confirmation of genomic modifications as they occurred. Many bioinformatics resources are currently available, but big datasets pose challenges for data storage and analysis, and a solution must be found not only for the control of the current COVID-19 pandemic but also for the response to future outbreaks.

Although NGS is a very precise tool, allowing the detection of each mutation in a sample (thus being considered the gold standard in tracking the viral variants), it has a few drawbacks regarding the price, the duration, and the accessibility [47–49].

To overcome these limitations, other genotyping strategies have been developed [50]. Multiplex PCR tests that use either TaqMan probes or molecular beacon probes identify and monitor specific SARS-CoV-2 variants. Even though they target preselected, known mutations, they are faster and cheaper options and could easily be deployed in settings with limited resources as an alternative to genome sequencing methods [47, 51].

Newly emerged mutations had an important effect on RT-PCR: detection sensitivity could be reduced when the mutations were located where probes and primers bind [52]. For this reason, commercial kits that detected several genes, including RdRp and Orf1ab in addition to S and N, were used, and commercial multiplex tests for tracking mutations in the population, for surveillance, and for sequencing prioritization were rapidly developed.
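One simple way to screen for this problem is to compare a primer with its binding site in newly sequenced genomes and flag mismatches, particularly near the 3′ end, where they tend to impair extension most. The sketch below assumes a forward primer compared against the same-strand binding-site sequence; the sequences, the 5-nt window, and the function names are hypothetical and are not taken from any published assay.

```python
def primer_mismatches(primer, site):
    """Return the 0-based positions where the primer differs from its binding site."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return [
        index
        for index, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if p != s
    ]


def assess_primer(primer, site, three_prime_window=5):
    """Summarize mismatches and flag any within the last bases of the primer."""
    positions = primer_mismatches(primer, site)
    window_start = len(primer) - three_prime_window
    return {
        "mismatches": len(positions),
        "positions": positions,
        "three_prime_mismatch": any(index >= window_start for index in positions),
    }


# Invented 16-nt primer and two binding sites: reference and a mutated genome.
primer = "ACGTTGCAAGGCTTAC"
reference_site = "ACGTTGCAAGGCTTAC"
mutated_site = "ACGTTGCAAGGCTTTC"  # single substitution near the 3' end

print(assess_primer(primer, reference_site))
print(assess_primer(primer, mutated_site))
```

Running such a check against circulating sequences for every primer and probe in a kit is one reason multi-target assays are more robust: a mismatch that affects one target leaves the others detectable.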

Mutations in the S gene led to S gene target failure, the so-called S gene dropout, which generated false negative RT-PCR results. This test failure, however, was later turned into new pre-screening rRT-PCR assays that simultaneously detect del-HV69/70 and N501Y in order to distinguish between the B.1.1.7 and B.1.351 lineages, and it has also been used as a marker of the B.1.1.529 variant [51, 53].

A TaqMan SNP genotyping test, recently developed by a Taiwanese team [50], targets nine mutations in the receptor-binding domain of the spike protein of SARS-CoV-2 (delH69/V70, K417T, K417N, L452R, E484K, E484Q, N501Y, P681H, and P681R) and is designed to simultaneously detect five important variants (Alpha, Beta, Gamma, Delta, and Omicron).
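The logic of such a panel can be sketched as a lookup from the set of detected mutations to a variant call. The signatures below are a simplified illustration assembled from commonly reported spike mutations of each variant (with Omicron represented by a BA.1-like pattern); they are an assumption made for this example and are not the decision table of the cited assay.

```python
# The nine mutations targeted by the panel, as listed in the text.
PANEL = (
    "delH69/V70", "K417T", "K417N", "L452R", "E484K",
    "E484Q", "N501Y", "P681H", "P681R",
)

# Illustrative signatures (assumed, simplified); not the published assay logic.
SIGNATURES = {
    "Alpha": {"delH69/V70", "N501Y", "P681H"},
    "Beta": {"K417N", "E484K", "N501Y"},
    "Gamma": {"K417T", "E484K", "N501Y"},
    "Delta": {"L452R", "P681R"},
    "Omicron": {"delH69/V70", "K417N", "N501Y", "P681H"},
}


def call_variant(detected):
    """Return the variant whose signature exactly matches the detected mutations."""
    detected = set(detected)
    unknown = detected - set(PANEL)
    if unknown:
        raise ValueError(f"mutations not on the panel: {sorted(unknown)}")
    for variant, signature in SIGNATURES.items():
        if detected == signature:
            return variant
    # Unrecognized patterns are candidates for whole-genome sequencing.
    return "unassigned"


print(call_variant({"L452R", "P681R"}))
print(call_variant({"E484Q"}))
```

Exact matching keeps the example short; it also shows the main limitation of genotyping panels, namely that a sample carrying an unlisted combination of mutations cannot be assigned and has to be referred for sequencing.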

Molecular diagnostic companies are closely tracking data collected from laboratories all over the world in order to develop commercial multiplex genotyping kits that identify and screen variants as new significant functional mutations emerge.
