**1. Introduction**

Next-generation RNA sequencing is rapidly replacing cDNA microarrays and quantitative PCR in clinics and in the public health laboratories, due to higher sensitivity and precision, as well as its cost-effective ability to identify novel RNA species. High-throughput sequencing has become widely adopted in infectious diseases. Supporting bioinformatics services have improved substantially, which aids in genotyping causative mutations for antimicrobial resistance, pathogens' virulence factors, and global epidemiological surveillance to monitor molecular dynamics of pathogen diversity [1, 2]. Integration of pathogen genome sequencing into infectious disease surveillance was established by United States Centers for Disease Control and Prevention (CDC) under Advanced Molecular Detection program in 2016 [2]. As described below, RNA sequencing became particularly

important to identify newly emerging HIV and SARS-CoV strains. RNA-Seq can be used for many different purposes, such as improve diagnostics, characterization of novel strains, investigation of disease epidemiology, spatiotemporal spread and transmission routes, assessment of evolutionary rates, and the development of countermeasures and public health policies.

The standard practice of viral RNA detection is composed of a two-step procedure - reverse transcription of RNA into cDNA and cDNA amplification. The reverse transcription reaction is prone to recombination errors leading to potential alterations in the amplicon library and inadequate sequence representation of the original sample. Although both the nucleic acid amplification test (NAAT) and realtime reverse transcriptase - polymerase chain reaction (RT-PCR) test are cost-effective, the occurrence of amplicon failures needs to be monitored for substitutions in primer binding sites. In order to circumvent possible mutations in the primer regions, it is recommended multiple primer designs targeting the same genes, when designing RT-PCR assays [3, 4]. The RNA-Seq method is capable to unbiasedly identify variants across thousands of target regions with single-base resolution, in cases where NAAT or qRT-PCR produce false-negative results.

Illumina's technology appears to be at the forefront of viral sequencing, however others such as IonTorrent and Oxford Nanopore technologies are quickly closing the gaps [5]. Illumina deploys sequencing-by-synthesis chemistry, which provides high accuracy and deep coverage of the viral genome. A limitation of Illumina is that short read length (< 400 nt) fragmented sequences should be re-assembled computationally, during which the haplotype information can be lost. Short-read sequencing data requires computational pipelines for trimming of low-quality reads, removal of optical duplicates, detection of reads orientation, and alignment to reconstruct full-length viral sequences. Short read RNA-Seq technologies allow to pull multiple samples per lane to reduce the cost, however, demultiplexing of reads from different samples can be bioinformatically challenging.

The recent shift toward emerging long-read technologies, enabled direct sequencing of individual RNA molecules as opposed to classical indirect approaches [6]. Long-read RNA-seq technologies, such as that provided by Ion Torrent or Oxford Nanopore, allow mapping of novel insertions, deletions, or substitutions in natural variants [7]. Thus, the technology offers advantages over established sequencing technologies, as it has simplified procedures for library preparation and bioinformatics analysis. Although direct RNA sequencing possesses low accuracy, than Illumina's short read method, it enables easier full-length sequencing, rapid viral genome annotation and analysis, which would be particularly useful for understanding changes in transcriptome architectures [8]. Additionally, long-read technologies are more cost-effective, portable, and provide robust reproducibility of the results, when compared to short-read RNA-seq [9]. A heterozygous variants can be detected when filtered by bioinformatic pipelines, which may identify coinfections and estimate the risk of superinfection [10]. Detailed descriptions and of long- and short- read methodologies, as well as analysis workflows, are provided in other chapters of this book.

#### **2. Sequencing of human immunodeficiency virus (HIV)**

The human immunodeficiency virus (HIV) is organized in the genus Lentivirus, within the family of Retroviridae, subfamily Orthoretrovirinae. HIV is a human retrovirus with an RNA genome that is composed of copies of a single stranded positive sense RNA [11]. The RNA genome encodes viral enzymes and is packed into nucleocapsid proteins (the viral capsid). After infecting the cell, the viral RNA

#### *Diagnostic Applications for RNA-Seq Technology and Transcriptome Analyses in Human… DOI: http://dx.doi.org/10.5772/intechopen.99156*

genome is reverse transcribed into double stranded viral DNA, which is subsequently integrated into cellular host DNA with the aid of viral and host cofactors. The HIV life cycle has several life-cycle phases; a prolonged latent phase of host genome integration, and a phase of active transcription of new viral RNA in the infected cell [11]. The of HIV replication process is characterized by high rates of nucleotide misincorporations, insertions, deletions, and recombinations. Twentieth century research studies, in 1980s and 1990s, uncovered only a partial viral genome. Determination of the HIV partial genome sequence provided the basis for the development of next-generation sequencing and real-time reverse transcriptase polymerase chain reaction (RT-PCR) methods.

Next-generation RNA-Sequencing enabled the first successful pan-HIV-1 sequencing, including subtype identification and phylodynamic diversification during the course of AIDS pandemic [12]. The current HIV nomenclature includes two types: HIV-1 and HIV-2. The HIV-1 is phylogenetically classified into of 4 groups (M, N, O, and P) [13]. Phylodynamic analyses and evolutionary clock models estimated the time of the most recent common ancestor (HIV-1 pandemic group) to be around the late 19th – early 20th centuries, in Central Africa, and where it may have been associated with an epidemic in primates [14]. The major (M) pandemic group, which causes human-to-human transmission, is classified into 9 subtypes. The most prevalent M subtypes are A, B, C, D, and G; while subtypes F, H, J, L and K are collectively responsible for approximately 1% of all HIV infections [14–16]. An important genetic feature of HIV is that it is prone to recombination. The highest levels of inter-subtype re-combinations are found in HIV-1 infected specimens, from patients in the Sub-Saharan region of Africa. Analysis of a whole-genome HIV-1 sequence, from the Congo Basin, identified ancestral HIV species which clustered basal to all major subtypes; many of which underwent purifying selection and are no longer in circulation. However, 72 additional recombinant forms of HIV remain in circulation [17].

Intra-individual mixtures of recombinant genomes have been reported throughout the world. It is hypothesized that inter-subtype recombinant viruses have an advantage of transmissibility over parental strains [18–24]. To achieve full viral genome sequence coverage, several approaches were employed. One such approach was the amplification of two large fragments ("half genomes"), spanning the full HIV-1 sequence, including all critical regions. Using this approach, molecular surveillance studies of circulating and recombinant forms of HIV, has been conducted in Cameron. Researchers used primers which target two overlapping "half genome" sequencing approach: the 5′ region (*gag* and *pol* genes), the 3′ region (*env* and *nef*), which overlap the accessory genes (*vif, vpr,* and *vpu*) [25]. The two-amplicon approach followed by deep sequencing allowed to characterize several novel circulating recombinant forms, CRFs, (CRF02\_AG, \_AE, \_01A1, and F2, CRF\_36cpx and \_37cpx, etc.) and more than a dozen unique recombinant forms (URF, NYU6541\_6, NYU6556\_3, etc.) that clustered separately from their reference sequences, as determined by the maximum likelihood phylogenetic algorithm [26]. The introduction of circulating recombinant forms (CRF01\_AE and CRF02\_AG) into the Asian-Pacific region from Continental Asia was identified using similar "half genome" sequencing approach [27]. Another approach, the "switching mechanism at the 5′ end of RNA transcript" (SMART), leverages the templateswitching capability of certain RT enzymes toward full-length template. During RT reaction, three additional nucleotides are added to the 3′ end of the first cDNA strand of viral RNA template, which serve as an anchor for selective amplification of full length template with a set of 5′ adaptor-ligated primer [28]. This method was adopted to capture full-length HIV sequence in clinical samples, in which viral loads were > 5 log10 copies/mL [29]. Neighbor-joining phylogenetics trees built from this study data revealed all known viral recombinant species from all four groups

(M, N, P, O), and rare M-subtypes (J and H). A successful strategy has been to study HIV species diversity by embedding a degenerate block of nucleotides, within the primer, during first round of cDNA synthesis, to avoid PCR-related errors (and sequencing errors [30, 31]. A unique ID (Primer ID), created by an incorporated primer tag, can correct allelic skewing during sequencing [32]. Using this approach the authors were able to identify minor variants in the HIV-1 protease gene resulting in multiple alleles that conferred drug-resistance [30].

The percentage of people living with HIV drug resistance appears to be increasing [33]. RNA-seq has been key to revealing so-called viral invasive genes and the acquisition of drug-resistance mutations throughout of the 9.7 Kb of viral genome [34]. Achievement of long amplicon length allowed quantification of viral load and multiple drug-resistant mutations using viral RNA and pro-viral DNA as a template [35–38]. RNA-based long-read sequencing is also being explored for surveillance of viral diversification and assessment of HIV quasispecies [39, 40]. The FDA has cleared Sentosa HIV-1 RNA-Seq genotyping SQ assay for clinical use, to detect and monitor antiretroviral drug resistance mutations in the blood of HIV-infected patients. This product helps clinicians to prescribe effective combination of antiretroviral drugs, such as reverse transcriptase, integrase, envelope (gp41), or CCR5/CXCR4 inhibitors [39]. Phylogenetic clustering of genotyped samples from HIV patients who diagnosed between 1999 and 2012 identified strong associations between molecular clusters and epidemiological hotspots for transmitted drug resistance [41].

### **3. Sequencing of SARS-CoV-2**

Viruses of the order Nidovirales - family Coronaviridae - subfamily Coronavirinae are positive-sense, single-stranded sequences of RNAs; ranging from 26 to 32 kilobases in length [42]. The novel SARS-CoV-2 species was identified via metagenomic RNA sequencing and made publicly available, in NCBI Virus, within two months of unknown pneumonia cases outbreak in China (accession number NC\_045512) [43]. The SARS-CoV-2 viral genome (made up of 29,903 nucleotides) is phylogenetically related to genus beta-coronavirus (subgenus sarbecovirus), which has demonstrated the ability to crossover, from the animal kingdom, and subsequently cause infection in humans [44]. A viral RNA genome encodes several structural proteins (spike, S; nucleocapsid, N; transmembrane, M; envelope, E). SARS-CoV-2 is known to produce 16 non-structural proteins, and at least six accessory proteins (encoded primarily by reading frames ORF1 and ORF2) [45]. Each viral transcript has a 5′ cap structure and a 3′ poly(A) tail [46]*.* Genomic characterization of SARS-CoV-2 receptor-binding domains, in conjunction with phylogenetic analysis and homology modeling, revealed the sequence for viral port of entry proteins that conferred human-to-human transmission [47]. Direct RNA sequencing via nanopore and DNA nanoball methodologies identified that SARS-CoV-2 generates a tiered series of canonical and non-canonical subgenomic RNAs, through a process involving homology recombinations between transcriptional regulatory sequences [48]. The function of these transcripts is currently unknown [49].

Sequencing-based genomic surveillance has been employed by UK, USA, and Canadian consortia, to coordinate efforts to sequence large numbers of SARS-CoV-2 genomes. An executive summary "Genomic sequencing of SARS-CoV-2" was issued by the WHO, in January 2021. It discusses the implementation of NGS, for "maximum impact on public health" and provide detailed overview of current sequencing methodologies [50].

In lockstep with the continuing advances of next-generation sequencing technologies, the bioinformatics community has developed many computational tools

#### *Diagnostic Applications for RNA-Seq Technology and Transcriptome Analyses in Human… DOI: http://dx.doi.org/10.5772/intechopen.99156*

capable of keeping up with big data analysis and interpretation. Data sharing repositories such as GISAID (EpiCoV), NCBI Virus/GenBank, COG-UK, ENA provide free access to sequencing data (**Figure 1**). Bioinformatics workspaces (e.g., Terra [51], UshER [52], NextStrain [53, 54], PANGO Lineages [55], CoV-Glue [56], see **Figure 1**) allow users to assign their own sequences to globally circulating lineages [57].

Genomics researchers who track phylogenetic dynamics of SARS-CoV-2 developed several schemes to describe the accumulation of mutations and detectable divisions within circulating SARS-CoV-2 populations (**Figure 2**). Three popular clade naming conventions are used:

1.PANGO Lineages [55];

2.Clades by NextStrain [58];

3.Clades by GISAID [59].

The PANGO Lineages nomenclature refers to corresponding outbreaks caused by distinct lineages (**Figure 2A**). Lineage A is suggested to be ancestral because it shares more nucleotides similarities with related coronaviruses in animals [60]. Lineages B presumably originated from United Kingdom, and lineages P – from Brazil [61, 62]. The Clades by NextStrain and GISAID nomenclature refers to the divergence of newly emergent clades (see **Figure 2B** and **C**) [57, 59, 63]. A formal naming system for SARS-CoV-2 evolutionary lineages has not been universally adopted. The nomenclatures from PANGO Lineages and GISAID are updated on CDC and WHO web [64, 65]. Novel phylodynamic models for visualization of phylogenomic datasets have been employed in different web formats. For example, Auspice, Augur, NextClade are open-source interactive web tools for visualizing phylogenomic data within NextStrain interface [66]. UShER (Ultrafast Sample placement on Existing tRee) is available in UCSC, and the virus phylogeny is a part of T-BioInfo platform [52]. Novel human coronavirus lineages are now regularly reported, and lineage nomenclature is continually updated [64]. The WHO Virus Evolution Working Group has been working on simplifying the nomenclature, for SARS-CoV-2 variants of interests (VOI) and variants of

#### **Figure 1.**

*Diagnostic applications for RNA-Seq data analyses in human diseases caused by RNA viruses. The upper boxes represent bioinformatics workspaces for data analysis as described in the text below. The lower boxes represent major public repositories/databases commonly used for collecting and sharing genome sequence data (see abbreviations list). Note: Viral sequencing data are deposited in GISAID and NCBI virus; the ENA (EMBL-EBI), GEO, and SRA (NCBI) contain sequences from all domains of life, including human specimens). Double-headed arrows represent interactive nature between bioinformatics tools and data repositories, which allows for variants identification and gene expression assessment.*

#### **Figure 2.**

*Genomic epidemiology of novel severe acute respiratory syndrome coronavirus 2 (global subsampling). 2A. PANGO lineages. 2B. Clades by NextStrain. 2C. Clades by GISAID. These phylogenetic trees show timeresolved visualization of evolutionary relationships between SARS-CoV-2 viruses from the ongoing COVID-19 pandemic. Exported from https://nextstrain.org/ncov/global. The dots on the trees' branches are color-coded in accordance with phylogeny classification by (2A) PANGO lineages; (2B) by NextStrain clades; (2C) clades by GISAID. Figure legends on the left represent lineages or clades names. Accessed on Apr 20th.*

concern (VOC). This group has recommended easy-to-pronounce variant naming based on the Greek alphabet (e.g. Alpha, Beta, Gamma, Delta - for VOC; and Epsilon thought Lambda – for VOI) [67].

Genomic epidemiology is a novel discipline that assesses viral genomic diversification, including the acquisition of pathogenic mutations, and molecular dynamics

#### *Diagnostic Applications for RNA-Seq Technology and Transcriptome Analyses in Human… DOI: http://dx.doi.org/10.5772/intechopen.99156*

of a pandemic, at the population level. Real-time viral sequencing helps delineate the origin of imported cases. For example, the assessment of WGS data (in several European countries) enabled a more precise understanding of SARS-CoV-2 transmission patterns, during various phases of the current pandemic. Molecular epidemiology data aided in identifying multiple introduction events, from the community spread in the Netherlands, Brazil, and several other countries [68–71]. WGS analysis leveraged with phylogenetic methodology approximated the source of lineage exportation. For example, the first outbreak of COVID-19 in Taiwan was imported from China. However, the second was from Italy, as strains that caused the second outbreak were phylogenetically more related to GISAID Accession IDs from the Italian outbreak [72]. As determined by the neighbor-joining method using MEGAX software, the circulating Moroccan lineage is more closely related to the South African lineage (20H/501Y.V2). This is not surprising given that travel within the continent is more predominant than travel between continents [73, 74]. In the United States, epidemiological investigation of sequencing data showed that SARS-CoV-2 varies at the single-nucleotide polymorphism (SNP) resolution, within various States. Additionally, SARS-CoV-2 genomic surveillance identified introduction and community spread of more transmissible strains in the states of California, Ohio, Washington and others (e.g., 20C > 20G clade switch) [75–79]. Genome sequencing in the New York City and Boston areas identified a cryptic spread of multiple simultaneous lineage introductions, from European or Asian travel-related exposures [80–82]. Using Global Epidemic and Mobility Model (GLEAM), Davis et al. estimated the time for the arrival and cryptic phase of the COVID-19 epidemic, which began in early to mid-January on both the East and West coasts of the USA [83]. During the cryptic phase of transmission, monophyletic clades of imported SARS-CoV-2 were spreading until air travel restrictions were dictated [83, 84].

US Center for Disease Controls and Prevention regularly updates the SARS-CoV-2 variant classification, adding novel mutations that have an impact on the immune response or virus virulence factor. Variants are generally classified as a variant of interest, a variant of concern, and a variant of high consequence [64]. Genome-wide sequencing analysis of samples, from various geographic locations, has revealed novel mutations dispersed throughout SARS-CoV-2 genome [85]. Weber et.al has recently detected a number of hotspot mutations, juxtaposing SARS-CoV-2 sequences from various geographic regions, which occurred *de novo*, during the dissemination of infection [60]. These differences in genome variation are not necessarily all pathogenic, but may be important in vaccine design and assessment of immunogenicity [86]. Molecular clock computations estimate that the average evolutionary rate for coronaviruses is roughly 10−4 nucleotide substitutions per site, per year, with point mutations that contribute to the diversification of global strains [87, 88]. Although most acquired genetic changes do not substantially affect virulence or transmissibility, those that do cause the corresponding phenotypic changes in SARS-CoV-2 - are of public health importance [67, 85].

#### **4. Advances in diagnostic sequencing methods**

Diagnostic sequencing methods have reached a state of capability, which enables the detection of low quantities of viral RNA, with greater sensitivity and specificity. Several amplicon construction approaches have been utilized: targeted capturebased, tiling hexamer-based, and standard amplicon-based [89–91]. A newly commercialized methodology that includes the hybridization capture method yielded near-complete genome coverage, even in samples with relatively low viral loads (cycle thresholds, Ct > 25) [92]. Several studies reported that 93 to 99% of genome

coverage is the samples with Ct values ranging from 26 to 32. The median on-target percentage of genome coverage dropped below 50%, in higher Ct value samples (Ct > 30) [93]. Public Health Laboratories have recently utilized capture-based methods followed by sequencing to evaluate if SARS-CoV-2 is present in wastewater samples. The goal of the study was the detection the meta-transcriptomic differences between SARS-CoV-2 variants, found in wastewater, and compare the differences to locally reported clinical genotypes [77]. An Illumina Flex for Enrichment kit and Illumina Respiratory Virus Oligo Panel were used to enrich the samples for virus cDNA amplicon.

The tiling amplicon-based approach has been adapted for SARS-CoV-2 sequencing. It is based on modified reverse transcription (RT with designed primer pool) and an overlapping long amplicon multiplex PCR strategy [94]. Primers can be designed in a random-hexamer-primed RT fashion, and a not-so-random-hexamer-primed fashion, to eliminate human rRNA during cDNA synthesis. The number of primers can reach up to 400, in metagenomics protocols, to improve genome coverage. However, less than 20 primers are used more widely for amplicon construction [95–97]. The tiling amplicon-based approach has been validated by many laboratories and adapted for nanopore sequencing technologies as well as Illumina and Sanger sequencing [97–99]*.* A Swift Biosciences' Normalase® tiling amplicon SARS-CoV-2 panel achieved 80% success of partial genome recovery from samples with a Ct value between 32 and 34, and 40% - for samples with a Ct value between 34 and 36 [100]. A 98 non-overlapping primer pair set have been designed by a group of scientists from ARTIC network in early March of 2020. This primer set has been modified on 3 occasions, reducing primer-dimers formation during RT-PCR (a major cause of coverage bias) [101]. Recently, the ARTIC's CoronaHiT platform (Coronavirus High Throughput) has been launched and provided accurate consensus SARS-CoV-2 genomes when at least 20x coverage is obtained. It is important to note that in the low-titer samples, with Ct > 32 (< 100 viral genome copies/uL), sequencing results became less reliable for all sequencing methods (long- and short- read) [96, 102–104].

#### **5. Assessment of host response by RNA-Seq**

The host-pathogen interactions have been studied by RNA-Seq in research settings. Transcriptome analysis of hosts' diseased tissues has been an integral part of the discovery of immune response biomarkers. University of Minnesota Division of Infectious Diseases (USA) leads intensive transcriptomic studies for discoveries of novel biomarkers of immune reconstitution inflammatory syndrome (IRIS) in the HIV-infected population [105, 106]. IRIS presents as an exaggerated immune reaction (in a form of cytokine release syndrome) that occurs in immunocompromised patients who have commenced antiretroviral treatments (ART) [106]. Several predictive and diagnostic biomarkers have already been identified in peripheral blood, utilizing Galaxy, JMP Genomics, or Partek Flow software (**Figure 1**). For example, the low expression of type I interferon, interferon-response genes, and components of antiviral defense pathways were identified as a risk factor for subsequent occurrence of nonfatal forms of IRIS [107, 108]. More recently, transcriptomic predictors of fatal IRIS have been delineated. These include the overexpressed genes that encode numerous proinflammatory cytokines (e.g., IL6), chemokines, stress response kinase signaling and production of reactive oxygen species in monocytes pathways [109]. Interestingly, the deficiency in innate antiviral defense genes, and poly-immuno-cytopenia have also been identified as significant contributors to subsequent cytokine release syndrome during COVID-19 [110–112]. The reduced IFN I and IFN III type gene expression and elevated monocyte- and granulocyte- derived chemokines at an early stage of

*Diagnostic Applications for RNA-Seq Technology and Transcriptome Analyses in Human… DOI: http://dx.doi.org/10.5772/intechopen.99156*

COVID-19 have been proposed as predictors of subsequent cytokine release syndrome and disease severity [113].

In a small percentage of the human population, the SARS-CoV-2 infection results in a critical illness that begins with uncontrolled overexpression of proinflammatory cytokines (the so-called "cytokine storm") [114]. A cytokine storm may lead to pathophysiological events such as activation of the coagulation cascades, systemic oxidative stress, and various forms of cell death [115]. Clinically, these events manifest as an acute respiratory distress syndrome (ARDS), or death from multiple organ failure [116, 117]. SARS-CoV-2 is recognized by innate immune cells and, in cases of ARDS, elicits highly abundant transcriptional gene expression, which is somewhat comparable to that of immunocompromised patients who have developed fatal IRIS [109, 118, 119]. The upregulation of inflammasome, Toll-like receptor signaling, HMGB1, and oxidative stress response transcripts, have been recently demonstrated as biomarkers of fatal immune reconstitution inflammatory syndrome [109]. Additionally, NF-kB-associated inflammatory genes and strong neutrophile activation/degranulation signatures differentiate fatal IRIS events from those of survivors. Likewise, fatal ARDS in COVID-19 patients have been characterized by maladapted hyperactivation of the same innate inflammatory pathways described for fatal IRIS [120–123]. Systemic upregulation of cytokines TNFa, IL6, IL1B, and NLRP3 are noteworthy of mention as they are targetable molecules for drugs such as adalimumab, tocilizumab, anakinra, and RAPA-501-Allo, respectively [124–127].

SARS-CoV-2 has the propensity to selectively induce mortality in elderly population (over 65 years of age) and in immunocompromised individuals [128, 129]. The most reasonable explanation for this propensity is the depletion of the thymic population, and the inability to mount adaptive T cell immune responses in timely manner [130]. Innate immune cells become the first line of defense against SARS-CoV-2, however in the absence of proper regulatory feedback – are thrown into an uncontrolled proinflammatory loop. Similarly, the irreparable thymic damage and a waning adaptive immunity in AIDS patients enable the sustained HIV replication, which may explain why serious complications such as fatal IRIS are hinged on responses driven by innate immune cells.

There have been isolated cases of SARS-CoV-2 reinfection in individuals who had no known immune disorders [131–133]. Since SARS-CoV-2 has not diversified significantly, these patients have been re-infected with variants from the same or adjacent clade [132]. Studies have revealed a number of SARS-CoV-2 epitopes are recognized by CD8+ T cells in COVID-19 convalescent subjects (as determined by TruSight HLA sequencing panel) [134]. It is hypothesized that re-infected individuals are unable to mount a sufficient response to primary infection, which result in secondary COVID-19 manifestation. Many of these patients had tested positive for anti- SARS-CoV-2 antibodies after the reinfection, but it is unknown whether they developed antibodies after the first infection at all. It has been hypothesized that susceptibility to re-infection is genetically driven [135]. Several hypothetical models had been proposed [136]. Thus, simultaneous RNA-seq of host and pathogen during viral infection would be helpful to characterize the molecular responses in case of COVID-19 reinfection.

There are now several vaccine types that are being used, in an attempt to reach herd immunity, in various countries [137]. The most successful vaccines are mRNAbased, which have an estimated 92% efficacy [138, 139]. Still, there is an accumulating body of evidence of the so-called 'vaccine breakthrough' phenomenon, when SARS-CoV-2 infection occurs in persons who are fully vaccinated [140, 141]. The immune gene variants that contribute to poor T cell memory and undelaying vaccine breakthroughs are not yet known [142]. As vaccination increases, RNA-Seq

may become a useful tool in exploring the consequence of vaccine dose-spearing strategies, the emergence of novel vaccine-escape variants, intra- host variant diversity, as well as in understanding the phenomenon of vaccine breakthrough.
