**5. The present and future of HLA typing**

This review focuses mainly on second- and third-generation sequencers, which have made their way into many aspects of personalized medicine, genetic disease research and clinical diagnostics, offering reduced hands-on time, higher throughput, higher sensitivity and a lower cost per base than Sanger sequencing and other techniques.

Transplantation success depends on many factors. One of them is the sequence similarity between donor and recipient of the graft at certain genes, mainly those of the MHC genetic loci, as well as others (minor histocompatibility complex; KIR, MIC-A and MIC-B). Characterizing these genomic sequences in both persons before HSCT is of great importance for selecting the most appropriate transplant, in order to avoid GVHD, enhance the engraftment rate and support the GVT effect [38–40].

#### **5.1. The second generation of sequencers (typically named next-generation sequencers; NGS)**

When it comes to HLA typing, NGS technology overcomes many of the drawbacks of older techniques.

First, it allows the phase of linked polymorphisms to be set within the amplicons produced during the first steps of the technique; that is, it helps determine to which of the two alleles each identified group of variants belongs. In heterozygous samples this was a major concern with older techniques, because the detected polymorphisms were compatible with two or more different allele combinations that produced identical consensus sequences.
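
The idea of read-based phasing can be illustrated with a minimal sketch. It assumes a toy input format that is not part of any real pipeline: ungapped reads given as hypothetical `(start, sequence)` tuples on a common reference. Reads that span both heterozygous positions show which bases sit on the same molecule, and the two best-supported base pairs are taken as the two haplotypes.

```python
from collections import Counter

def phase_two_sites(reads, pos_a, pos_b):
    """Count which bases co-occur on the same read at two variant positions.

    `reads` is a list of (start, sequence) tuples, where `start` is the
    0-based reference position of the first base. This toy format (no gaps,
    no indels) is an assumption made for illustration only.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        # Only reads spanning both positions carry phase information.
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    # The two best-supported base pairs are taken as the two haplotypes.
    return [pair for pair, _ in pairs.most_common(2)]

reads = [
    (0, "ACGTTA"),  # A at position 0 and A at position 5 on one molecule
    (0, "ACGTTA"),
    (0, "GCGTTC"),  # G at position 0 and C at position 5 on the other
    (0, "GCGTTC"),
    (2, "GTTA"),    # does not span position 0, so it is ignored
]
haplotypes = phase_two_sites(reads, 0, 5)
print(haplotypes)
```

A consensus-only method would see A/G at the first site and A/C at the second without knowing whether the sample is A…A plus G…C or A…C plus G…A; the per-read counts resolve this.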

It also allows a large number of sequences to be determined in a single reaction, so that many exonic and important intronic sequences can be analyzed simultaneously. Because the expression levels of HLA genes are also very important, polymorphisms must be detected not only in exonic sequences but also within regulatory intronic regions [29, 40].

Another advance shared by all NGS systems, compared with older techniques, is the higher coverage they provide, which leads to increased accuracy, as previously described. Coverage refers to the number of times each nucleotide position is read and subsequently aligned successfully to a reference genome across all sequencing runs. The higher the coverage of a position, the higher the confidence of base calling [38].
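
As a minimal sketch of this definition, per-position coverage can be computed from the intervals that aligned reads occupy on the reference. The half-open `(start, end)` interval format and the function name are assumptions made for illustration; real pipelines derive these intervals from alignment files.

```python
def per_base_coverage(alignments, ref_length):
    """Return the number of aligned reads covering each reference position.

    `alignments` is a list of half-open (start, end) intervals in 0-based
    reference coordinates (an assumed toy format).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    depth = 0
    for i in range(ref_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage

alignments = [(0, 5), (2, 8), (4, 10)]
coverage = per_base_coverage(alignments, 10)
mean_depth = sum(coverage) / len(coverage)
print(coverage, mean_depth)
```

Position 4 is covered by all three reads, whereas both ends of the reference are covered by one read only, so base calls there rest on less evidence.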

Many companies are racing to build the fastest, least expensive and most accurate sequencing system, together with user-friendly analysis pipelines, since the huge amount of data generated by these machines requires extensive bioinformatics knowledge. Algorithms that deal with a variety of data analysis issues have been developed during the past few years. Many of these are summarized in the publications of Szolek et al. [41] and Hosomichi et al. [16], although new algorithms are continuously being released with the aim of simplifying data analysis and making it more precise, so that it can be implemented in the everyday practice of clinical laboratories and biomedical research [42].

NGS, or second-generation sequencing, technologies comprise various strategies that rely on a combination of template preparation, sequencing and imaging, followed by in silico genome alignment and assembly. Of these, the sequencers most widely used for HLA typing are those of Illumina (MiSeq) and Roche/454 (454 GS FLX Titanium/454 GS Junior) [16].

#### *5.1.1. Library preparation*

#### *5.1.1.1. PCR based*

The Roche/454 and Illumina NGS workflows usually begin with a fragmentation step, in which genomic DNA (gDNA) is digested into smaller fragments, followed by PCR amplification of the DNA samples. Primers that bind to specific gDNA sequences are designed, and the region of interest between them is enzymatically amplified.

This order of events may be reversed: long-range PCR, with suitable enzymes and primers, may precede the fragmentation step.

In addition to the gDNA-complementary sequence, each primer includes a number of additional nucleotides. These mainly comprise the system-specific adapter sequences and the multiplex identifier tags (MIDs). The adapters help the amplicons bind to a solid surface (either a bead or a slide) and provide a universal priming site for sequencing primers. MIDs identify the individual sample to which each generated and sequenced amplicon belongs. This is particularly useful for separating the reads of different samples (demultiplexing) when more than one sample is prepared and pooled for sequencing. This barcoding method is called "amplicon sequencing."
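
Demultiplexing by MID can be sketched as follows. The sketch assumes, for illustration only, that every MID has the same length, sits at the very start of the read and must match exactly; the tag sequences and sample names are hypothetical, and real tools also tolerate sequencing errors in the tag.

```python
def demultiplex(reads, mid_to_sample):
    """Assign pooled reads to samples by their leading MID tag.

    Assumes equal-length MIDs at the 5' end and exact matching
    (simplifications made for this sketch).
    """
    mid_len = len(next(iter(mid_to_sample)))
    by_sample = {sample: [] for sample in mid_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:mid_len], read[mid_len:]
        sample = mid_to_sample.get(tag)
        if sample is None:
            unassigned.append(read)  # unknown or miscalled tag
        else:
            by_sample[sample].append(insert)  # store the read without its tag
    return by_sample, unassigned

mids = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical tags
pooled = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTGGGTTA", "NNNNGGGTTT"]
by_sample, unassigned = demultiplex(pooled, mids)
print(by_sample, unassigned)
```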

A variation of this technique, known as "shotgun sequencing," uses simple primers; the adapters and MIDs are ligated to the amplified sequences after the PCR and before sample pooling [16, 40].

#### *5.1.1.2. Hybridization based (target enrichment)*

Target enrichment uses biotin-labeled DNA or RNA oligonucleotide probes (55–120 bp) that hybridize to their complementary target regions in previously fragmented gDNA. Streptavidin-coated magnetic beads are used to capture the probe/DNA hybrids. PCR is then applied to amplify the captured gDNA fragments [16, 40].

#### *5.1.2. Clonal amplification and cyclic-array sequencing*

#### *5.1.2.1. Roche/454 (454 GS FLX Titanium/454 GS Junior) sequencing-by-synthesis; single-nucleotide addition (SNA)*

Once the library is ready, the second step toward sequencing on Roche/454 instruments is an "emulsion PCR" (emPCR) for clonal amplification of the amplicons already produced. During emPCR, the DNA is converted into single strands under alkaline conditions and captured on beads, with a single unique single-stranded molecule per bead. The beads are then mixed with oil and aqueous buffer to create an emulsion of droplets, inside which clonal PCR amplification takes place.

Once this step is complete, the beads carrying the amplicons are loaded into the wells of a PicoTiter Plate (PTP) for the pyrosequencing reaction. During pyrosequencing, only one of the four nucleotides (dNTPs: dATPαS, dTTP, dGTP, dCTP) is added to the PTP in each round. The reaction relies on a series of enzymes and substrates (DNA polymerase, ATP sulfurylase, luciferase, luciferin and adenosine 5' phosphosulfate; APS). Inorganic pyrophosphate (PPi) is released only when the added dNTP is incorporated. ATP sulfurylase converts the PPi, in the presence of APS, into ATP, which drives the luciferase-mediated conversion of luciferin into oxyluciferin, emitting visible light.

The emitted light is recorded by a charge-coupled device (CCD) camera and translated by software into one peak per incorporation event. More than one nucleotide may be incorporated per cycle in the presence of homopolymeric sequences (consecutive runs of the same base). When this happens, the amount of light emitted is proportional to the number of nucleotides added, resulting in a proportionally higher peak.

After each round, the unincorporated nucleotides are degraded by apyrase. The dNTPs are then again added one at a time to the reaction system, and another pyrosequencing round is performed.
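
The relation between nucleotide flows and peak heights can be sketched with an idealized model. The cyclic flow order `TACG` and the noise-free signal (one light unit per incorporated base) are assumptions made for illustration; they are not taken from the instrument documentation.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order of dNTP additions (illustrative)

def simulate_flowgram(synthesized, n_flows):
    """Return the ideal signal per flow: the number of bases incorporated.

    `synthesized` is the strand being built, written 5'->3'.
    """
    signals = []
    i = 0
    for flow in range(n_flows):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        run = 0
        # A homopolymer run is incorporated entirely within one flow.
        while i < len(synthesized) and synthesized[i] == base:
            run += 1
            i += 1
        signals.append(run)
    return signals

def decode_flowgram(signals):
    """Turn (possibly noisy) signals back into bases by rounding each peak."""
    bases = []
    for flow, signal in enumerate(signals):
        bases.append(FLOW_ORDER[flow % len(FLOW_ORDER)] * round(signal))
    return "".join(bases)

ideal = simulate_flowgram("TTAGGGC", 8)
noisy = [2.1, 0.9, 0.1, 2.8, 0.0, 0.1, 1.2, 0.0]
print(ideal, decode_flowgram(noisy))
```

Because a run of identical bases is read from the height of a single peak, a small error in a large signal (for example 6.4 versus 6.6) changes the called run length, which is one way to picture the insertion and deletion errors seen in long homopolymers.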

Many studies using the Roche NGS platforms GS FLX Titanium and GS Junior for HLA typing have been conducted so far. The comparative advantages of this technology over others of its kind are the long sequence reads (from around 400 bp up to a maximum of 1000 bp), which capture critical phase information for nearby DNA variants, and the speed with which a complete run is performed (10–24 h, depending on the machine used).

The 454 GS FLX Titanium can provide up to 1 million (M) reads per run, depending on the sequencing protocol, while the benchtop 454 GS Junior provides around 0.1 M reads.

Despite the advantages of long reads and speed, the technique has several inherent disadvantages. These include the high cost of pyrosequencing reagents, a high error rate in homopolymers (typically those longer than six bases) and emPCR, the last being a demanding reaction whose semi-automation could reduce hands-on labor. Insertions are the most common error type, followed by deletions [16, 29, 34, 35, 38–40, 43–45].

#### *5.1.2.2. Illumina (MiSeq and MiniSeq) sequencing-by-synthesis; cycling reversible terminator (CRT)*

Illumina uses a different sequencing-by-synthesis approach called CRT. For clonal amplification, instead of emPCR, this system uses a glass slide with lanes (flowcell). A dense lawn of primers complementary to the adapter sequences of the fragmented DNA amplicons is already attached to the slide, where sequencing is later performed. Through a process called clustering, each fragment is isothermally amplified by bridge amplification to create a cluster of clonally amplified fragments.

After amplification, sequencing begins with the binding of the first sequencing primer to each fragment of every cluster and its subsequent extension to produce the first read. All four dNTPs are fluorescently tagged and compete with each other for addition to the growing chain. Once a nucleotide complementary to the template is incorporated, a washing step removes all unbound nucleotides, and a light signal of characteristic wavelength and intensity is emitted. This signal, which differs among the four dNTPs, is captured by a CCD camera and recorded by the computer [40].

The fluorescent dye of each incorporated nucleotide must be cleaved before the next cycle can begin, because the reversible terminator chemistry does not allow further nucleotides to be added to the extending strand. Once the light signal of the incorporated nucleotide has been emitted and recorded by the camera, the dye is removed, and after an additional washing step the next cycle is ready to begin. The read length depends on the number of sequencing cycles, which is set in advance by the user [38].

Illumina instruments are short-read sequencers, in contrast to those of Roche/454. They provide read lengths from as low as 25 bp up to 300 bp, with many intermediate options. The MiSeq and the more recent MiniSeq benchtop instrument both offer an option of 44–50 M paired-end (PE) reads, more than enough for HLA typing of many samples in the same run, and are competitive with the GS machines in terms of runtime (13–24 h) (Goodwin\_2016). PE reads denote that two distinct sequencing reads are performed, one from each end of the template DNA fragments.

The CRT sequencing method overcomes the disadvantages of SNA by incorporating only a single nucleotide at a time; however, as the sequencing reaction proceeds, the error rate increases. This is due to incomplete removal of the fluorescent signal from cycle to cycle, which leads to higher background noise. Sequencing errors accumulate toward the end of the read; thus longer reads, which can be trimmed, are preferred over shorter ones. Longer reads are also preferred because they map more precisely to the reference genome.
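
Trimming the error-prone end of a read can be sketched as follows. The sketch assumes that per-base qualities are available as a Phred+33 encoded string, as in FASTQ files, and uses a deliberately simple rule (drop trailing bases below a threshold); the function name and the default threshold of 20 are illustrative choices, and dedicated trimming tools use more refined criteria.

```python
def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose Phred quality is below `min_q`.

    `qual` is assumed to be a Phred+33 encoded quality string of the same
    length as `seq`.
    """
    end = len(seq)
    # Walk back from the 3' end while the quality stays below the threshold.
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
trimmed = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(trimmed)
```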

The chemistry of Illumina analyzers is also more prone to substitution errors than to InDel errors, especially when the previously incorporated nucleotide is guanine (G).

Nevertheless, the vast number of reads can provide increased sequencing depth, which tends to compensate for the error rate inherent in this technology [38, 40].

#### *5.1.2.3. Thermo Fisher (Ion PGM)*

The Thermo Fisher Ion instruments use a pH-mediated detection method. The sequencing reaction is the previously described sequencing-by-synthesis SNA approach, but the detection of the incorporated nucleotides is substantially different.

The addition of a new dNTP to the extending DNA strand involves the formation of a covalent bond and the release of pyrophosphate and a positively charged hydrogen ion (proton). The resulting shift in pH is detected by an ion-sensitive layer with a sensor at the bottom of each microwell of the semiconductor chip on which sequencing takes place.

Different sequencing chips, with increasing numbers of wells, allow different strategies to be applied. The read length ranges from 200 to 400 bp and, depending on the chip, as few as 0.4 M and as many as 5.5 M reads can be obtained from a PGM run.

The breakthrough of this technique is that it requires no optical devices, which contribute to increased base-calling errors, lower speed, higher cost and larger instrument size. The use of unmodified nucleotides also avoids the potential biases that arise from incorporating modified ones. Another positive characteristic is the runtime of 2 to 7 h, shorter than that of competing platforms.

However, the drawbacks of the SNA sequencing-by-synthesis method discussed previously also apply here: a higher InDel error rate and difficulties in sequencing homopolymer regions (>6 bp) [40, 43, 46].

#### **5.2. The third generation of sequencers (single-molecule sequencing)**

While the development and optimization of second-generation sequencers is still ongoing, the third generation, which analyzes single-molecule templates without the need for DNA preamplification, has already entered the field.

These instruments promise an even lower cost per base, easier sample preparation from smaller amounts of starting gDNA, significantly faster run times, simplified primary data analysis and longer read lengths (hundreds of base pairs and more).

Longer reads simplify sequence assembly and facilitate polymorphism analysis and complete haplotype phasing, both especially important for accurate HLA typing and clarification of phase ambiguities [16].

In addition, eliminating the PCR step avoids potential biases arising from AT-rich and GC-rich target sequences, prevents the introduction of spurious variants caused by PCR amplification errors and reduces template preparation time [38, 43].

Two platforms stand out among them: Pacific Biosciences (RS II) and Oxford Nanopore (MinION) [47].

#### *5.2.1. Pacific Biosciences (RS II); single-molecule real-time (SMRT) sequencing*

The Pacific Biosciences (PacBio) instruments use a flowcell with many individual wells with transparent bottoms (zero-mode waveguide wells; ZMW), each holding 20 zeptoliters (1 zeptoliter = 10⁻²¹ L).

SMRT technology uses short single-stranded (ss) hairpin adaptors (SMRTbell adaptors) that are ligated to the ends of the DNA fragments. This results in ssDNA regions at the ends and double-stranded DNA (dsDNA) regions in the middle of the fragments.

Size selection follows, to retain fragments of the preferred length, from less than 3 kb up to around 20 kb, depending on the purpose of the experiment.

A single phi29 DNA polymerase molecule anchored to the bottom of each well binds a single DNA molecule and starts copying it. The labeled dNTPs are incorporated one at a time. Upon binding, the fluorophore emits light that is visualized with a laser and a camera; the dye is then cleaved, and the polymerase can incorporate the next labeled dNTP. Each color change captured by the camera in an individual well corresponds to a different dNTP added to the growing strand.

Each template is sequenced multiple times in a circular fashion. These multiple passes are used to generate a consensus read of insert, known as circular consensus sequence (CCS) [17, 38, 46].

Generally, the runtime and throughput of the instrument can be tuned by the user. Longer templates require longer times in order to extract consistent results.

PacBio template preparation takes 4–6 h, much less than the corresponding procedure for second-generation sequencers. In addition, there is no need for a PCR step, as previously described, which reduces biases and errors. The turnaround time is also reduced, with runs on RS II instruments finishing within 4 h. The average read length is 10–15 kb (20,000 bp), longer than that of any second-generation sequencer [43, 46].

The main drawback of this technique is the high error rate, caused by the short interval between two consecutive nucleotide incorporation events. Most errors occur stochastically, without any systematic bias; thus sequencing each nucleotide many times through repeated circular passes results in higher coverage and improved accuracy, up to 99% [38].
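
Why repeated passes help when errors are random can be illustrated with a minimal majority-vote sketch. It assumes, purely for illustration, that the passes over one insert are already aligned, have equal length and contain substitution errors only; actual circular consensus calling must also handle insertions and deletions and is considerably more involved.

```python
from collections import Counter

def majority_consensus(passes):
    """Build a consensus by majority vote at each aligned position.

    Assumes pre-aligned, equal-length passes with substitution errors only
    (a simplification made for this sketch).
    """
    if len({len(p) for p in passes}) != 1:
        raise ValueError("all passes must have the same length")
    # zip(*passes) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )

# Five passes over the same insert; three of them carry one random error.
passes = ["ACGTAC", "ACCTAC", "ACGTAA", "TCGTAC", "ACGTAC"]
consensus = majority_consensus(passes)
print(consensus)
```

Since the errors fall at different positions in different passes, each one is outvoted by the correct base in the remaining passes.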

#### *5.2.2. Oxford Nanopore (MinION)*

MinION is a third-generation sequencer from Oxford Nanopore that uses a tiny biological pore of nanoscale diameter with an attached exonuclease.

The dsDNA fragments are fitted with two adapters, a leader and a hairpin, one at each end. The hairpin adapter holds the two strands together in an ssDNA conformation, while the leader adapter directs the DNA through the exonuclease, which cleaves each base and guides it through the pore.

The concept is that the ssDNA molecule passing through the α-hemolysin pore (αHL) disrupts the continuous ionic flow applied across αHL. The disruption is detected by standard electrophysiological techniques. The current modulation differs for each of the nucleotides that go through the pore, a property that helps discriminate between them. The ionic current resumes after each trapped nucleotide squeezes out of the pore.

This type of sequencing needs no polymerase enzyme, no DNA polymerization with incorporation of nucleotides and no detection of pH changes. Because all it requires is a molecule of ssDNA and two suitable adapters, one at each end, that help guide it through the exonuclease and the pore, the cost of sequencing is substantially lowered.

This way of sequencing is also free of fluorescent tags and therefore shares the advantages described above for Thermo Fisher Scientific's Ion technology, although the two differ in concept. Moreover, avoiding enzymes such as polymerase makes Nanopore sequencing more reliable, as it is less sensitive to temperature changes during sequencing.

A drawback of this technique is its high error rate of up to 30%, mainly in InDel detection. This is due to the ability of the technique to detect more than 1000 different signals arising from variation in the nucleotides passing through the pore, especially when the modified bases present in native DNA are taken into account. Homopolymers are also difficult to resolve, for the same reason [43, 46].
