346 Next Generation Sequencing - Advances, Applications and Challenges

**2. Haplotype terminology**

A review of current literature reveals a staggering collection of terms synonymous with haplotypes, as listed in Table 1.

| Term |
|---|
| Ancestral haplotypes |
| Conserved extended haplotypes |
| Linkage disequilibrium haplotypes |
| Linkage groups |
| Hapmaps |
| Haplogroup |
| Haplobanks |
| Haploblocks |
| Haplotype block |

**Table 1.** Terminology

Even if it were possible to define the various neologisms, it seems certain that confusion will remain until there is recognition of the conceptual background.

We introduced the term *ancestral haplotypes* to emphasise the persistence of the founding pool [3, 4]. Such haplotypes are conserved over thousands of generations; they allow identification of remote ancestors and their contributions to the creation of individual members of the species with their diseases. Unfortunately, others use the same term in different ways and even in the opposite sense, that is, to refer to *the single* original haplotype which is presumed to have mutated to give rise to all the so-called variants now present. Indeed, as just one example of the problem, the reader has to be able to interpret the following: "we identified all nonredundant haplotypes with a frequency of ≥10% and consisting of at least 10 SNPs, which are likely to represent the nonrecombinant descendants from a single ancestor" [5].

To yet further confound matters, increasingly, the term *haplotype* is being used to describe any combination of alleles or markers, such as SNPs, without regard to their reproducibility, inheritance, polymorphism or biological significance. Currently, there are conflicting methods of detection. The problems appear to be increasing as ephemeral concepts diverge and as claims for better approaches focus on just one or another competing technology or bioinformatic package.

Several other aspects are clear.

**•** Linkage groups relate to closely linked loci but do not define haplotypes.

**•** Trios can be misleading since the coverage of the family is limited.

**•** Linkage disequilibrium is affected by relative frequencies and therefore fails to detect rare haplotypes.

**•** Haplobanks. The Tokunaga group has established some important principles with the intention of establishing haplotype-matched pluripotential stem cell banks [6]. Unfortu‐

**3. Definitions and concepts**

In the presequencing era, there was a clear understanding of what was meant by the term *haplotype*: combinations of alleles at different loci segregating together in multigenerational family studies [8]. Some seem unaware of this long history and have had to rediscover the concept [2].

The implications were apparent at least 50 years ago: a specific allele A1 at locus A is inherited together with a specific allele B1 at an adjacent, "closely linked" locus B [9]. The fact that these two alleles segregated together through multiple generations was unexpected and led to controversy but, in retrospect, clearly implied that


The repeated cosegregation of alleles came to be known as a haplotype: from ἁπλοῦς = single [9].
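The logic of cosegregation can be illustrated with a minimal sketch. It assumes a single parent who is heterozygous at two loci and whose transmitted alleles are known unambiguously for each child; the function name `infer_phase` and the allele labels are hypothetical and serve only as an illustration, not as the procedure used in the family studies cited above.

```python
from collections import Counter


def infer_phase(parent_a, parent_b, transmissions):
    """Infer which alleles of a doubly heterozygous parent travel together.

    parent_a: the parent's two alleles at locus A, e.g. ("A1", "A2")
    parent_b: the parent's two alleles at locus B, e.g. ("B1", "B2")
    transmissions: one (allele at A, allele at B) pair per child,
                   as transmitted by this parent
    Returns None when the children do not favour either arrangement.
    """
    a1, a2 = parent_a
    b1, b2 = parent_b
    counts = Counter(transmissions)

    # Children supporting the arrangement A1-B1 / A2-B2
    first = counts[(a1, b1)] + counts[(a2, b2)]
    # Children supporting the arrangement A1-B2 / A2-B1
    second = counts[(a1, b2)] + counts[(a2, b1)]

    if first == second:
        return None
    if first > second:
        return {"haplotypes": [(a1, b1), (a2, b2)],
                "apparent_recombinants": second}
    return {"haplotypes": [(a1, b2), (a2, b1)],
            "apparent_recombinants": first}


# Hypothetical family: A1 is always passed on together with B1.
children = [("A1", "B1"), ("A2", "B2"), ("A1", "B1"), ("A1", "B1")]
print(infer_phase(("A1", "A2"), ("B1", "B2"), children))
```

The more generations and children are available, the more informative such counts become, which is one reason multigenerational families are preferred over unphased population samples.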

It is worth emphasizing that it was the cosegregation as haplotypes through "phased" multigenerational families (rather than "unphased" populations) which foretold the later demonstration that there was a continuous haplospecific sequence. It is also pertinent, with the benefit of hindsight and in view of recent confusion, that the haplotypes, defined in one family, occurred in other families of similar remote ancestry raising the radical possibility of conservation beyond that expected from close linkage alone. In other words, recombination is patchy and does not necessarily disperse the components of duplications, even after thousands of meioses. The issue of linkage disequilibrium and the limits of LD mapping are considered below.

The implications of haplotypes, as listed above, became even clearer as the HLA A and HLA B locus alleles and then HLA DR alleles were defined during the 1970s. However, in this case, the loci were widely separated. Over time, it became clear that each of the A-B and B-DR haplotypes was some 800 kb in length. Patently, close linkage could not explain these haplotypes; either there was selection for *cis* interaction or there was suppression of recombination [3, 4].

Through their studies of diseases, the Alper–Yunis group discovered that the B-DR haplotypes contained specific alleles at duplicated loci which had no structural or functional relevance to HLA (i.e. complement and 21 hydroxylase loci) but which happen to be located within the major histocompatibility complex [10–16]. Thus, *cis* interaction alone could be rejected as the sole explanation.

The importance of discovery through disease was illustrated at a meeting held in 1982 [3, 4]. As shown in Table 2, it was disease associations which allowed the initial discovery of ancestral haplotypes; note, these three disease-associated haplotypes could have only been discovered through their associations. Two share DR3 and two share B18 but the frequencies differ. Thus, the three haplotypes cannot be detected by linkage disequilibrium.
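The dependence of linkage disequilibrium on allele frequencies can be made concrete with the standard pairwise statistics D = p(AB) - p(A)p(B), D' and r². The following is a minimal sketch under stated assumptions: the function name `ld_stats` and all frequencies in the usage lines are hypothetical, chosen only for illustration, and are not taken from refs. [3, 4].

```python
def ld_stats(p_ab, p_a, p_b):
    """Pairwise linkage disequilibrium between allele A1 and allele B1.

    p_ab: frequency of haplotypes carrying both A1 and B1
    p_a:  frequency of allele A1
    p_b:  frequency of allele B1
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")

    # D: observed haplotype frequency minus the frequency expected
    # if the two alleles were combined at random
    d = p_ab - p_a * p_b

    # D': D scaled by its maximum possible magnitude for these frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the two alleles
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Hypothetical frequencies: two alleles at 10% each.
print(ld_stats(0.08, 0.10, 0.10))  # common shared haplotype, D clearly positive
print(ld_stats(0.01, 0.10, 0.10))  # rare haplotype, D is approximately 0
```

In the second call the haplotype is present at 1%, yet D is essentially zero because 1% is exactly what random combination of the two alleles would give. A haplotype of that kind can cosegregate faithfully in families and still be invisible to a population measure of LD.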


| Designation | A | Cw | B | Bf | C2 | C4A | C4B | DR | Disease |
|---|---|---|---|---|---|---|---|---|---|
| 8.1 | 1 | 7 | 8 | S | C | Q0 | 1 | 3 | MG, SLE, IDDM |
| 18.2 | – | – | 18 | F1 | C | 3 | Q0 | 3 | IDDM |
| 18.1 | 25 | – | 18 | S | Q0 | 4 | 2 | 2 | C2 deficiency |

MG = myasthenia gravis, SLE = systemic lupus erythematosus, IDDM = insulin-dependent (type 1) diabetes mellitus.

Adapted from ref. [4]

**Table 2.** MHC haplotypes and disease associations

Once the numerous other ancestral haplotypes were defined, multigenerational family studies identified cosegregating combinations of multiple alleles at separated loci, i.e. haplotypes stretching over nearly 2 Mb from HLA A to DR. A haplotype was defined by the alleles "inherited *en bloc* from one parent and implies the transmission of all of the chromosomal segment" from one generation to the next [4].

When haplotypes defined in one family were compared with those identified in apparently unrelated families, sharing was immediately apparent. There were specific combinations of alleles at all the numerous unrelated loci as these were defined and typed. However, and increasingly relevant today, as summarized in refs. [3, 4, 17, 18]:

**1.** The combinations observed are *not* a simple function of allele frequencies; only some of the components inherited *en bloc* are in linkage disequilibrium.

**2.** Many haplotypes are rare combinations of frequent alleles at some loci but rare alleles at other loci.

**3.** Very few alleles are entirely haplospecific.

**4.** Haplotype frequencies are often less than 1%.

**5.** The same haplotypes are found in multiple, apparently unrelated, families.

**6.** Many of these nonrandom combinations are associated with a disease (such as systemic lupus erythematosus) or function (such as TNF production).

**7.** With a few dramatic exceptions (such as 21 hydroxylase and C2 deficiency carried by what we now call the 47.1 and 18.1 ancestral haplotypes), the individual alleles do not explain the haplospecific effects on disease and function.

Adapted from ref. [18].

**Figure 1.** Historic recombinations of AH 8.1. The HLA-B8 allele is carried by one ancestral haplotype marked by A1, Cw7, B8, BfS, C4AQ0, C4B1, DR3. All the haplotypes in data set 1 carrying HLA-B8 are represented. These haplotypes have been sorted so that haplotypes that carry all alleles of 8.1 from HLA-A to DR are shown at the top of the figure, followed by haplotypes that extend from HLA-B to DR. Telomeric recombinants are shown at the bottom. The boxed areas represent those portions of the 8.1 ancestral haplotype that are carried by unrelated B8-containing haplotypes. Vertical lines approximately indicate the region where historical recombination has occurred.

Some of these points are illustrated in Figure 1. It can be seen that subjects with B8 can be listed to show conservation but also historic recombinations between HLA A and B, between C4B and DR, and between HLA B and Bf.
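The sorting described for Figure 1 amounts to asking, for each B8-containing haplotype, how far the match with AH 8.1 extends on either side of HLA-B. A minimal sketch follows. The alleles of 8.1 are those listed in Table 2; the locus order is simply the column order of that table, and the function name `conserved_segment` and the recombinant example are hypothetical.

```python
# Locus order assumed to follow the column order of Table 2.
LOCI = ["A", "Cw", "B", "Bf", "C2", "C4A", "C4B", "DR"]

# Alleles of the 8.1 ancestral haplotype as listed in Table 2.
AH_8_1 = dict(zip(LOCI, ["1", "7", "8", "S", "C", "Q0", "1", "3"]))


def conserved_segment(haplotype, reference=AH_8_1, anchor="B", loci=LOCI):
    """Return the contiguous run of loci, containing the anchor locus,
    at which `haplotype` carries the same alleles as `reference`.

    An untyped (missing) locus is treated as a mismatch, so the run
    stops there. An empty list means the anchor allele itself differs.
    """
    i = loci.index(anchor)
    if haplotype.get(anchor) != reference[anchor]:
        return []

    lo = hi = i
    # Extend towards the start of the locus list while alleles match.
    while lo > 0 and haplotype.get(loci[lo - 1]) == reference[loci[lo - 1]]:
        lo -= 1
    # Extend towards the end of the locus list while alleles match.
    while hi < len(loci) - 1 and haplotype.get(loci[hi + 1]) == reference[loci[hi + 1]]:
        hi += 1
    return loci[lo:hi + 1]


# Hypothetical B8 haplotype that differs from 8.1 only at HLA-A.
recombinant = dict(AH_8_1, A="2")
print(conserved_segment(recombinant))
```

Sorting a set of haplotypes by the length and position of this run would group complete copies of the ancestral haplotype first and partial ones afterwards, with the boundaries of each run pointing to the interval in which a historic recombination is presumed to have occurred.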

By the mid-1990s, and long before the rediscoveries of the 2000s [2], such analyses led to the conclusion that there are polymorphic frozen blocks (PFB), as illustrated in Figure 2.

**Figure 2.** Ancestral haplotypes and polymorphic frozen blocks within the human major histocompatibility complex. Each ancestral haplotype has its own unique DNA sequence which includes single nucleotide polymorphisms (SNPs), copy number variations, segmental duplications, insertion and deletion events (indels) including retroviral and retroviral-like elements (RLEs). The full length is approximately 4 Mb. Higher degrees of diversity indicated by shading define polymorphic frozen blocks (PFB). Recombination occurs far more frequently between, rather than within, these blocks. Mutations within blocks are effectively suppressed. Adapted from refs. [17, 20] and [21]. Reproduced with permission from ref. [22].

PFB throughout the genome are the latter-day equivalents of loci. Sequences which define ancestral haplotypes are the equivalent of alleles. The diversity is multifactorial with contributions from reiterative speciation as follows [17]:

**•** Duplication

**•** Polymorphism

**•** Indels

**•** Retroviral integration

These elements all contribute to the haplospecificity of the sequence of ancestral haplotypes as shown in Figure 3. Similar distribution of diversity has been found by many others [5, 17, 19, 20, 23, 24]. The same patterns are also found in primates [25].

Adapted from ref. [26].

**Figure 3.** Sequence diversity is packaged as polymorphic frozen blocks (PFB). SNPs and indels occur in similar locations within PFB. (a) The SNP profile after removing indels. Peaks higher than 20 SNPs per 1000 nucleotides are truncated. (b) The location of indels. Peaks higher than six indels per 1000 nucleotides are truncated. (c) The position of indels greater than 100 nucleotides.
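A profile of the kind described in the caption can be sketched as counts of variant positions per 1000 nucleotides, truncated at a ceiling. The sketch below assumes fixed, non-overlapping windows and 0-based positions; whether the original figure used fixed or sliding windows is not stated, and the function name `density_profile` and the example positions are hypothetical.

```python
def density_profile(positions, region_length, window=1000, cap=20):
    """Count variants per window and truncate peaks at `cap`.

    positions:     0-based coordinates of SNPs (or indels) in the region
    region_length: length of the region in nucleotides
    window:        window size in nucleotides (1000 in the caption)
    cap:           ceiling applied to each window (20 for SNPs and
                   6 for indels in the caption)
    """
    n_windows = (region_length + window - 1) // window  # round up
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < region_length:  # ignore coordinates outside the region
            counts[pos // window] += 1
    return [min(c, cap) for c in counts]


# Hypothetical positions: sparse variation followed by a dense cluster.
snps = [5, 10, 999, 1000, 2500] + list(range(3000, 3030))
print(density_profile(snps, region_length=4000))          # SNP ceiling of 20
print(density_profile(snps, region_length=4000, cap=6))   # indel ceiling of 6
```

Plotting SNP and indel profiles computed in this way on a shared axis is one simple way to check whether the two kinds of variation peak in the same windows, as the caption states for PFB.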
