**3. The genomic era and its new technologies**

The last decade has seen an incredible advance in the biologist's armamentarium for investigating complex diseases. This biological revolution has its roots in the development of molecular biology techniques such as gene cloning using restriction enzymes (Nathans and Smith, 1975), DNA hybridisation (Southern, 1975), Sanger sequencing (Sanger *et al.,* 1977) and the polymerase chain reaction (PCR) (Saiki *et al.*, 1985; Mullis and Faloona, 1987). However, it was the human genome project that provided the real impetus.

#### **3.1 The human genome project**

The human genome project began in 1989 and was initially headed by the co-discoverer of DNA's structure and Nobel laureate, James D. Watson. He was succeeded by Francis Collins, who headed the public sequencing effort, while a private company (Celera) also undertook the challenge of sequencing the entire (haploid) human genome, an estimated three billion base pairs. The twin consortia announced the completion of the draft human genome in late 2000 (public (Lander *et al.,* 2001) and Celera (Venter *et al.,* 2001)), although a so-called completed version was only announced in 2003. The templates for these human genomes were pooled DNA samples, while the entire genome of a single individual was only published in 2007 (Levy *et al.,* 2007). However, it is currently estimated that only 93% of the human genome has actually been sequenced, with repeat regions within telomeres and centromeres remaining outstanding. The public contribution to the first human genome is estimated to have cost US\$3 billion.

It was predicted prior to and during the project that the human genome would contain between 40,000 and 100,000 protein-coding genes, and that this number would greatly exceed that of other mammals, explaining, in particular, our cognitive uniqueness. However, it now appears that we have only about 20,000 to 25,000 protein-coding genes (1-2% of the entire genome), a figure that is not only similar to that of other mammals but dwarfed by some plant species (International Human Genome Sequencing Consortium, 2004).

In disease research we are particularly interested in the genetic variation between affected and unaffected individuals. As the sequencing of whole genomes for all the individuals in a research cohort remains prohibitively expensive, researchers have had to use methods that can approximate the total genetic variation.

#### **3.2 Genetic variation**

In monogenic forms of diseases such as AD, single, rare genetic polymorphisms that we commonly call mutations cause the disease. These mutations generally segregate in families or very rarely originate *de novo* in the germ line of our patient of interest (proband). The former have been, and will continue to be for the next few years at least, discovered by gene linkage studies in multi-generational families.

In contrast, most AD cases are regarded as sporadic, with no familial or geographic clustering. They are also called idiopathic (literally, of unknown pathogenesis). However, despite lacking a Mendelian pattern of inheritance, we do know that genetics contributes to susceptibility to sporadic AD. What we do not know for any particular sufferer is the magnitude of that genetic contribution or the number of genes involved. The 'common variant' hypothesis for complex diseases states that a variable number of commonly occurring gene variants combine to make up the genetic component of disease causation. A common variant or polymorphism is defined as one where the minor allele frequency is greater than 1% in the population.
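The minor allele frequency threshold above is simple to compute from genotyped data. A minimal sketch, using hypothetical genotype counts for a biallelic SNP:

```python
def minor_allele_frequency(n_AA, n_Aa, n_aa):
    """Minor allele frequency from genotype counts at a biallelic SNP.

    Each individual carries two alleles, so N genotyped people
    contribute 2N alleles in total.
    """
    n_A = 2 * n_AA + n_Aa   # copies of allele A
    n_a = 2 * n_aa + n_Aa   # copies of allele a
    return min(n_A, n_a) / (n_A + n_a)

# Hypothetical cohort: 850 AA, 140 Aa and 10 aa individuals.
maf = minor_allele_frequency(850, 140, 10)
print(round(maf, 3))  # prints 0.08 -- a 'common' variant (MAF > 1%)
```

The genotype counts and cohort size are invented for illustration; the allele-counting arithmetic itself is standard.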

The most numerous genetic variants are single nucleotide polymorphisms (SNPs), and there are thought to be around 30 million SNPs in total, meaning that two unrelated persons, chosen at random, will differ at about 1 in every 1,200 to 1,500 bases. In 2002 the National Institutes of Health started a \$138 million project, the HapMap Project, to catalogue the common SNPs in European, East Asian (Han Chinese and Japanese) and African (Yoruba) populations. By 2010 HapMap had released details of SNPs and inferred copy number polymorphisms in 1,300 individuals from these four ethnic groups (Altshuler *et al.,* 2010). As will be discussed below, this detailed data on human genetic variation, combined with gene chip technology, allows the simultaneous testing of more than 2 million SNPs in what are called genome-wide association studies (GWAs).
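A GWAs amounts to repeating a simple per-SNP test a million or more times. A minimal sketch of one such test — a Pearson chi-square on a hypothetical 2×2 allele-count table of cases versus controls (the counts are invented for illustration):

```python
def chi2_allele_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square statistic (1 df) for a 2x2 allele-count table."""
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical SNP: minor allele in 300/2000 case alleles vs 200/2000 controls.
stat = chi2_allele_test(300, 1700, 200, 1800)
print(round(stat, 1))  # prints 22.9 -- above the nominal 3.84 cut-off (p < 0.05, 1 df)
```

Note that with millions of such tests per study, the nominal per-test threshold is far too lenient; genome-wide significance demands much smaller p-values, which connects to the multiple-comparison problem discussed below for microarrays.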

The remaining 'uncatalogued' common variants will consist of rarer SNPs and smaller copy number variations (<500 base pairs) (Conrad *et al.,* 2010). The 1000 Genomes Project aims to use next generation sequencing (NGS) to bridge this gap (The 1000 Genomes Project Consortium, 2010). NGS has many current and potential applications in studying complex diseases and will be a recurring theme for the remainder of this chapter.

#### **3.3 Microarrays**


Microarrays (a generic term for gene chips) are utilised for GWAs, but they are more commonly known as the platform for genome-wide expression or transcriptomic studies. Microarrays were initially manufactured by printing the cDNAs of interest onto a glass microscope slide, but now short oligonucleotides (probes) complementary to known exonic regions are synthesised directly onto a glass microscope slide or microscopic beads. Probes are typically designed to complement (bind to) the 3' untranslated regions of transcripts, although newer generation arrays have multiple probes across whole genes. Signal detection on the microarray platform relies on probe hybridisation, and non-specific binding can potentially affect the relative detection of lowly expressed transcripts. Modern microarrays have around 50,000 probes representing the approximately 20,000 known protein-coding genes, meaning there is redundancy for some genes. The accompanying software will usually report only the maximally expressed or maximally differentiated probe for each gene. The ability to make 20,000 comparisons in a single experiment is an extremely powerful and seductive tool that, importantly, makes no *a priori* hypotheses as to the importance of any one gene. However, this number of comparisons always exceeds the number of individuals being sampled, creating statistical dilemmas for the confident detection of real differences. This 'curse of multidimensionality' similarly affects the analysis of GWAs or any -omic platform (Somorjai *et al.,* 2003).
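The multiple-comparison problem above is usually handled by adjusting the raw p-values. A minimal sketch of the two most common procedures — Bonferroni (family-wise error) and Benjamini-Hochberg (false discovery rate) — applied to a toy list of p-values:

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests (capped at 1.0)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment, controlling the FDR."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):                 # walk from largest p down
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.125, 0.25, 0.5]
print(bonferroni(pvals))          # [0.375, 0.75, 1.0]
print(benjamini_hochberg(pvals))  # [0.375, 0.375, 0.5]
# With m = 20,000 probes, a p-value must fall below 0.05 / 20,000 = 2.5e-6
# to survive Bonferroni correction at the 0.05 level.
```

The three-value list is purely illustrative; in a real 20,000-probe experiment the same functions apply unchanged, just with a far harsher penalty per test.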

#### **3.4 Next generation sequencing**

Sequencing for the human genome project was carried out using a technique called dye-terminator chemistry, developed by Fred Sanger 30 years before (Sanger *et al.,* 1977). This technique relies on the replication of the DNA template with the random incorporation of chain-terminating dideoxynucleotides, and the subsequent ordering of all the resulting fragments (now done by software) to derive the final sequence. Although capable of long, accurate reads (up to 1,500 bases) and eventually automated with capillary technology, the technique was, by today's standards, very slow.
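The fragment-ordering step can be sketched in a few lines: each terminated fragment carries a labelled base at its final position, so sorting the fragments by length reads off the synthesised strand directly. A toy illustration with made-up fragments (real base calling works on fluorescence traces, not tidy tuples):

```python
def read_sanger(fragments):
    """Reconstruct a sequence from (length, terminal_base) pairs.

    Each chain-terminated fragment ends in a dye-labelled
    dideoxynucleotide, so sorting by fragment length yields
    one base per position of the synthesised strand.
    """
    return "".join(base for length, base in sorted(fragments))

# Toy 'electropherogram': fragments of length 1..6 in arbitrary order.
fragments = [(3, "T"), (1, "G"), (6, "C"), (2, "A"), (5, "G"), (4, "A")]
print(read_sanger(fragments))  # prints GATAGC
```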

The term next generation sequencing (NGS) generically refers to platforms that allow the sequencing of several hundred thousand templates simultaneously (Margulies *et al.,* 2005). For most NGS platforms the multiple templates are sequentially exposed to a known nucleotide whose attachment to its complementary base drives the generation of ATP, which is subsequently used to produce a luminescent signal. The signals across the array are captured as an image before the excess nucleotides are washed away and the process is repeated. The actual cycle number, or read length, is platform-dependent but varies from 75 to about 400 bases. This in-parallel sequencing is obviously much quicker than the single-sample Sanger method and now allows relatively small laboratories with a single instrument to perform whole genome sequencing that was, until very recently, the domain of dedicated sequencing centres. There are now also third generation sequencing systems based on single-molecule analysis. These promise to be quicker and cheaper than NGS platforms and more accurate due to longer read lengths, although they may still not quite achieve the lengths seen with Sanger sequencing.
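The throughput demands are easy to quantify. A sketch, assuming a 3 Gb haploid genome, of how many short reads a target average coverage implies, together with the classic Lander-Waterman estimate of the fraction of bases a random-read experiment leaves uncovered:

```python
import math

GENOME_SIZE = 3_000_000_000  # ~haploid human genome, in bases

def reads_needed(coverage, read_length, genome_size=GENOME_SIZE):
    """Reads required for a target average (fold) coverage."""
    return math.ceil(coverage * genome_size / read_length)

def fraction_uncovered(coverage):
    """Lander-Waterman: with randomly placed reads, P(base uncovered) = e^-c."""
    return math.exp(-coverage)

print(f"{reads_needed(30, 100):,} reads of 100 b give 30x coverage")
print(f"{fraction_uncovered(30):.1e} of bases expected uncovered at 30x")
```

The 30x / 100 b figures are illustrative choices, not values from the text; the point is that short reads must be produced by the hundreds of millions, which is exactly what in-parallel sequencing makes feasible.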

One of the major applications of NGS is full-length mRNA sequencing, called RNA-Seq. Unlike microarrays, RNA-Seq provides a digital readout of all transcripts, including those that are lowly expressed, and it does not rely on prior knowledge of the genome for probe design. Most current preparatory methods for RNA-Seq continue to use a polyA fraction (mRNA), but total RNA methods, which require the depletion of the abundant architectural RNA species (ribosomal and transfer RNA), are improving. The latter allow the full repertoire of both coding and non-coding RNA to be quantified.

#### **3.5 Proteomics**

The term proteome describes the full complement of proteins, including post-translational variants, produced in a particular cell or tissue (Wilkins *et al.,* 1996). For many biologists, proteins remain the key functional entities, which can only be approximated by transcriptomic analyses. Certainly, there is not necessarily a linear relationship between mRNA and protein levels. In a generic proteomic analysis, the lysate is separated by two-dimensional gel electrophoresis, the 'spots' of interest are excised and digested, and the resultant peptide fragments are subjected to mass spectrometry (MS). Individual spectra are compared to databases to derive peptide identities and, by computation, the likely parent protein. This process is also referred to as peptide mass fingerprinting. It is now more common to use tandem MS, where a specific peptide is further fragmented and the fragments subjected to MS (peptide fragmentation fingerprinting). Proteomics can be made semi-quantitative by spiking in stable isotopes as reporter ions, allowing the relative abundance of the peptides in the overall spectra to be calculated.

(LOAD), with 60 years of age often given as the arbitrary cut-off. With reference to monogenic forms of AD, we might think of mutations as an increased 'dose' (severity) of an aetiological agent, and monogenic AD can present as young as 20 years of age. In fact, very rarely, AD-causing mutations are actually gene multiplications with direct gene dosage effects.

Sporadic forms of a disease are defined by having no familial or geographic clustering, but this term is slightly misleading because there is certainly a genetic component in these common forms of AD. Epidemiological studies suggest that AD sufferers have a 2.5-fold greater likelihood of a family history of the disease (Sutherland *et al.*, 2011b). In our 'dose' analogy above, the late-onset nature of these common forms of AD would be consistent with common genetic variants, conferring only slight alterations to protein function or expression, being the causative agents. We also generally assume that the genetic component will be multifaceted, with both additive and interactive relationships with environmental exposures. The latter refers to a scenario where a potential genetic risk factor only modifies disease risk if that individual has been exposed to a certain environmental stress. We will return to the discussion of AD genetics shortly, but it is useful at this stage to introduce the amyloid precursor protein (APP) and its metabolite Aβ.

#### **4.1 APP metabolism**

The Aβ peptide was initially purified from amyloid-containing AD brain tissue (Glenner and Wong, 1984) and the cored plaques of AD and Down syndrome patients (Masters *et al.,* 1985). The parent protein, APP, was then isolated from a human brain cDNA library (Kang *et al.,* 1987). The predicted 695 amino acid protein with a single transmembrane domain was described as being similar to the prion protein, a hypothesised neuronal surface receptor. There are three major APP isoforms, 695, 751 and 770 amino acids in length, with APP695 being the most common in neural tissue (Yoshikai *et al.,* 1990).

Following translation, the APP protein is retained in the secretory pathway and is transported to the cell membrane. The protein is subsequently degraded via proteolytic cleavage by either the α- and γ-secretases or, alternatively, the β- and γ-secretases, at the cell membrane or following the endocytosis of APP as part of normal membrane turnover (LaFerla *et al.,* 2007). In addition, a variable fraction of APP undergoes post-translational proteolytic cleavage within the secretory pathway. The γ-secretase is a multi-unit enzyme that includes either the presenilin 1 (PS1) or presenilin 2 (PS2) proteins, and it cleaves APP (and other proteins such as Notch) within its transmembrane domain. α-secretase activity is carried out by a family of proteins (including ADAM 9), and this cleavage results in the secretion of an extracellular fragment of APP (sAPP-α) and the retention of a C-terminal fragment (CTF) of 83 amino acids (Fig. 2). sAPP-α appears to act as a neuroprotectant and has neurotrophic effects on synaptic plasticity (Postina, 2008). Alternatively, and mutually exclusive to α-secretase cleavage, the β-secretase enzyme (BACE1, also called Asp2 or memapsin 2, is the major form in the brain) cleaves APP at variable sites about 16 amino acids proximal to the α-secretase site (Vassar *et al.,* 2009). The combination of β-secretase and γ-secretase cleavages releases a variety of peptides that are collectively called Aβ (1, 2, 3 to 39-43) (Fig. 2).
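The secretase cleavage geometry of APP described above can be checked arithmetically. A sketch using the conventional APP770 residue numbering, in which the Aβ region is taken to begin at residue 672; the coordinates are standard textbook values rather than figures from this chapter, and the helper function is purely illustrative:

```python
APP_LENGTH = 770     # residues in the longest major APP isoform
ABETA_START = 672    # first Abeta residue (beta-secretase cuts just before it)

def ctf_length(first_residue):
    """Length of the membrane-retained C-terminal fragment from a cut site."""
    return APP_LENGTH - first_residue + 1

C99 = ctf_length(ABETA_START)        # CTF left after beta-secretase cleavage
C83 = ctf_length(ABETA_START + 16)   # alpha-secretase cuts after Abeta residue 16

print(C99, C83)  # prints 99 83
```

Reassuringly, the 16-residue offset between the β- and α-secretase sites reproduces both the 83 amino acid CTF and the 99 amino acid CTF referred to in the text.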
