130 Aquaculture

genome. These characteristics, combined with the lack of a closely related guide sequence, mean that sequencing and assembling the Atlantic salmon genome are extremely challenging.

Fig. 1. Schematic representation of the phylogenetic relationships among fish species. Species in bold have publicly available full genome sequences, while those that are underlined are currently being sequenced. Note that the full spectrum of teleosts is not represented by the genome sequences that are currently available. Species are listed by their common names with orders in parentheses. Gar is used as an outgroup.

#### **3. Genomics resources**

#### **3.1 ESTs**

An expressed sequence tag (EST) is a sequenced section of a transcribed, or expressed, gene. ESTs are developed by isolating the mRNA (i.e., the transcripts of expressed genes) from a tissue, which is then reverse-transcribed into cDNA (complementary DNA). Thus, an EST library is the full repertoire (essentially an inventory) of expressed genes within a genome in a given tissue at a given time. More than 499,000 ESTs have been identified for Atlantic salmon, the sequences of which are available publicly in an online database (Rise et al., 2004; Koop et al., 2008) as well as within the NCBI database.

Perhaps the most common use of ESTs is to study gene expression (described below). Other uses of ESTs include the identification of molecular markers, or DNA tags, which are

#### **3.2 Genetic markers**

Genetic markers are short sequences of DNA that are associated with a specific location, or locus, on a chromosome. To be useful, a genetic marker must be sufficiently variable (polymorphic) among individuals or populations. Markers are valuable tools in molecular biology and have many practical uses. For example, they are used to examine ancestry and to define discrete populations, as well as for forensic work, paternity testing and tissue typing for transplants. These applications are possible because marker genotypes are inherited, and an individual's combination of genotypes across many markers (commonly called a "DNA fingerprint") is effectively unique.

#### **3.2.1 Microsatellite markers**

Satellite markers are a group of DNA markers that are characterized by tandemly repeated sequences of nucleotides (e.g., ...ACACACAC...). Depending on their size, which is a function of the number of bases in a repeat unit and the number of tandemly arrayed units, satellite markers are classed as microsatellites (short repeat unit) or minisatellites (long repeat unit). Microsatellite markers are widely used in genomics. They act as DNA anchors or tags within a genome because: 1) like all molecular markers, their genotypes are heritable, 2) the repeat units make them more vulnerable to mutation than other types of DNA sequence (e.g., coding sequence), which means they are variable at all levels, from individual to species, and generally have several alleles, 3) their locations (loci) within the chromosome can be determined relative to one another, and 4) their small size (vs. minisatellites, which can be very long) facilitates detection using PCR. The development of a suite of microsatellite markers is often one of the first steps in a genomics program for a species of interest.
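The tandem-repeat structure described above lends itself to a simple pattern search. The following is a minimal sketch, assuming perfect (uninterrupted) repeats only; the function name, the repeat-unit range of 1 to 6 bases and the threshold of four copies are illustrative choices, not values taken from the studies cited here.

```python
import re


def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Return (start, repeat_unit, n_copies) for each perfect tandem repeat.

    Thresholds are illustrative assumptions, not published criteria.
    """
    seq = seq.upper()
    # A repeat unit (shortest possible first) followed by at least
    # min_repeats - 1 further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


print(find_microsatellites("GGACACACACTT"))  # [(2, 'AC', 4)]
```

Real marker discovery pipelines also tolerate imperfect repeats and check that the flanking sequence is suitable for PCR primer design, which this sketch does not attempt.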

The first microsatellites characterized for Atlantic salmon came from genomic libraries (McConnell et al., 1995; O'Reilly et al., 1998; Sanchez et al., 1996; Slettan et al., 1995). These were classified as "anonymous" or "Type II" markers as they were not associated with a particular gene. The mining of large EST databases allowed the identification of Type I microsatellites, which could be related directly to a gene (Vasemagi et al., 2005; Ng et al., 2005). More recently, the sequencing of the ends of Bacterial Artificial Chromosome (BAC) clones has provided a rich source of microsatellites (Danzmann et al., 2008; Phillips et al., 2009). Microsatellites have been used extensively to examine the population structure of Atlantic salmon (e.g., King et al., 2001) as well as for assigning parentage and relatedness of individuals in a mixed family stock (Norris et al., 2000), which removes the need to rear families in individual tanks or to use physical tags. Many of the microsatellite markers that have been developed for one species of salmonid can be used in others (e.g., McGowan et al., 2004; Presa & Guyomard, 1996; Scribner et al., 1996), which enables genomic comparisons among salmonids such as Atlantic salmon, brown trout (*Salmo trutta*), rainbow trout and Arctic charr (Danzmann et al., 2005, 2008; Woram et al., 2004). Most aquaculture species now have a set of microsatellites, and the list is too extensive to reference here.

#### **3.2.2 Single Nucleotide Polymorphisms (SNPs)**

A single nucleotide polymorphism (SNP, pronounced *snip*) is a sequence change in a single nucleotide (A, T, C or G) that occurs with sufficient frequency in a population to serve as a genetic marker. SNPs can be within genes or within non-coding regions of the genome. Because they are usually bi-allelic (i.e., exhibit only two alleles, such as an A or a T at a given locus), they are easily identified and studied. The development of novel sequencing technologies has made it possible to obtain thousands of SNP markers for genotyping (see Molecular Ecology Resources (2011) 11: Supplement 1, for several recent articles on this subject). Moreover, new genotyping technologies enable the simultaneous analysis of a great number of these markers, which allows the construction of high-density genetic maps (Goddard & Hayes, 2009). These techniques are now being widely used in species of agricultural importance and have led to the development of numerous dense SNP arrays, which are tools that are used to screen entire genomes for the genotypes of known SNPs (Matukumalli et al., 2009; Muir et al., 2008; Ramos et al., 2009).
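To make the definition concrete, the sketch below scans a set of already aligned sequences for bi-allelic sites whose rarer allele is frequent enough to be useful as a marker. It is an illustration only: the function name, the toy sequences and the minor-allele-frequency cutoff are assumptions, and real SNP discovery must also deal with sequencing error and, in salmonids, with duplicated genomic regions.

```python
from collections import Counter


def call_snps(aligned_seqs, min_maf=0.1):
    """Return [(position, {allele: frequency})] for bi-allelic sites.

    aligned_seqs: equal-length strings, one per individual (toy input).
    min_maf: minimum frequency of the rarer allele (illustrative cutoff).
    """
    snps = []
    for pos, column in enumerate(zip(*aligned_seqs)):
        # Count only unambiguous bases; gaps and N are ignored.
        counts = Counter(base for base in column if base in "ACGT")
        if len(counts) != 2:
            continue  # monomorphic, or more than two alleles
        total = sum(counts.values())
        freqs = {allele: n / total for allele, n in counts.items()}
        if min(freqs.values()) >= min_maf:
            snps.append((pos, freqs))
    return snps


reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTC"]
print(call_snps(reads, min_maf=0.3))  # [(1, {'C': 0.5, 'T': 0.5})]
```

Position 4 in the toy data is also variable, but its rarer allele occurs in only one of four sequences and so falls below the chosen cutoff.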

There have been several attempts to identify SNPs in individual genes in Atlantic salmon (Ryynanen & Primmer, 2006) and other salmonids (Smith et al., 2005), but these were small-scale efforts. It was not until large EST databases became available that an intensive search was undertaken for SNPs that could be tied to actual genes (Hayes et al., 2007). Resequencing BAC-end sequences (described below) was another approach taken to identify SNPs for Atlantic salmon (Lorenz et al., 2010), but this process was highly labor intensive relative to the information gained. A next-generation sequencing strategy (Van Tassell et al., 2008) has also been used to identify SNPs in Atlantic salmon; however, the repetitive nature of the genome made it difficult to distinguish between putative SNPs arising from duplicated regions and those from a unique locus. Recently, a custom SNP array (iSelect by Illumina) containing approximately 6,500 SNP markers was developed for Atlantic salmon. This has been used to search for generic genetic differences between farmed and wild Atlantic salmon (Karlsson et al., 2011) and to assess the level of heterozygosity in Tasmanian Atlantic salmon (Dominik et al., 2010). In addition, the SNP array was used to construct a relatively dense SNP-based linkage map (described below) (Lien et al., 2011). For other aquaculture species the availability of SNPs is still somewhat limited, with large numbers of SNPs reported only for catfish (*Ictalurus* sp.) (Wang et al., 2010), rainbow trout (Sanchez et al., 2009) and Atlantic cod (*Gadus morhua*) (Hubert et al., 2010; Moen et al., 2008b).

#### **3.3 Linkage maps**

One of the most useful applications of genetic markers is the generation of a genetic map, also known as a linkage map, of an organism's genome. Specifically, markers that are located physically close to one another on a chromosome, or 'linked', tend to be inherited together. By determining the frequency with which alleles at different genetic loci are inherited together (i.e., the degree of linkage between markers), one can estimate the recombination distances that separate markers (i.e., the relative positions of markers within a genome). With sufficient markers (i.e., a "dense" linkage map), the number of linkage groups should correspond to the number of different chromosomes comprising a genome. Linkage maps are extremely useful tools in genomics and form the foundation for QTL mapping (described below).
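The step from co-inheritance frequencies to map distances can be sketched as follows. The recombination fraction between two markers is estimated as the proportion of offspring carrying a recombinant combination of alleles, and a mapping function converts it to centimorgans. The Haldane and Kosambi functions used here are standard textbook choices, not necessarily those used in the salmonid maps cited in this section, and the counts in the example are invented.

```python
import math


def recombination_fraction(n_recombinant, n_total):
    """Proportion of informative offspring that are recombinant."""
    return n_recombinant / n_total


def haldane_cm(r):
    """Map distance in cM, assuming no crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Map distance in cM, allowing for some crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Invented example: 10 recombinants among 100 informative offspring.
r = recombination_fraction(10, 100)
print(round(haldane_cm(r), 2))  # 11.16
print(round(kosambi_cm(r), 2))  # 10.14
```

Markers with a recombination fraction close to 0.5 are effectively unlinked, which is why the functions reject values at or above that limit.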

The first linkage maps for Atlantic salmon were constructed using relatively few microsatellite markers (50–60) (Gilbey et al., 2004; Moen et al., 2004). The number of microsatellite markers was expanded to approximately 2,000 thanks to BAC-end sequencing, and a relatively dense genetic map covering the 29 pairs of chromosomes in the European strain of Atlantic salmon was built (Danzmann et al., 2008). Additional maps were constructed using SNPs derived from ESTs (Moen et al., 2008a) and BAC-end sequences (Lorenz et al., 2010), and these were incorporated into a dense SNP map comprising ~5,600 SNPs (Lien et al., 2011). In terms of other salmonids, linkage maps have been constructed for rainbow trout (Guyomard et al., 2006; Nichols et al., 2003; Rexroad et al., 2005; Sakamoto et al., 2000), brown trout (Gharbi et al., 2006), coho salmon (*Oncorhynchus kisutch*) (McClelland & Naish, 2008), and Arctic charr (Woram et al., 2004). Other aquaculture species of interest for which linkage maps have been developed include catfish (Kucuktas et al., 2009; Waldbieser et al., 2001), Japanese flounder (Coimbra et al., 2003), European sea bass (Chistiakov et al., 2005), tilapia (Lee et al., 2005), common carp (Sun & Liang, 2004), olive flounder (Kang et al., 2008) and Atlantic cod (Hubert et al., 2010; Moen et al., 2009b), and the list is expanding.

#### **3.4 Bacterial Artificial Chromosome (BAC) libraries**

#### **3.4.1 BACs and BAC libraries**

A bacterial artificial chromosome (BAC) clone consists of a special plasmid and a large segment of genomic DNA from the species of interest. Specifically, the entire genome of a selected individual is cut into pieces of approximately 150,000–300,000 base pairs, each of which is inserted into the plasmid, thereby generating thousands of closed, circular "hybrid DNA constructs". Thus, a BAC library refers to all of the BACs, which in combination comprise the entire genome, for a representative individual of a species. Each BAC is then inserted ("transformed") into live *E. coli* cells; thus, the foreign (non-bacterial) DNA can be replicated as if it were a bacterial chromosome using the cellular machinery found within the bacterial cell. Ideally there should be redundancy of the segments of the genome represented in a BAC library such that overlaps of individual clones occur. The ability to identify which BACs share a segment of the genome enables a network called a physical map to be constructed. The physical map therefore represents the genome, and has traditionally been used to assist with sequence assembly following whole genome sequencing. BAC libraries also facilitate the identification of target sequences or regions, as the library can be "screened" for genetic markers or other sequence-specific probes. An Atlantic salmon BAC library comprising approximately 300,000 clones and providing 18-fold coverage of the genome is available (Thorsen et al., 2005), and it was used to generate a physical map (Ng et al., 2005). BAC libraries have been constructed for several aquaculture species including rainbow trout (Palti et al., 2009), catfish (Xu et al., 2007), common carp (Li et al., 2011b) and barramundi (*Lates calcarifer*) (Xia et al., 2010).
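The relationship between clone number, insert size and genome coverage is simple arithmetic, sketched below. The clone count and the 18-fold figure come from the text; the genome size of about 3 billion base pairs and the mean insert of 180,000 bp are assumptions chosen to be consistent with those figures, not values reported here. The probability calculation assumes that clones sample the genome uniformly at random, which real libraries only approximate.

```python
def fold_coverage(n_clones, mean_insert_bp, genome_bp):
    """Average number of times each base is represented in the library."""
    return n_clones * mean_insert_bp / genome_bp


def prob_locus_represented(n_clones, mean_insert_bp, genome_bp):
    """Chance that a given locus lies in at least one clone,
    assuming clones are drawn uniformly at random from the genome."""
    return 1.0 - (1.0 - mean_insert_bp / genome_bp) ** n_clones


GENOME_BP = 3.0e9          # assumed genome size, not stated in the text
MEAN_INSERT_BP = 180_000   # assumed; within the 150,000-300,000 bp range
N_CLONES = 300_000         # approximate clone count given in the text

print(fold_coverage(N_CLONES, MEAN_INSERT_BP, GENOME_BP))  # 18.0
print(prob_locus_represented(N_CLONES, MEAN_INSERT_BP, GENOME_BP) > 0.999999)  # True
```

High redundancy is what makes overlaps between clones likely, which in turn is what allows the clones to be ordered into a physical map.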

#### **3.4.2 BAC-end sequencing**

Once a BAC library has been generated, typically the next step is to select a subset of BACs for end sequencing. Because the sequence of the plasmid vector within the BAC is known, the vector sequences at the junctions of the genomic DNA and vector DNA can act as sequencing primer sites. In this way, the first 500–1,000 nucleotides (based on Sanger sequencing) of the genomic DNA inserts (i.e., the BAC-ends) can be determined. End-sequencing a large subset of BACs can provide extensive insight into the full genome sequence. For example, the 207,869 BAC-end sequences for the Atlantic salmon BAC library cover approximately 3.5% of the whole genome sequence. This glance into a genome is very powerful, as it can provide information about the complexity of the genome (i.e., repeat content), and the BAC-end sequences can be a source of molecular markers (as described above for Atlantic salmon (Danzmann et al., 2008; Lorenz et al., 2010)). Finally, the BAC-end sequences can be used for comparative synteny analyses by aligning them against other, fully sequenced genomes, which can provide insight into the gene content of the BACs, thus providing a partial, putative annotation of segments of the genome (Sarropoulou et al., 2007). This is particularly powerful if the two genomes being compared are closely related, as one can act as a reference genome (e.g., common carp and zebrafish, which are both Cypriniformes) (Xu et al., 2011). Even when there is no obvious reference genome there is value in this exercise, as it often suggests syntenic regions among phylogenetically distantly related species, and can lead to candidate genes being suggested for specific traits (see below for an example with ISA resistance in the MAS section). The Atlantic salmon BAC-end sequences have been made available in a publicly accessible searchable database (www.asalbase.org) along with comparative genomic information for four of the five published fish genomes. Similar work is ongoing with aquaculture species such as catfish (Liu et al., 2009), rainbow trout (Genet et al., 2011) and sea bream (Kuhl et al., 2011).

#### **3.4.3 Integration of physical and linkage maps and chromosomes**

Once both a physical map (i.e., a BAC library assembled into overlapping, contiguous BACs) and a linkage map (generated using the relationships between variable markers) exist for a species, the next step is to integrate the two maps. This procedure is relatively straightforward if the linkage map contains many markers from BAC-ends, as illustrated by the integration of the Atlantic salmon physical and linkage maps (Danzmann et al., 2008). However, this is seldom the case and considerable effort is normally required to carry out this process (e.g., see preliminary data for rainbow trout) (Palti et al., 2011). The integration of the linkage and physical maps then allows BACs containing particular markers to be selected for fluorescence *in situ* hybridization on chromosomes so that each linkage group can be assigned to a particular chromosome. This information is now available for several salmonid species including Atlantic salmon (Phillips et al., 2009) and rainbow trout (Phillips et al., 2006).

#### **3.5 Quantitative trait loci (QTL)**

Generally, the characteristics that are desirable for selection are complex, or 'quantitative', which means that they are not simply governed by a single gene, but rather are controlled by a suite of genes and regulatory factors that interact to produce a phenotypic characteristic. The resulting phenotype usually falls somewhere on a spectrum. This is not always the case; traits such as genetic disorders can be determined by a single gene (e.g., cystic fibrosis and Duchenne muscular dystrophy), but most phenotypic traits associated with health, growth, meat quality and the capacity for utilizing new and ecologically sustainable food items are complex traits that are strongly influenced by the environment, and thus show a range of phenotypes.

A deep biological understanding of phenotypic patterns of commercial importance, ranging from intracellular molecular signatures to whole-individual physiological, morphological and behavioral characteristics, is synonymous with an understanding of the functioning of the underlying gene regulatory networks and how these are molded by the environment. This calls for the identification of genes and characterization of their functional repertoires in various systemic settings. Specifically, the first step in understanding the mechanisms that govern complex traits is to determine the heritability of the trait, i.e., the extent to which variation in the trait is determined genetically rather than by the environment. Heritability estimates vary tremendously depending on the population, the experimental design and the researchers. Garcia de Leaniz et al. (2007) provide the best available review of heritability estimates for Atlantic salmon. What is clear is that genes do play a role in the variation detected in the majority of economically important production traits.
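One classical way to estimate heritability, shown here as a minimal sketch, is to regress offspring trait values on the mean of their two parents: under standard quantitative-genetic assumptions the slope estimates narrow-sense heritability. This method is a textbook illustration and is not necessarily the approach used in the studies reviewed above; the function name and the data are invented.

```python
def midparent_offspring_h2(midparent, offspring):
    """Least-squares slope of offspring value on mid-parent value.

    Under standard assumptions (no shared-environment effects, random
    mating) the slope estimates narrow-sense heritability.
    """
    n = len(midparent)
    mean_x = sum(midparent) / n
    mean_y = sum(offspring) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparent, offspring))
    sxx = sum((x - mean_x) ** 2 for x in midparent)
    return sxy / sxx


# Invented body weights (kg): one mid-parent value per family and the
# mean weight of that family's offspring.
midparent = [1.0, 2.0, 3.0, 4.0]
offspring = [1.5, 2.0, 2.5, 3.0]
print(midparent_offspring_h2(midparent, offspring))  # 0.5
```

A slope of 0.5 would mean that about half of the phenotypic variation among families is attributable to additive genetic effects in that population and environment.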

The two most common approaches to identifying genes associated with complex traits are to examine candidate genes, chosen on the basis of previous knowledge of a biochemical process, and to map quantitative trait loci (QTL). We will not discuss the candidate gene approach here, but rather concentrate on QTL analysis. QTL are regions of the genome that contain, or are linked to, genes that contribute to a particular trait (Lynch & Walsh, 1998). Note that the term "QTL" does not refer to a particular gene, but rather to its putative location in the genome. This association, or linkage, is determined using the genetic markers within a linkage map, and how alleles at these markers co-segregate with the trait in question. Given sufficiently tight linkage, the markers that define QTL can feed directly into marker-assisted selection (MAS). Finding functional markers to assist in MAS therefore depends largely on the level of polymorphism in the species. In most agricultural examples, there are highly inbred strains that make MAS usable. When a species is highly variable, as is the case with fish and most marine species, the QTL markers have to be very close to the causal variant to be usable in MAS breeding studies; the more variable the species, the closer the marker has to be. The number, positions and magnitude of the QTL affecting a trait are determined by statistical associations between marker alleles or genotypes and particular trait phenotypes. QTL analysis, therefore, relies heavily on robust, dense linkage maps as well as large families that are accurately pedigreed and carefully described with respect to phenotype. Indeed, a limiting factor for effective QTL analysis is the difficulty in characterizing "phenomes", the full set of phenotypes of an individual (Houle et al., 2010).
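The statistical association between marker genotypes and trait phenotypes can be illustrated with the simplest possible test, a one-way analysis of variance at a single marker. This sketch is a deliberately reduced stand-in for the interval-mapping and mixed-model methods used in practice; the function name, genotype labels and phenotype values are invented.

```python
def single_marker_anova(phenotypes_by_genotype):
    """One-way ANOVA of a trait across the genotype classes of one marker.

    Returns (F statistic, proportion of phenotypic variance explained).
    """
    groups = [g for g in phenotypes_by_genotype.values() if g]
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((y - mean_g) ** 2 for y in g)

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r_squared = ss_between / (ss_between + ss_within)
    return f_stat, r_squared


# Invented body weights grouped by genotype at one hypothetical marker.
weights = {"AA": [10.0, 12.0], "AB": [14.0, 16.0], "BB": [18.0, 20.0]}
f_stat, r2 = single_marker_anova(weights)
print(f_stat)         # 16.0
print(round(r2, 3))   # 0.914
```

Repeating such a test at every marker along a linkage map, with appropriate correction for multiple testing, gives a rough profile of where QTL may lie; sparse maps leave large intervals in which a QTL cannot be localized.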

Growth is one of the most economically important traits for aquaculture species. It is no surprise, therefore, that several QTL have been identified for body weight and condition factor in Atlantic salmon (Baranski et al., 2010; Boulding et al., 2008; Moghadam et al., 2007; Reid et al., 2005), rainbow trout (Haidle et al., 2008; Martyniuk et al., 2003; O'Malley et al., 2003; Wringe et al., 2010), coho salmon (McClelland & Naish, 2010; O'Malley et al., 2010) and Arctic charr (Moghadam et al., 2007; Kuttner et al., 2011). QTL for several other traits in salmonids have been mapped, including upper temperature tolerance (Jackson et al., 1998; Perry et al., 2001; Somorjai et al., 2003), spawning time (O'Malley et al., 2003), developmental rate (Nichols et al., 2008), and resistance to pathogens (Gheyas et al., 2010; Gilbey et al., 2006; Houston et al., 2008, 2009, 2010; Jones et al., 2002; Moen et al., 2007, 2009a). Given that many of the microsatellite markers derived from one salmonid species amplify the DNA from other salmonid species, it has been relatively straightforward to carry out comparative analyses of the rainbow trout, Arctic charr and Atlantic salmon genomes (Danzmann et al., 2005).

Most of the QTL studies described above were based on a limited number of microsatellite markers; thus, many regions of the genome were sparsely represented. It was therefore not possible to obtain precise and complete information about the number and locations of most of these QTL, to identify specific genes and the underlying genetic variation, or to obtain markers with sufficiently robust associations with the QTL to warrant their integration into MAS programs. Cost has been the main factor limiting the number of microsatellites that could reasonably be used. However, with the advent of SNP arrays, this problem has largely been overcome for agricultural animals such as cattle, pigs and chickens (Groenen et al., 2009; Matukumalli et al., 2009; Ramos et al., 2009), and great progress is being made for some aquaculture species such as Atlantic salmon (Kent et al., in preparation).

#### **3.6 Expression profiling (microarrays and qPCR)**

The concept of gene expression was introduced above in the discussion of expressed sequence tags (ESTs) and EST libraries, which are assemblages of short sequences from the genes expressed in a tissue at a particular time point. The term "expression profiling" refers to the examination of the genes expressed in an individual, or group of individuals, under specific conditions. This is based on the fact that gene expression varies with genotype (i.e., the suite of genes and their alleles that are present within an individual's genome) and environment, which cues the expression of particular genes as well as the extent of expression. Expression profiling studies generally compare groups (treatment groups) of individuals that differ in some way, such as genetic background or environmental surroundings. Using statistical analyses, one can then examine which genes are relatively over- and under-expressed across treatment groups, thus gaining insight into the genes that govern particular traits that differ between the subjects.

There are two approaches to studying gene expression that are now standard: real-time PCR (also called quantitative real time PCR or qPCR, or sometimes abbreviated qrtPCR), and microarray analysis. The former is used to examine the expression of a single gene, or a few genes at a time. The technique follows the general principles of regular PCR, in which DNA (in this case, having been reverse-transcribed from RNA, which is present in amounts that are directly proportional to the extent of gene expression) is amplified from a template strand, with the key difference that the amount of product is measured in real time. This is accomplished in one of two ways: 1) the DNA components in the reaction mixture are tagged with dyes that fluoresce when integrated into the DNA molecule, and the level of fluorescence (i.e., corresponding to the amount of product) is measured after each PCR cycle, or 2) sequence-specific probes are similarly tagged, and fluoresce when hybridized to the complementary DNA target, thus providing a direct measurement of the amount of target present after each PCR cycle. The benefits of qPCR are that the process is gene-specific and the results are relatively accurate and reproducible. The cons are that only a small number of genes within a genome can be examined at once, and that sequence-specific PCR primers must be designed, which is dependent on knowing the sequence of the gene *a priori*. This is further complicated when examining gene families or duplicated genes, as it is often

other salmonid species, it has been relatively straightforward to carry out comparative analyses of the rainbow trout, Arctic charr and Atlantic salmon genomes (Danzmann et al., 2005).

Most of the QTL studies described above were based on a limited number of microsatellite markers; thus, many regions of the genome were sparsely represented. Therefore, it was not possible to obtain precise and complete information about the number and locations of most of these QTL, to identify specific genes and the underlying genetic variation, or to obtain markers with sufficiently robust associations with the QTL to warrant their integration into MAS programs. Cost has been the main factor limiting the number of microsatellites that could reasonably be used. However, with the advent of SNP arrays, this problem has largely been overcome for agricultural animals such as cattle, pigs and chickens (Groenen et al., 2009; Matukumalli et al., 2009; Ramos et al., 2009), and great progress is being made for some aquaculture species such as Atlantic salmon (Kent M et al., in preparation).

#### **3.6 Expression profiling (microarrays and qPCR)**

The concept of gene expression was introduced above in the discussion on expressed sequence tags (ESTs) and EST libraries, which are assemblages of short sequences from all the expressed genes in a tissue at a particular time point. The term "expression profiling" refers to the examination of the genes expressed in an individual, or group of individuals, under specific conditions. This is based on the fact that gene expression varies with genotype (i.e., the suite of genes and their alleles that are present within an individual's genome) and environment, which cues the expression of particular genes as well as the extent of expression. Expression profiling studies generally compare groups (treatment groups) of individuals that differ in some way, such as genetic background or environmental surroundings. Thereby, one can examine which genes are relatively over- and under-expressed across treatment groups using statistical analyses, thus gaining insight into the genes that govern particular traits that differ between the subjects.

There are two approaches to studying gene expression that are now standard: real-time PCR (also called quantitative real-time PCR or qPCR, or sometimes abbreviated qrtPCR), and microarray analysis. The former is used to examine the expression of a single gene, or a few genes at a time. The technique follows the general principles of regular PCR, in which DNA (in this case, having been reverse-transcribed from RNA, which is present in amounts that are directly proportional to the extent of gene expression) is amplified from a template strand, with the key difference that the amount of product is measured in real time. This is accomplished in one of two ways: 1) the DNA components in the reaction mixture are tagged with dyes that fluoresce when integrated into the DNA molecule, and the level of fluorescence (i.e., corresponding to the amount of product) is measured after each PCR cycle, or 2) sequence-specific probes are similarly tagged, and fluoresce when hybridized to the complementary DNA target, thus providing a direct measurement of the amount of target present after each PCR cycle. The benefits of qPCR are that the process is gene-specific and the results are relatively accurate and reproducible. The drawbacks are that only a small number of genes within a genome can be examined at once, and that sequence-specific PCR primers must be designed, which depends on knowing the sequence of the gene *a priori*. This is further complicated when examining gene families or duplicated genes, as it is often not feasible to design primers specific for each gene, a problem of particular relevance when there is no genome sequence for the study organism.
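As a rough illustration of how per-cycle fluorescence readings are turned into an expression estimate, the sketch below applies the widely used 2^-ΔΔCt calculation. This calculation is not described in this chapter and is assumed here purely for illustration. It takes threshold-cycle (Ct) values for a target gene and a reference gene in a treated and a control sample, and it assumes that the amount of product doubles with every cycle. The function name, argument names and Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene (treated vs. control) by the 2^-ddCt method.

    Assumes perfect doubling of product per PCR cycle; a lower Ct means
    more starting template.
    """
    # Normalize the target gene to the reference gene within each sample.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # Compare the normalized values between samples.
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses the threshold two cycles
# earlier in the treated sample, while the reference gene is unchanged.
fold = relative_expression(22.0, 18.0, 24.0, 18.0)
print(fold)  # 4.0
```

A two-cycle difference corresponds to a four-fold difference in starting template under the doubling assumption, which is why small Ct shifts can represent large expression changes.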

Whereas qPCR targets one or a few genes at a time, DNA microarrays are specifically designed to examine the expression of thousands of genes at once. A microarray (also referred to as a "chip") comprises thousands of single-stranded DNA spots, or "features", attached in known locations to a surface, such as a glass slide, with each spot corresponding to a gene. To analyze gene expression using a microarray, cDNA (again, reverse-transcribed from RNA) that has been fluorescently tagged is washed over the slide, and complementary strands hybridize to the corresponding spots on the microarray. Thus, when the level of fluorescence is measured, the intensity of each spot corresponds to the extent of expression of that particular gene. Statistical tests are then used to determine the relative expression of each gene on the microarray for the individual/tissue/environmental condition being tested. This method is extremely powerful given the number of genes that can be assessed at once. For example, several microarrays have been designed using the Atlantic salmon EST library, including an initial 3.7K (i.e., 3700 spots) array (Rise et al., 2004), followed by a 16K array (von Schalburg et al., 2005) and a 32K cDNA array (Koop et al., 2008), as well as two recent 44K oligonucleotide arrays. These resources, which are collectively referred to as the GRASP (Genomic Research on All Salmon Project) arrays, have been used extensively for gene expression studies throughout the salmonid genomics world. For instance, recent studies have used the GRASP arrays to look for genetic markers of immune responses in Atlantic salmon (LeBlanc et al., 2010), as well as of thermal stress in rainbow trout (Lewis et al., 2010) and Arctic charr (Quinn et al., 2011a, 2011b), and to predict spawning failure in wild sockeye populations (Miller et al., 2011).

Despite their utility, however, microarrays (particularly those based on cDNA clones) have inherent drawbacks, in that they are resource-intensive and highly susceptible to technical variation as well as statistical pitfalls (Martin et al., 2007). The latter may result in Type I errors (false positive results), a particular concern when using a large microarray, as the number of false positives increases with the number of spots tested. Thus, microarray data are often confirmed, or validated, by selecting a few genes to test using qPCR. Another challenge is the potential for cross-hybridization between similar transcripts, which is of particular concern for duplicated genomes, or for large gene families, such as that of the hemoglobin genes, which contains multiple members of high sequence similarity (Quinn et al., 2010). This can mask evidence of differential expression between individual transcripts and instead produce a general trend for all highly similar genes on the microarray. In addition, this phenomenon may cause information to be lost if the elevated expression of a small number of family members is spread among all similar genes on the microarray such that no spots meet the statistical filters assigned. The use of transcript-specific oligonucleotide arrays may overcome some of these challenges. For these reasons, profiling using microarrays is sometimes described as an exploratory tool, for which the data must be interpreted with caution, or which must be followed up or correlated with other types of analysis (Lewis et al., 2010; Quinn et al., 2011a).
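The link between array size and false positives can be made concrete with a little arithmetic. The sketch below first computes how many spots would be expected to pass a per-spot significance threshold by chance alone, and then applies the Benjamini-Hochberg procedure, which is one common way to control the false discovery rate. The choice of that procedure, the 0.05 threshold, the function names and the p-values are all assumptions made for illustration, not the filters used in the studies cited above.

```python
def expected_false_positives(n_spots, alpha=0.05):
    """Spots expected to pass a per-spot threshold if no gene truly differs."""
    return n_spots * alpha


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests pass at the given FDR."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passed[idx] = True
    return passed


# Array sizes mentioned in the text, at an assumed per-spot alpha of 0.05.
for size in (3700, 16000, 32000):
    print(size, round(expected_false_positives(size)))

# Ten hypothetical per-spot p-values.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

With these hypothetical p-values, five spots fall below 0.05 individually, but only the two smallest survive the correction, which illustrates why uncorrected results from a large array are usually followed up with qPCR.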

#### **3.7 Bioinformatics and genomics databases**

The advancement of genomics technologies has resulted in the generation of vast amounts of biological data. This has fueled the need for computational systems and techniques to manage and analyze such data, the development of which has led to the advent of the field of bioinformatics. Specifically, bioinformatics refers to the application of computer science and information technology to biology to increase the understanding of biological processes. Bioinformatics and genomics now go hand in hand; indeed, the bottlenecks for obtaining and interpreting genomics results often occur in the bioinformatics steps, as high-throughput genomics technologies produce more data than the available computational tools can handle. Individuals specializing in both biology and computer science are increasingly in demand. For genomics, the general goals of bioinformatics include the development of computational tools for data management (e.g., databases and interactive online tools), data manipulation (map assembly, sequence/genome assembly, protein structure alignment) and data mining (e.g., pattern recognition, genome annotation, sequence comparisons, predictions of gene expression and genome-wide association studies). Given that the fundamental goals of genomics research are shared across all genomes, standard computational tools can often be used for numerous genomes, requiring only minor adjustments to accommodate the specific resources for the species of interest.

The concept of "open data", or public accessibility, is generally adopted for all genomics projects, and most genomics data are posted online in publicly accessible databases. Indeed, it is a requirement for publication in most peer-reviewed academic journals that sequences as well as gene expression data be made available online before any results can be considered for publication within the journal. Besides preventing redundant work and enabling research groups to build on the work of others, open access data opens the potential for cross-species comparisons, which is particularly true among the Salmonidae. That is, online public databases such as those managed by NCBI (National Center for Biotechnology Information), Ensembl (Ensembl Genome Browser), and GEO (Gene Expression Omnibus) provide an opportunity to make predictions about the genes in closely related species for which there are few genomic resources available (e.g., Arctic charr). Furthermore, for most species for which there is a large genomics program, a species-specific database will be generated. For example, the Atlantic salmon database, publicly available at www.asalbase.org, comprises both the linkage and physical maps, as well as all of the BAC-end sequences and marker information. Also included are tools that enable comparative synteny analyses between Atlantic salmon and the other sequenced species. The next step for Asalbase will be integrating the sequence scaffolds for the whole genome sequence. Finally, the last decade saw many research labs pack away their pipettes and centrifuges and replace them with large computer networks and powerful servers, as the wealth of genomics data available online now allows entire experiments, such as comparative synteny analyses, evolutionary studies and even gene expression meta-analyses, to be conducted "*in silico*", or with computers alone. This not only greatly improves the speed of research, but also reduces the cost and the number of live specimens, or research animals, required.

### **4. MAS in aquaculture**

#### **4.1 Examples of QTL-based markers for aquaculture**

Over the last decade breeders have begun to include molecular genetics in their strategies for genetically improving salmonids for aquaculture production efficiency. Markers that exhibit strong associations with QTL that play a substantial role in a desirable trait can be integrated into MAS programs. That is, rather than using trial and error to cross individuals that exhibit phenotypic traits of interest, breeders can purposely select individuals that exhibit the marker associated with the desirable QTL. This fine-scale, targeted approach to artificial selection has the potential to be substantially more precise, efficient and effective than traditional, phenotype-based approaches.

One example of an economically important trait for which there is a QTL is Infectious Salmon Anemia (ISA) resistance in Atlantic salmon. Previous studies determined that there is a genetic component to ISA resistance in Atlantic salmon (Gjøen & Bentsen, 1997; Ødegård et al., 2007) and a QTL for this trait was mapped (Moen et al., 2004). A more recent study confirmed this QTL and showed that it explains 6% of the phenotypic variation and 32-47% of the additive genetic variation (Moen et al., 2007). A comparative genomics analysis of the most likely QTL region suggested several candidate genes for ISA resistance (Li et al., 2011a). These genes remain to be validated by testing for associations between polymorphisms within them and the trait. However, it is anticipated that a combination of fine mapping and genomic sequence information will identify the causative allele that confers resistance and that this will be incorporated into a MAS breeding program.
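The two percentages quoted for the ISA QTL can be related by simple arithmetic. If a QTL accounts for a fraction x of the phenotypic variance and a fraction y of the additive genetic variance, then the ratio x / y equals the additive genetic variance divided by the phenotypic variance, which is the usual definition of narrow-sense heritability. The sketch below performs that back-of-envelope calculation. The resulting range is only what the quoted figures imply; it is not a value reported in this chapter, and the function name is hypothetical.

```python
def implied_heritability(frac_of_phenotypic, frac_of_additive):
    """Narrow-sense heritability implied by a QTL's share of each variance.

    V_qtl / V_P = frac_of_phenotypic and V_qtl / V_A = frac_of_additive,
    so V_A / V_P = frac_of_phenotypic / frac_of_additive.
    """
    return frac_of_phenotypic / frac_of_additive


# Figures quoted in the text: 6% of phenotypic variation,
# 32-47% of additive genetic variation.
low = implied_heritability(0.06, 0.47)
high = implied_heritability(0.06, 0.32)
print(f"{low:.2f} to {high:.2f}")  # 0.13 to 0.19
```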

Another excellent example of the potential of MAS involves resistance to a pathogen that causes Infectious Pancreatic Necrosis (IPN) in Atlantic salmon. A major QTL for IPN resistance in Atlantic salmon has been identified and confirmed (Gheyas et al., 2010; Houston et al., 2008; Moen et al., 2009a). A four-marker haplotype has been identified that strongly segregates with IPN resistance in Norwegian Atlantic salmon broodstock, providing a solid framework for linkage-based MAS in this population (Moen et al., 2009a).
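To show what linkage-based selection on a multi-marker haplotype might look like in practice, the sketch below ranks candidate broodstock by the number of copies they carry of a favourable four-marker haplotype. It is a toy under stated assumptions: the alleles, the fish identifiers and the data layout are all hypothetical, and it assumes that each fish's two haplotypes have already been phased, which in a real program would require pedigree or statistical phasing.

```python
# Hypothetical favourable alleles at four linked markers, in map order.
RESISTANT_HAPLOTYPE = ("A", "G", "T", "C")


def copies(fish, haplotype=RESISTANT_HAPLOTYPE):
    """Count how many of the fish's two phased haplotypes match the target."""
    return sum(1 for h in fish["haplotypes"] if tuple(h) == haplotype)


# Hypothetical candidates, each with two phased four-marker haplotypes.
candidates = [
    {"id": "F001", "haplotypes": [("A", "G", "T", "C"), ("A", "G", "T", "C")]},
    {"id": "F002", "haplotypes": [("A", "G", "T", "C"), ("G", "G", "T", "A")]},
    {"id": "F003", "haplotypes": [("G", "A", "C", "A"), ("G", "G", "T", "A")]},
]

# Rank by copy number and keep only carriers of at least one copy.
ranked = sorted(candidates, key=copies, reverse=True)
selected = [fish["id"] for fish in ranked if copies(fish) > 0]
print(selected)  # ['F001', 'F002']
```

Because the haplotype is linked to the QTL rather than being the causative variant itself, a scheme like this would only be expected to hold within the population in which the association was established.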

These results support the prediction that selective breeding for resistance to ISA and IPN (caused by viruses) as well as furunculosis (caused by a bacterium) should be possible given the heritabilities for these traits (Kjoglum et al., 2008). However, the speed at which selective breeding for these traits can be performed efficiently and effectively will depend on the ability to identify the underlying genetic basis for the resistance, and this requires a reference sequence for Atlantic salmon. Other traits of interest for QTL identification for MAS in aquaculture include tissue quality, the ability to thrive on commercial diets (i.e., increased vegetable oil content), pathway analysis for vaccine development, and ability/suitability for high-density rearing conditions. Overall, integrating MAS into broodstock development programs can provide an effective and efficient, science-based means of minimizing animal stress while maximizing growth, health and sustainability.

#### **4.2 Challenges facing MAS for aquaculture**

Currently, given the lack of fully sequenced genomes for aquaculture species, genomics-based selection, or MAS, is fully dependent on having a suite of well-developed genomics tools for the genome of the species of interest. In addition to the cost associated with the development of these tools, the complex nature of fish genomes combined with the large number of diverse fish species of interest for aquaculture complicates this goal. However, the availability of fully sequenced reference genomes for a few key species groups would go a long way toward filling the major under-represented gaps in the fish phylogeny. As illustrated in Figure 1, there is no reference genome from a closely related species for either the Salmoniformes or the Gadiformes. However, work on the genomes of Atlantic salmon (Davidson et al., 2010) and Atlantic cod (Johansen et al., 2009) is in progress, and the results are eagerly anticipated by the aquaculture community that breeds these and related species.

### **5. Role of whole genome sequencing (WGS) in genomics**

#### **5.1 Benefits of whole genome sequences**

The availability of a complete genome sequence for Atlantic salmon will have a major impact on all sectors of the international salmonid community. For the aquaculture industry it will provide a complete suite of genetic markers for the identification of the genes and alleles responsible for production traits. Companies will be able to develop tailored broodstock using nucleotide- or allele-assisted selection rather than the more general marker-assisted selection. In conjunction with traditional breeding practices, this approach promises rapid gains that will make companies that embrace this technology more competitive.

#### **5.2 Methods and approaches to whole genome sequencing**

The conventional approach to sequencing whole genomes is to break the genome into small parts, and sequence each part using Sanger sequencing (reviewed in Hutchison, 2007). Although this approach remains the gold standard for sequence and assembly quality, limitations with respect to cost, labor-intensiveness and speed have fueled the demand for new approaches to DNA sequencing. In recent years, several novel high-throughput sequencing platforms, or 'next generation' sequencing approaches, have entered the market (for a review, see Metzker, 2010). Most of these are targeted to the goal of re-sequencing an entire human genome for <\$1,000, and their capabilities with respect to sequencing whole genomes *de novo* (i.e., for the first time, and without a reference sequence for comparative assembly) remain unknown. This is complicated by the diverse, highly repetitive and overall complex nature of fish genomes, given that these technologies are currently limited by their sequence read lengths and by difficulties with specific types of repetitive sequences. Generally, the next generation sequencing platforms are capable of producing vast amounts of sequence data in very short periods of time for significantly lower cost than traditional Sanger sequencing, but the bottleneck lies in developing the computational tools to handle the data, and ultimately, to assemble the short sequence reads into contiguous (gap-free) stretches of DNA sequence when no reference genome is available (Quinn et al., 2008).
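The idea of assembling short reads into contiguous sequence without a reference can be illustrated with a toy greedy overlap assembler. This is a teaching sketch only, not the method used by any production assembler or by the projects cited here. It repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap, and it assumes error-free reads from a single strand. The function names, reads and minimum overlap length are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        # Join the best pair, writing the shared bases only once.
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)


# Three overlapping hypothetical reads from one short stretch of DNA.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

When a repeat is longer than the reads, several different pairs show equally good overlaps and a greedy merge can join the wrong neighbours, which is one way to see why short read lengths and repetitive genomes are a difficult combination.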

Aside from their potential contribution to whole genome sequencing, next generation sequencing technologies have countless implications for genomics. One of the most common uses for next generation sequencing in human genomics is re-sequencing, i.e., sequencing the genomes of individuals for which there is a full reference sequence. This could be applied to aquaculture to look for allelic variations between individuals of the same species that show differences in characteristics of interest (e.g., disease resistance or growth). Similarly, next generation sequencing has been used for "transcriptome sequencing" (also referred to as "transcriptomics"), i.e., sequencing the entire repertoire of transcribed RNA of a species. This is a form of expression profiling, as the relative numbers of transcripts of genes can be compared against one another, thus providing insight into the expression levels of the genes at a given time. This approach is being used much more extensively as the cost of next generation sequencing technologies decreases. Finally, next generation sequencing provides an efficient and effective means of identifying molecular markers, including microsatellites, SNPs (Chan, 2009) and RAD (Restriction site Associated DNA) markers, which are markers (e.g., SNPs) contained within RAD tags, or the sequences immediately flanking specific DNA sites that are targeted by restriction enzymes. The use of next generation sequencing technologies to sequence RAD tags to identify RAD markers is referred to as RADseq, and its popularity is increasing quickly given its utility for identifying individual-specific markers, and thus for generating an indexed marker library for an individual organism. Therefore, as the capabilities and availability of next generation technologies continue to advance, we are sure to see drastic changes in the approaches to genomics, centered around high-throughput, next generation technologies as well as bioinformatics-based analyses. These advances will provide vast opportunities for integrating genomics into aquaculture breeding programs.
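The definition of a RAD tag given above can be sketched in a few lines. The example below locates each occurrence of a restriction enzyme recognition site in a sequence, extracts the bases flanking it on either side, and then compares the same tag between two individuals to reveal a SNP. It is a simplified illustration: the sequences, tag length and function names are hypothetical, the recognition site used is that of EcoRI (GAATTC), and the exact cut position within the site is ignored.

```python
def rad_tags(sequence, site="GAATTC", tag_len=8):
    """Return (position, upstream_flank, downstream_flank) for each site."""
    tags = []
    start = sequence.find(site)
    while start != -1:
        end = start + len(site)
        upstream = sequence[max(0, start - tag_len):start]
        downstream = sequence[end:end + tag_len]
        tags.append((start, upstream, downstream))
        start = sequence.find(site, start + 1)
    return tags


def tag_snps(tag_a, tag_b):
    """List (offset, base_a, base_b) where two aligned tags differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(tag_a, tag_b)) if a != b]


# Two hypothetical individuals that differ at one base near the site.
ind_a = "TTACGGAGAATTCAGGCTTAACCGT"
ind_b = "TTACGGAGAATTCAGGCATAACCGT"

pos_a, up_a, down_a = rad_tags(ind_a)[0]
pos_b, up_b, down_b = rad_tags(ind_b)[0]
print(pos_a, up_a, down_a)       # 7 TTACGGA AGGCTTAA
print(tag_snps(down_a, down_b))  # [(4, 'T', 'A')]
```

Because every individual is cut at the same recognition sites, the tags line up across individuals without a reference genome, which is what makes them convenient for marker discovery.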
