**6. Genomic analysis**

## **6.1 Approach using candidate genes**

The candidate gene approach consists of selecting a particular gene as the most likely site of a mutation. A gene is usually chosen as a candidate on one of two grounds: (1) it is known to be defective in a similar disease in another species (usually humans or mice), or (2) its function makes it a plausible cause of the disorder. The analysis consists of sequencing the entire gene and comparing the sequences of two groups (healthy *vs* affected animals). However, the presence of a mutation in a gene is not in itself sufficient to identify the cause of the disorder. Moreover, for many genetic diseases no candidate gene has been identified, and very similar hereditary diseases can result from mutations in completely different genes. For example, hereditary copper toxicosis in the Bedlington terrier is phenotypically identical to Wilson's disease in humans, yet the gene involved in the human disease is not responsible for the disease in dogs. Despite these limitations, when it succeeds the candidate gene approach has the advantage of identifying the specific mutation and therefore allowing the creation of a targeted genetic test.
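The comparison of healthy and affected animals can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions that are not part of the text: each animal is represented by a single consensus sequence of the candidate gene, all sequences are already aligned to the same length, and the function name `candidate_variants` and the example sequences are hypothetical.

```python
def candidate_variants(affected, healthy):
    """Return (position, base) pairs where every affected animal carries
    the same base and no healthy animal carries it (positions are 1-based).

    Assumption: sequences are pre-aligned strings of equal length.
    """
    if not affected or not healthy:
        raise ValueError("both groups must contain at least one sequence")
    length = len(affected[0])
    if any(len(seq) != length for seq in affected + healthy):
        raise ValueError("sequences must be aligned to the same length")

    hits = []
    for i in range(length):
        affected_bases = {seq[i] for seq in affected}
        healthy_bases = {seq[i] for seq in healthy}
        # Keep sites that are fixed in the affected group and absent in the healthy group.
        if len(affected_bases) == 1 and affected_bases.isdisjoint(healthy_bases):
            hits.append((i + 1, next(iter(affected_bases))))
    return hits


# Hypothetical toy data: two affected and three healthy animals.
affected = ["ATGCCGTA", "ATGCCGTA"]
healthy = ["ATGCTGTA", "ATGCTGTA", "ATGCTGTA"]
print(candidate_variants(affected, healthy))  # [(5, 'C')]
```

A site reported by such a screen is only a candidate: as noted above, finding a sequence difference is not in itself sufficient to establish the cause of the disorder.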

## **6.2 Linkage analysis**

The method of linkage analysis is based on completely different assumptions from the candidate gene approach. The main difference is that no assumptions are made about which gene is responsible for the disease or, more generally, about which chromosomal tract is involved. In this method, the whole genome is potentially subjected to analysis, without directing attention to any particular region. The search for the causal mutation relies on genetic markers whose chromosomal position is known. The closer such markers are physically to the mutation site, the more likely they are to be co-inherited with the mutation from one generation to the next. In a very simplified way, linkage analysis evaluates whether any of the marker variants present in the population is associated with the presence of the disease.

The markers normally used for this type of study are microsatellites, which are considered practically ideal genetic markers because they are abundantly scattered throughout the genome and are generally highly polymorphic. The number of microsatellites used in a linkage analysis is not fixed, but in general the higher it is, the higher the probability that the study succeeds. Because attention is not directed towards specific genes or particular chromosomal portions, the genome screening must be as broad as possible, i.e. it must contain the highest possible number of markers in order to cover the whole genome (so-called genome-wide screening). Generally, a linkage study within an informative family uses between 200 and 300 microsatellites and pedigrees with at least a hundred animals.

For a given area of the genome, the probability of a recombination event occurring between a marker and a disease gene increases with the distance between them. The probability of this event is expressed as the recombination fraction (θ). If θ is equal to 0.5, the marker and the disease gene are not linked and therefore segregate independently. In other words, the marker and the gene are equally likely to be inherited together or separately. Conversely, if the marker and the disease gene are linked, θ is less than 0.5. The lod score (Z) is the parameter used to estimate the linkage between two genetic loci. Z is the base-10 logarithm of the ratio between the likelihood of the observed data if the two loci are linked (θ < 0.5) and the likelihood if the two loci recombine at random (θ = 0.5). Traditionally, linkage is accepted if the lod score is at least 3, i.e. odds of at least 1000:1 in favor of linkage.

Linkage analysis leads to the identification of a chromosomal region where the disease locus is probably located. The analysis must then continue with the so-called refinement, that is, a further linkage analysis. Only afterwards does the analysis proceed through a candidate gene approach: all the genes of the region are identified and a sequence analysis is performed.
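As a rough illustration of the lod score, the sketch below assumes the simplest textbook setting, which is not spelled out in the text: phase-known, fully informative meioses, each scored as recombinant or non-recombinant. With k recombinants among n meioses, the likelihood under linkage is θ^k (1 − θ)^(n − k) and the likelihood under no linkage is 0.5^n, so Z(θ) = log10[θ^k (1 − θ)^(n − k) / 0.5^n]. The function names `lod_score` and `max_lod` and the counts in the example are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point lod score Z(theta) for phase-known, fully informative meioses."""
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need meioses > 0 and 0 <= recombinants <= meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if theta == 0.0 and recombinants > 0:
        # Complete linkage is impossible once a recombinant has been observed.
        return float("-inf")

    non_recombinants = meioses - recombinants
    # log10 likelihood of the data if the loci are linked with this theta
    log_linked = non_recombinants * math.log10(1.0 - theta)
    if recombinants > 0:
        log_linked += recombinants * math.log10(theta)
    # log10 likelihood of the data if the loci are unlinked (theta = 0.5)
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def max_lod(recombinants: int, meioses: int):
    """Return (theta_hat, Z(theta_hat)), with theta_hat = k/n capped at 0.5."""
    if meioses <= 0:
        raise ValueError("need meioses > 0")
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(recombinants, meioses, theta_hat)


# Hypothetical example: 2 recombinants observed among 20 informative meioses.
theta_hat, z = max_lod(recombinants=2, meioses=20)
print(f"theta = {theta_hat:.2f}, Z = {z:.2f}")  # theta = 0.10, Z = 3.20
```

In this made-up example the maximum lod score exceeds the conventional threshold of 3, so linkage between the marker and the disease locus would be accepted. Real pedigrees with unknown phase or missing genotypes require dedicated linkage software rather than this closed-form count.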

## **6.3 Genomic markers**

#### *6.3.1 Mitochondrial markers*

Animal mtDNA is a circular molecule ranging from 14,000 to 26,000 bp. The mtDNA codes for 13 proteins. The mitochondrial genome encodes components of cell energy production and electron transfer (NADH dehydrogenase subunits, cytochrome oxidase subunits, ATPase 6 and 8, cytochrome b) together with tRNAs and the 12S and 16S rRNAs [14, 15]. The choice of the sequence to be used for the genetic analysis depends on the phylogenetic hypothesis to be tested: the D loop for sequences that evolve rapidly; cytochrome b for sequences that evolve moderately; cytochrome oxidase I for sequences that evolve slowly. The mitochondrial control region (CR) sequence is the most popular marker. mtDNA is uniparentally inherited (through the maternal line) and is characterized by a high rate of evolution (5–10 times higher than that of nuclear genes) and by the lack of introns and recombination. The mtDNA is used to clarify the direction of hybridization and the incidence of introgression. In the case of hybridization, however, relying only on the evolutionary history of the females can lead to erroneous inferences. In phylogeographic studies, information from various loci of the nuclear genome is therefore also included [16–18]. Using markers inherited from both parents allows a better analysis of the population structure.

*Canine Genetics and Genomics*

*DOI: http://dx.doi.org/10.5772/intechopen.95781*

#### *6.3.2 Microsatellite markers*

Nuclear microsatellites (tandem repeats of one to six nucleotides) are used in population genetics to describe population structure and to identify kinship [19]. Microsatellites are widely used because they are co-dominant, multi-allelic, highly reproducible and offer high resolution. A microsatellite locus is about 10 times more informative than an SNP marker. The most common repeats are di-, tri- and tetra-nucleotides. Microsatellite loci with a di-nucleotide motif are generally used, since they are easier to isolate and occur at high density (on average every 30–50 kb) [20]. Microsatellites are also known as SSRs (Simple Sequence Repeats) or STRs (Short Tandem Repeats). Their maximum length is about 200 bp. They are distributed throughout the genome, with greater prevalence in non-coding regions, and are neutral in terms of selection. The typical problems encountered in genotyping are: homoplasy (two alleles that are equal in the type and number of microsatellite repeats) [21]; stutters (in the form of allelic pre-peaks); and null alleles (NA), in which mutations in the primer binding site prevent the primer from annealing to the target sequence, so that some alleles are not amplified. The genetic analysis of microsatellites produces the following data: the distribution of allele frequencies for each microsatellite locus; the expected (HE) and observed (HO) heterozygosity; estimates of Fst values; Nei distances; and conformity to Hardy–Weinberg equilibrium (HWE) of the allele frequencies for each locus.

## **6.4 Next generation sequencing (NGS)**

Starting in the 2000s, the analysis of SNPs opened a new era in molecular genetics. The direct study of the genome using SNP markers makes it possible to integrate genealogical information and to obtain high levels of accuracy in the estimation of the main genetic parameters of a population. The development of new sequencing techniques has made it possible to study the consequences of gene flow using a larger number of markers. At the beginning, Sanger technology was used to sequence the genomes of different animal species. This sequencing technique produces long reads (>700 bp) with a very low error rate (<0.01%) but at a high cost (>600 US \$ per Gb). The technique was subsequently improved through the use of the Celera assembler, with a significant reduction in time and costs. Next Generation Sequencing (NGS) technologies, also known as High Throughput Sequencing (HTS) technologies, have evolved rapidly, offering an ever greater number of sequenced bases at a lower cost. In 2006, the first second-generation technologies (Second-Generation Sequencing, SGS) appeared. Illumina (MiSeq, HiSeq and NovaSeq) is the most popular platform, due to its high performance and low cost. This technology is based on the fragmentation of DNA and the amplification of the fragments in many parallel reactions, yielding short reads of between 100 and 300 bp. Depending on the library, it is possible to sequence only one end of the fragment (single-end reads) or both ends. The distance between the reads of a pair is called the insert size (mate pair: 2–5 kb; paired end: <1 kb). Since 2013,

