#### **6. The use of molecular data in measuring genetic distances**

Advances in molecular genetics led to a sharper definition of Euclidean distance: Beaumont *et al.* (1998) used it to mean a quantitative measure of the genetic difference between individuals, populations or species, calculated at the DNA sequence level or the allele frequency level.

Various genetic distance measures have been proposed for analyzing DNA-based data in genetic diversity studies. Powel *et al.* (1996) identified the DNA-based marker techniques as including Random Amplified Polymorphic DNA (RAPD), Amplified Fragment Length Polymorphism (AFLP), Restriction Fragment Length Polymorphism (RFLP) and, more recently, Simple Sequence Repeats (SSR, also called microsatellites) and single nucleotide polymorphisms (SNPs). Data from these markers can be analyzed as individual data sets or in combination with morphological and biochemical data. For DNA-based data, where the amplification products are equated to alleles, the allele frequencies can be calculated and the genetic distance between individuals *i* and *j* estimated as follows:

$$d(i,j) = \frac{1}{2}\left[\sum_{a=1}^{n} \left|X_{ai} - X_{aj}\right|^{r}\right]^{1/r}$$

Where $X_{ai}$ is the frequency of allele *a* in individual *i*, *n* is the number of alleles per locus and *r* is a constant that depends on the coefficient used. In its simplest form, i.e. *r* = 1, the genetic distance can be calculated as:

$$d(i,j) = \frac{1}{2}\left[\sum_{a=1}^{n} \left|X_{ai} - X_{aj}\right|\right]$$

When *r* = 2, *d*(*i*,*j*) is known as the Rogers (1972) measure of distance (RD):

$$RD_{ij} = \frac{1}{2}\left[\sum_{a=1}^{n} \left(X_{ai} - X_{aj}\right)^{2}\right]^{1/2}$$
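The distance measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `allele_distance` and the allele frequencies are hypothetical, and the absolute value inside the sum is an assumption made so that positive and negative differences do not cancel when *r* = 1.

```python
def allele_distance(x_i, x_j, r=1):
    """Distance between two individuals from their allele frequencies.

    Computes 1/2 * [sum over alleles of |x_ai - x_aj|**r] ** (1/r).
    The absolute value is an assumption (see the accompanying text).
    """
    if len(x_i) != len(x_j):
        raise ValueError("both individuals need frequencies for the same alleles")
    total = sum(abs(a - b) ** r for a, b in zip(x_i, x_j))
    return 0.5 * total ** (1.0 / r)


# Hypothetical allele frequencies at one locus with three alleles.
ind_i = [0.5, 0.5, 0.0]
ind_j = [0.0, 0.5, 0.5]

d_r1 = allele_distance(ind_i, ind_j, r=1)  # simple form, r = 1
d_rd = allele_distance(ind_i, ind_j, r=2)  # Rogers-type distance, r = 2
print(d_r1, d_rd)
```

With these made-up frequencies the *r* = 1 form gives 0.5, and the *r* = 2 form gives a smaller value because squaring down-weights small differences.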

Where allele frequencies are to be calculated for some molecular markers, the marker data must first be scored as a binary matrix for statistical analysis. Binary data were widely used to measure genetic distance long before the advent of molecular marker data, through the Rogers (1972) and Nei and Chesser (1983) coefficients, known as GDMR and GDNL respectively.

Whichever statistical formula is used to determine genetic diversity from molecular data, one specific problem is usually encountered: some genotypes fail to show amplification for some primer pairs. Robinson and Harris (1999) noted that a lack of amplification may be due to "null alleles". In most cases, however, it is difficult to ascribe a lack of amplification to a null allele with certainty. It therefore rests on the researcher's judgment to ensure that a genuine null allele is not treated as missing data when the genetic similarity or distance matrix is computed, so as to avoid gross errors in interpreting the results.
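The effect of this decision on a distance estimate can be illustrated with a small sketch. It uses a simple mismatch proportion on binary scores, which is chosen only for brevity and is not one of the coefficients named in the text. The genotype scores are hypothetical, and `None` is an assumed encoding for a marker treated as missing.

```python
def simple_mismatch(p, q):
    """Proportion of markers scored differently in two genotypes.

    Markers with a missing score (None) in either genotype are skipped.
    """
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no markers scored in both genotypes")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical binary scores (1 = band present, 0 = band absent).
g1 = [1, 0, 1, 1, 0]

# The third marker failed to amplify in genotype 2. It can be coded as
# missing data (None) or as a null allele (0, a true absence).
g2_as_missing = [1, 0, None, 1, 1]
g2_as_null = [1, 0, 0, 1, 1]

d_missing = simple_mismatch(g1, g2_as_missing)  # 1 mismatch in 4 markers
d_null = simple_mismatch(g1, g2_as_null)        # 2 mismatches in 5 markers
print(d_missing, d_null)
```

The same pair of genotypes yields 0.25 under one coding and 0.4 under the other, which is why the status of a failed amplification should be settled before the matrix is computed.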

DNA-based marker data have been used successfully to measure genetic distance in several crops: Pritchard *et al.* (2000) in pigeon pea, Beaumont *et al.* (1998) in wheat, Franco *et al.* (2001) in maize and Dje *et al.* (2000) in sorghum.

#### **7. Grouping techniques in measuring genetic diversity**

Genetic relationships among and within breeding materials can be identified and classified using multivariate grouping methods. Established multivariate statistical algorithms are important for classifying breeding materials (germplasm, accessions, lines and races) into distinct and variable groups according to genotype performance. Such groups may be defined by traits such as disease resistance, early maturity, reduced canopy or drought resistance. The techniques most widely used, irrespective of the data source (morphological, biochemical or molecular marker data), are cluster analysis, Principal Component Analysis (PCA), Principal Coordinate Analysis (PCoA), Canonical Correlation and Multidimensional Scaling (MDS).

Cluster analysis presents patterns of relationships between genotypes as hierarchical, mutually exclusive groupings, such that genotypes with similar descriptions are mathematically gathered into the same cluster (Hair *et al.* 1995; Aremu 2005). Five clustering methods are in common use: the unweighted pair group methods using arithmetic averages (UPGMA) and centroids (UPGMC), Single Linkage (SLCA), Complete Linkage (CLCA) and Median Linkage (MLCA). UPGMA and UPGMC group breeding materials more accurately in accordance with their pedigrees, and their results are more consistent with known heterotic groups than those of the other clustering methods (Aremu *et al.*, 2007a).
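As a rough illustration, the five linkage methods can be compared with SciPy's hierarchical clustering routines. The trait matrix below is hypothetical, and the mapping from the abbreviations used here to SciPy's `method` names (in particular MLCA to `"median"`) is an assumption rather than something stated in the cited works.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical trait matrix: 6 genotypes (rows) x 3 traits (columns).
traits = np.array([
    [1.0, 2.0, 1.5],
    [1.1, 2.1, 1.4],
    [0.9, 1.9, 1.6],
    [5.0, 6.0, 5.5],
    [5.2, 6.1, 5.4],
    [4.9, 5.8, 5.6],
])

# Assumed mapping from the abbreviations in the text to SciPy method names.
methods = {
    "UPGMA": "average",
    "UPGMC": "centroid",
    "SLCA": "single",
    "CLCA": "complete",
    "MLCA": "median",
}

groups = {}
for name, method in methods.items():
    # Build the tree from Euclidean distances between genotypes.
    tree = linkage(traits, method=method, metric="euclidean")
    # Cut the tree into at most two mutually exclusive clusters.
    groups[name] = fcluster(tree, t=2, criterion="maxclust")

for name, labels in groups.items():
    print(name, labels)
```

On data this well separated all five methods should recover the same two groups. Differences between methods tend to appear only when clusters overlap or are elongated.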

Principal component, canonical and multidimensional analyses are used to derive a 2- or 3-dimensional scatter plot of individuals such that the geometrical distances among individual genotypes reflect the genetic distances among them. Wiley (1981) defined principal components as a reduced form of the data that clarifies the relationships between breeding materials by expressing them in fewer, interpretable dimensions, which form new variables. These new variables are visualized as different, uncorrelated groups.
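The idea of placing genotypes in a plot so that geometrical distances mirror genetic distances can be sketched with a classical principal coordinate analysis (classical metric scaling). This is a minimal sketch under stated assumptions: the function name `pcoa` and the distance matrix are hypothetical, and the input distances are assumed to be Euclidean so that they can be reproduced exactly in two dimensions.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical principal coordinate analysis of a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_axes]
    positive = np.clip(eigenvalues[order], 0.0, None)
    return eigenvectors[:, order] * np.sqrt(positive)


# Hypothetical distances among four genotypes (corners of a 3 x 4 rectangle).
dist = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords = pcoa(dist, n_axes=2)

# Distances between the plotted points, for comparison with the input.
diff = coords[:, None, :] - coords[None, :, :]
recovered = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(recovered, 3))
```

For real genetic distance matrices the reproduction is only approximate, and the leading eigenvalues indicate how much of the distance structure the plotted axes retain.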

Principal component analysis first determines the eigenvalues, which give the amount of the total variation displayed on each component axis. The first three axes are expected to explain a large share of the variation among the genotypes. Cluster analysis and principal component analysis can be used jointly to explain the variation in breeding materials in genetic diversity studies.
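The eigenvalue step can be sketched with NumPy as follows. The trait values are hypothetical, and standardizing the traits (so that the analysis works on the correlation matrix) is an assumption. Analyses on the covariance matrix are also common.

```python
import numpy as np

# Hypothetical data: 6 genotypes (rows) x 4 quantitative traits (columns).
data = np.array([
    [62.0, 110.0, 24.5, 3.1],
    [58.0, 102.0, 22.0, 2.8],
    [71.0, 125.0, 30.2, 3.9],
    [66.0, 118.0, 27.1, 3.4],
    [55.0, 98.0, 20.3, 2.5],
    [69.0, 121.0, 29.0, 3.7],
])

# Standardize each trait to mean 0 and unit variance.
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# The covariance of standardized traits is the correlation matrix.
corr = np.cov(standardized, rowvar=False)

# eigh suits symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variation carried by each component axis.
explained = eigenvalues / eigenvalues.sum()

# Genotype scores on the first three axes, ready for a scatter plot.
scores = standardized @ eigenvectors[:, :3]

print(np.round(explained, 3))
print("first three axes:", round(float(explained[:3].sum()), 3))
```

The `scores` array gives the coordinates of each genotype on the first three axes, and the cumulative share printed at the end shows how much of the total variation those axes explain.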
