A Comprehensive Survey of International Soybean Research - Genetics, Physiology, Agronomy and Nitrogen Relationships

Kanazawa et al. (1998) gathered 1,097 *G. soja* plants from all over Japan and analyzed the RFLP of their mitochondrial DNA (mtDNA) using five probes (*coxI*, *coxII*, *atp6*, *atp9*, *atp1* = *atpA*) [58] (Table 1). The study identified and characterized 20 different types of mitochondrial genomes, labeled as combinations of types I to VII and types a to k. Nearly all the mtDNA types described for soybean cultivars also occurred in wild soybean.

The mitochondrial *atpA* gene was also analysed [48]. In soybean this gene is 90-97% identical in sequence to the mitochondrial genes of other plants [71-81]. Sequence similarity is limited to the *atpA* coding region. An intriguing feature of the soybean *atpA* open reading frame is that its putative translation termination site overlaps an unidentified 642 nt open reading frame, *orf214*. The end of the *atpA* open reading frame contains four tandem UGA codons that overlap four tandem AUG codons initiating *orf214*. The *atpA-orf214* region was found in soybean mtDNA in multiple sequence contexts, which can be attributed to the presence of two recombination repeats.

An open reading frame that shares 79% nucleotide identity with *orf214* is located at the same position in the *atpA* locus of common bean (*orf209*) [82]. This organization resembles the overlapping *atpB* and *atpE* reading frames found in several chloroplast genomes [83, 84], which raises the possibility that *orf214* encodes a different ATPase subunit. That possibility cannot be evaluated, however, because small ATPase subunits are poorly conserved [85].

So far, a total of 26 mtDNA haplotypes of wild soybean have been identified based on RFLP with probes from two mitochondrial genes, *cox2* and *atp6* [69, 86] (Table 1). The three most common haplotypes (Id, IVa and Va) are present in 43 populations. The distribution of mtDNA haplotypes varies among populations [87]. Shimamoto (2001) analyzed the genetic polymorphism of mitochondrial genes in subgenus *Soja* originating from China and Japan [88] (Table 1) and distinguished six types of mitochondrial genomes.

**3. Chloroplast genome**

Extensive research over the past two decades has made cpDNA analysis a source of fundamental changes in plant systematics. The chloroplast genome is well suited to phylogenetic analyses of plants for several reasons. First, it is abundant in plant cells and taxonomically ubiquitous. Second, because it is well researched, it can be easily tested under laboratory conditions and analyzed comparatively. Moreover, it often contains structural features that serve as cladistically useful markers and, above all, it exhibits a moderate or low rate of nucleotide substitution [89]. For cpDNA, as for the mitochondrial genome, researchers use two distinct phylogenetic approaches [90]: surveying the taxonomic distribution of specific molecular features of cpDNA, and sequencing specific genes or regions.

### **3.1. Chloroplast genome of soybean**

In estimating the phylogeny of plants belonging to *Glycine*, particular attention has been paid to unusual and specific features of cpDNA. Among the many studies on the variability of the chloroplast genome, a breakthrough came in 1993 with a study assessing the phylogeny of seed plants. The study used a huge database of nucleotide sequences of the *rbcL* gene [91], which encodes the large subunit of ribulose-1,5-bisphosphate carboxylase. The accumulation of comparative data on this chloroplast gene made it a frequent object of research, because the gene is large (> 1400 bp) and provides many phylogenetically informative characters. The rate of *rbcL* evolution proved appropriate for assessing plant phylogeny, especially at the medium and high taxonomic levels. Over the years, sequences from further species were added, as were many other genes, including another chloroplast gene, *atpB*, which encodes an H+-ATPase subunit [92-95].

The *atpB-rbcL* spacer varies in length in *Glycine* as well as in other seed plants. The study by Chiang (1998) shows that the size of the *atpB-rbcL* spacer in the studied species ranges from 524 bp to 1000 bp [5]. In this non-coding region, deletions, insertions and nucleotide substitutions are common, which can also be observed in *Glycine*. In *Glycine max*, the chloroplast genome differs from the typical chloroplast gene order by a single large inversion of approximately 51 kb, located between the *rbcL* gene and the *rps16* intron [96]. This inversion is also present in other legumes: it was reported in *Lotus* and *Medicago* [96]. In addition, the non-coding *atpB-rbcL* region is AT-rich, and most AT-rich non-coding regions carry few functions [97, 98]. This predisposes them to faster evolution, and hence to use in molecular systematics.
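The AT-richness attributed above to non-coding regions such as the *atpB-rbcL* spacer is straightforward to quantify. The following is a minimal sketch, assuming plain DNA strings as input; the function name `at_content` and both toy fragments are hypothetical and are not real *Glycine* sequence.

```python
def at_content(seq: str) -> float:
    """Return the fraction of A/T bases among the unambiguous bases of seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / total


# Hypothetical toy fragments for illustration only (not real Glycine data).
spacer_like = "ATTTAAATATTAGCATTTTAAATTA"
coding_like = "ATGGCTGCAGGTCCGAAGCTTGCCA"

print(f"spacer-like AT fraction: {at_content(spacer_like):.2f}")
print(f"coding-like AT fraction: {at_content(coding_like):.2f}")
```

Applied to real spacer and coding sequences, the same function would allow the AT percentages quoted in this chapter to be checked directly.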

The summary phylogeny was based on sequences of several cpDNA genes from hundreds of spermatophytes, including *Glycine* (Table 2). These genes can be divided into three classes. The genes encoding the structure of the photosynthetic apparatus form the first class. The second class includes the rRNA genes and the genes encoding the chloroplast genetic apparatus. The last class consists of an average of about 30 tRNA-encoding genes [99], although their number can vary from 20 to 40 [100, 101].

| Genes | Products |
|---|---|
| **Genes for the photosynthesis system** | |
| *psaI, J, K, L, M, N* | Photosystem I: I-, J-, K-, L-, M-, N-proteins |
| *atpA, B, E* | H+-ATPase, CF1 subunits α, β, ε |
| *atpF, H, I* | H+-ATPase, CF0 subunits I, III, IV |
| *petB, D* | Cytochrome b6/f complex, subunits b6, IV |
| *ndhA-K* | NADH dehydrogenase, subunits ND 1, NDI 1 |
| **Genes for the genetic system** | |
| *16S rRNA* | 16S rRNA |
| *23S rRNA* | 23S rRNA |
| *trnA-UGC* | Alanine tRNA (UGC) |
| *trnG-UCC* | Glycine tRNA (UCC) |
| *trnH-GUG* | Histidine tRNA (GUG) |
| *trnI-GAU* | Isoleucine tRNA (GAU) |
| *trnK-UUU* | Lysine tRNA (UUU) |
| *trnL-UAA* | Leucine tRNA (UAA) |
| *rps2, 7, 12, 16* | 30S ribosomal proteins CS2, CS7, CS12, CS16 |
| *rpl2, 20, 32* | 50S ribosomal proteins CL2, CL20, CL32 |
| *rpoA, B, C1, C2* | RNA polymerase, subunits α, β, β', β'' |
| *matK* | Maturase-like protein |
| *sprA* | Small plastid RNA |
| **Others** | |
| *clpP* | ATP-dependent protease, proteolytic subunit |
| *irf168 (ycf3)* | Intron-containing reading frame (168 codons) |

**Table 2.** Chloroplast genes for the photosynthesis system, for the genetic system, and others.

The complete *Glycine max* chloroplast genome is 152,218 bp in size. It contains a pair of inverted repeats (IRa and IRb) of 25,574 bp each, which are separated by a unique small single copy (SSC) region (17,895 bp) [98]. In addition, the genome contains a large single copy (LSC) region of unique sequence with 83,175 bp. The IR extends from the *rps19* gene up to *ycf1*. The *Glycine* chloroplast genome contains 111 unique genes and 19 duplicate copies in the IR, amounting to a total of 130 genes. The cpDNA analysis has shown the presence of 30 different tRNAs, 7 of which are repeated within the IR regions. The genome is composed of 60% coding regions (52% protein-coding genes and 8% RNA genes) and 40% non-coding regions, including both intergenic spacers and introns. The total content of GC and AT pairs in the *Glycine* chloroplast genes is 34% and 66%, respectively. A distinctly higher percentage of AT pairs was observed in non-coding regions (70%) than in coding regions (62%) [98].

In comparison with other eukaryotic genomes, cpDNA is highly compact. For example, only 32% of the rice chloroplast genome is non-coding. In *Glycine max* the figure is slightly higher, at 40%. Most of the non-coding DNA is found in very short fragments that separate functional genes. Some studies have shown complex patterns of mutational change in the non-coding regions. One of the best-known regions of the chloroplast genome in many legumes is the region downstream of the *rbcL* gene. This non-coding sequence is flanked by *rbcL* and *psaI* (the gene encoding polypeptide I of photosystem I).

### **3.2. Extent of IR in Glycine**

Analysis of the inverted repeat (IR) regions in *Glycine max* has shown that they are separated by a large region and a small region of unique sequence. In cpDNA the repeated sequences are usually located asymmetrically, which results in the formation of long and short regions of unique sequence [102]. The IR in *Glycine* is a region of 25,574 bp containing 19 genes. At the IR/LSC junction, 68 bp of the 5' end of the *rps19* gene are duplicated, and at the IR/SSC junction 478 bp of the 5' end of the *ycf1* gene are duplicated. A comparison of the cpDNA IR region in *Medicago*, *Lotus*, *Glycine* and *Arabidopsis* indicates that the IR has changed in the two legumes. *Glycine* and *Lotus* have 478 bp and 514 bp of *ycf1* duplicated, whereas *Arabidopsis* has 1,027 bp duplicated in the IR. This contraction of the IR in these legumes accounts for the smaller size of their IR and the larger size of their SSC. In addition to the contraction of the IR boundary in legumes, IRa has been lost in *Medicago*. This loss has resulted in *ndhF* (usually located in the SSC) being adjacent to *trnH* (usually the first gene in the LSC at the LSC/IRa junction). Loss of one copy of the IR in some legumes provides support for the monophyly of six tribes [103-106]. Wolfe (1988) identified duplicated sequences of portions of two genes, 40 bp of *psbA* and 64 bp of *rbcL*, in the region of the IR deletion between *trnH* and *ndhF* in *Pisum sativum*, and these duplications were later identified in broad bean (*Vicia faba*) [104, 107].

According to many researchers, the IR is the most conserved part of the chloroplast genome and is therefore thought to stabilize the plastid DNA molecule [108, 109]. The loss of the IR can thus be phylogenetically informative at the local level but misleading at the level of global phylogeny, because the loss likely occurred independently in more than one group of plants. Conifers and some legumes (*Pisum sativum*, *Vicia faba*, *Medicago sativa*), for example, contain only one IR copy. Perhaps the lack of repeat sequences in these plants is associated with an increased incidence of rearrangement of chloroplast genomes [109].
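The region sizes reported in this chapter for *Glycine max* can be checked against the total genome size, since a quadripartite plastome consists of the LSC, the SSC and two IR copies. The sketch below is only an arithmetic illustration; the class name `PlastomeRegions` and its fields are hypothetical, while the numbers are those quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlastomeRegions:
    """Sizes (bp) of the regions of a quadripartite chloroplast genome."""
    lsc: int  # large single copy region
    ssc: int  # small single copy region
    ir: int   # ONE copy of the inverted repeat

    @property
    def total(self) -> int:
        # The IR is present twice (IRa and IRb).
        return self.lsc + self.ssc + 2 * self.ir

    def fraction(self, length: int) -> float:
        """Share of the whole genome taken up by `length` base pairs."""
        return length / self.total


# Values quoted in the text for Glycine max.
glycine_max = PlastomeRegions(lsc=83_175, ssc=17_895, ir=25_574)

print(glycine_max.total)  # 152218, matching the reported genome size
print(f"{glycine_max.fraction(2 * glycine_max.ir):.1%} of the genome lies in the IR")
```

The sum reproduces the reported 152,218 bp only when the 25,574 bp figure is counted once per IR copy, which confirms that it is the length of a single repeat.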

Introns and intergenic sequences in legume chloroplast DNA have become extremely important tools in phylogenetic analyses aimed at the systematics of this group [110, 111]. Moreover, microstructural changes occur with great frequency in these regions of cpDNA. The body of existing research suggests that mutations in the non-coding regions, together with the relatively fast evolution of some coding regions of the organelle genome, can serve as valuable markers for separating species by their evolutionary origin [110, 111]. Plant systematics generally considers chloroplast indels to be phylogenetic markers because of their low prevalence in comparison with nucleotide substitutions [5].
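The contrast between indels and nucleotide substitutions can be made concrete by counting both in a pairwise alignment. The following is a minimal sketch, assuming two pre-aligned sequences of equal length with `-` as the gap symbol; the function name and the toy alignment are hypothetical, and treating each run of consecutive gap columns as one indel event is a simplifying assumption.

```python
def count_differences(aligned_a: str, aligned_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (substitutions, indel_events) for two aligned sequences.

    A run of consecutive gap columns in the same sequence is counted
    as a single indel event.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    substitutions = 0
    indel_events = 0
    open_gap = None  # which sequence currently has an open gap run: "a", "b" or None

    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        if x == gap or y == gap:
            side = "a" if x == gap else "b"
            if side != open_gap:  # a new gap run starts here
                indel_events += 1
                open_gap = side
            continue
        open_gap = None  # an aligned base pair closes any gap run
        if x != y:
            substitutions += 1

    return substitutions, indel_events


# Hypothetical toy alignment: one substitution, one 2-bp indel, one 1-bp indel.
seq_a = "ACGT--ACGTTA"
seq_b = "ACCTGGACG-TA"
print(count_differences(seq_a, seq_b))
```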

### **3.3. CpDNA markers**

There are many methods of generating molecular markers that rely on site-specific amplification of a selected DNA fragment using the polymerase chain reaction (PCR) and its further processing (restriction analysis, sequencing). Initially, research on the plant genome (mostly phylogenetic studies) used non-coding and coding sequences of chloroplast DNA. With time, genes or DNA segments located in nuclear, mitochondrial (mtDNA) and chloroplast (cpDNA) DNA found a prominent place among plant DNA markers. Fully automated DNA sequencing made it possible to subject ever-newer regions of plant DNA to comparative sequencing.

One of the most frequently sequenced cpDNA fragments in the phylogeny of spermatophytes is the *rbcL* gene, which encodes the large subunit of ribulose bisphosphate carboxylase (RuBisCO). Its length in most plants is 1,428, 1,431 or 1,434 bp, and insertions and deletions within it are extremely rare [94]. For many years this gene has been the subject of many comprehensive phylogenetic analyses of subgenus *Glycine* [112-114]. The *rbcL* gene is most commonly used in analyses at the family and genus levels, but there is also research at the lower levels, on cultivars and wild soybean [98, 115, 116]. A marker with characteristics very similar to those of *rbcL* (rate of evolution, length of 1,497 bp) is *atpB*, the gene encoding the β subunit of ATP synthase [94]. The *matK* gene, which is 1,550 bp long and encodes a maturase involved in the splicing of group II introns, is characterized by a rapid rate of evolution that allows it to be used in research at the species and genus levels [117, 118]. Frequent mutations in this gene make it unsuitable for studies at higher taxonomic levels. Other popular cpDNA sequences used in phylogenetic studies of legumes include *ndhF* (the gene encoding a subunit of NADH dehydrogenase), 16S rDNA, the non-coding *atpB-rbcL* region [94], the *trnL* (UAA) intron, and the intergenic spacer between the *trnL* (UAA) exon and the *trnF* (GAA) gene [96, 117-119].

It should be noted that the rate of evolution of a specific DNA region to be used as a marker can vary significantly not only among systematic groups, but also within these groups [98]. Moreover, different DNA fragments within the same group evolve at different rates: the *ndhF* cpDNA sequence in the *Solanaceae* family, for example, provides about 1.5 times more information in terms of parsimony than *rbcL* [90]. Therefore each gene or other DNA fragment used as a genetic marker has a typical range of "taxonomic" or phylogenetic applications, which can vary significantly within a taxon. For this reason, the *rbcL* sequence has been widely used in *Glycine* for many years at the species and genus levels [104, 117, 118].
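Comparisons such as "1.5 times more information in terms of parsimony" rest on counting parsimony-informative sites in an alignment. Under the standard definition, a site is informative when at least two different character states each occur in at least two sequences. The sketch below applies that definition to a hypothetical toy alignment; the function name is an assumption, and gaps and ambiguous symbols are simply ignored.

```python
from collections import Counter


def parsimony_informative_sites(alignment: list[str]) -> list[int]:
    """Return the 0-based columns that are parsimony-informative.

    A column qualifies when at least two different bases each occur
    in at least two sequences. Gaps and ambiguity codes are ignored.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")

    informative = []
    for col in range(lengths.pop()):
        counts = Counter(seq[col].upper() for seq in alignment)
        shared_states = [base for base in "ACGT" if counts[base] >= 2]
        if len(shared_states) >= 2:
            informative.append(col)
    return informative


# Hypothetical toy alignment of four taxa (not real rbcL or ndhF data).
toy_alignment = [
    "AAGCTA",
    "AAGCTT",
    "ATGGTA",
    "ATGGCA",
]
print(parsimony_informative_sites(toy_alignment))
```

Dividing the number of informative sites by the alignment length gives a per-site measure that allows markers of different lengths to be compared.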

### **3.4. The genetic diversity of soybeans**

The importance of genetic variation in facilitating plant breeding and/or conservation strategies has long been recognized [121]. Molecular markers are useful tools for assaying genetic variation and provide an efficient means to link phenotypic and genotypic variation [122]. In recent years, progress in the development of DNA-based marker systems has advanced our understanding of genetic resources. These molecular markers are classified as: (i) hybridization-based markers, such as restriction fragment length polymorphisms (RFLPs); (ii) PCR-based markers, such as random amplification of polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs), inter simple sequence repeats (ISSRs) and microsatellites or simple sequence repeats (SSRs); and (iii) sequence-based markers, such as single nucleotide polymorphisms (SNPs) [121, 123]. The majority of these molecular markers have been developed either from genomic DNA libraries (e.g. RFLPs or SSRs), from random PCR amplification of genomic DNA (e.g. RAPDs), or both (e.g. AFLPs) [123]. The availability of an array of molecular marker techniques and their modifications has led to comparative studies among them in many crops, including soybean, wheat and barley [124-126]. Among all these, SSR markers have gained considerable importance in plant genetics and breeding owing to many desirable attributes, including hypervariability, multiallelic nature, codominant inheritance, reproducibility, relative abundance, extensive genome coverage (including organellar genomes), chromosome-specific location, amenability to automation and high-throughput genotyping [127]. In contrast, RAPD assays are not sufficiently reproducible, whereas RFLPs are not readily adaptable to high-throughput sampling. AFLP is complicated because individual bands are often composed of multiple fragments, mainly in large genome templates [123]. The general features of DNA markers are presented in Table 3.


**Table 3.** Important features of different types of molecular markers.


The genetic diversity of wild and cultivated soybeans has been studied by various techniques, including isozymes [128], RFLP [87], SSR markers [124], and cytoplasmic DNA markers [87, 128, 129]. Based on haplotype analysis of chloroplast DNA, cultivated soybean appears to have multiple origins from different wild soybean populations [129, 130].

Using the PCR-RFLP method, soybean chloroplast DNAs were classified into three main haplotype groups (I, II and III) [113, 130, 131]. Type I is mainly found in cultivated soybean (*Glycine max*), while types II and III are often found in both the cultivated form and the wild form of soybean (*Glycine soja*). Type III is by far the most dominant in wild soybean [113]. In *Glycine*, these types are widely used in evaluating cpDNA variability and in determining phylogenetic relationships between different types of cpDNA using different marker systems. According to Chen and Hebert (1999) [133], analysis of cpDNA sequence alone is not sufficient for population genetic analysis, so cpDNA polymorphism assessment must be complemented with methods such as single-strand conformation polymorphism (SSCP) [134], dideoxy fingerprinting (ddF) [135], and directed termination polymerase chain reaction (DT-PCR). However, some researchers point out that these methods have many disadvantages, mainly their high cost and the large amount of work needed to obtain results. In their view, single changes in *Glycine* chloroplast DNA at the species and genus levels should instead be located with site-specific markers, for example non-coding regions, using PCR and sequencing.

Analyses of non-coding regions of cpDNA have been employed to elucidate the phylogenetic relationships of different taxa [90]. Compared with coding regions, non-coding regions may provide more informative characters in phylogenetic studies at the species level because of their high variability, which is due to the lack of functional constraints. Non-coding regions of cpDNA have been assayed either by direct sequencing [136-141] or by restriction-site analysis of PCR products (PCR-RFLP) [142-146]. According to Small (1998), non-coding regions, which include introns and intergenic sequences, often show greater nucleotide variability than coding regions, which makes them good phylogenetic markers [139]. Mutations in the form of insertions and deletions accumulate in non-coding regions at the same rate as nucleotide substitutions, and such mutations significantly accelerate change in these regions. In many cases, insertions or deletions are related to short repeat sequences. Therefore, many researchers continue to focus on the analysis of non-coding regions.

Using the RFLP method, Close et al. (1989) found six cpDNA haplotypes in cultivated and wild soybeans and described them as groups I to VI [147]. They found that groups I and II diverge from groups III to VI, thus dividing subgenus *Soja* into two main groups, and hypothesized that group II can be distinguished from group III by two independent mutations. Similar groups of haplotypes were also obtained by Shimamoto et al. (1992) [128] and Kanazawa et al. (1998) [148], using a combination of *Eco*RI and *Cla*I RFLPs. In their classification, Kanazawa et al. (1998) relied on sequence analysis and found that the differences among the three types described by Shimamoto et al. (1992) resulted from two single-base substitutions: one in the non-coding region between *rps11* and *rpl36*, and the other in the 3' part of the coding region of *rps3*.

Based on the existing reports, Xu et al. (2000) sequenced nine non-coding regions of cpDNA for seven cultivars and 12 wild forms of soybean (*Glycine max*, *Glycine soja*, *Glycine tabacina*, *Glycine tomentella*, *Glycine microphylla*, *Glycine clandestina*) in order to verify the earlier classification of *Glycine* [113]. They located eleven single-base changes (substitutions and deletions) in the 3,849 bp of sequence collected. They located five mutations in the distinguished haplotypes I and II, and seven mutations in type III. In addition, haplotypes I and II were identical and clearly different from the taxa in type III. The study was not conclusive, since the different types of cpDNA could not have originated monophyletically, but it contributed to finding a common ancestor in the course of evolution of *Glycine*. A neighbor-joining tree resulting from the sequence data revealed that subgenus *Soja* connected with *Glycine microphylla*, which formed a clade distinct from *Glycine clandestina* and the tetraploid cytotypes of *Glycine tabacina* and *Glycine tomentella*. Several informative length mutations of 54 to 202 bases, due to insertions or deletions, were also detected among the species of the genus *Glycine*.
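The link between single-base substitutions and RFLP haplotypes can be illustrated with an in-silico digest: a substitution inside a recognition site removes the cut, and the fragment pattern changes. The sketch below uses the well-known recognition sites of *Eco*RI (G^AATTC) and *Cla*I (AT^CGAT). The function name and both toy sequences are hypothetical, and the sequence is treated as linear for simplicity, although cpDNA is circular.

```python
import re

# Recognition site and cut offset on the top strand for each enzyme.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "ClaI": ("ATCGAT", 2),   # AT^CGAT
}


def digest(seq: str, enzyme_names: list[str]) -> list[int]:
    """Return the fragment lengths of a LINEAR sequence cut by the enzymes."""
    seq = seq.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        # A lookahead finds every site occurrence, including overlapping ones.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    bounds = [0, *sorted(cuts), len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical toy haplotypes: hap_b carries one substitution (C -> T)
# that abolishes the EcoRI site present in hap_a.
hap_a = "TTTT" + "GAATTC" + "AAAAAA" + "ATCGAT" + "CCCC"
hap_b = "TTTT" + "GAATTT" + "AAAAAA" + "ATCGAT" + "CCCC"

print(digest(hap_a, ["EcoRI", "ClaI"]))  # three fragments
print(digest(hap_b, ["EcoRI", "ClaI"]))  # two fragments
```

In a PCR-RFLP assay the same comparison is made on a gel: the two haplotypes differ by one base but produce distinguishable banding patterns.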


### **3.5. Non-coding regions of the chloroplast genome as site-specific markers in Glycine**

In the chloroplast genomes of legumes, including soybean, there are many non-coding regions, which are characterized by a faster rate of evolution than the coding regions. As mentioned earlier, some of the chloroplast genes have introns, yet their structure differs from that of introns in nuclear genes: cpDNA introns tend to adopt a secondary structure, and this structure constrains the way in which they evolve. This restriction on mutational change is related to the functional requirements of intron formation [98, 108]. As there are no adequate studies on the evolution of introns, it can be assumed that their evolution is similar to that of the protein-encoding genes. The loss of introns in the course of the evolution of chloroplast DNA is an interesting process. It has been discovered that *O. sativa* has three fewer introns in its cpDNA than *M. polymorpha* and *N. tabacum*. The loss of the intron in the *rpl2* gene was examined in 340 species representing 109 families of angiosperms, including *Glycine* [149]. The taxonomic distribution of its absence shows that this intron was lost at least six times in the evolution of angiosperms. In *Glycine* 23 introns have been identified, while in *Arabidopsis thaliana* there are 26 introns, mostly located in the same genes and at the same positions within those genes [98, 102].

Non-coding regions of chloroplast DNA have become a major source of data for phylogenetic studies within the genus *Glycine* and in many other seed plants. Earlier, the sequences most often used in phylogenetic work were coding regions, such as the *rbcL* gene, which were used to determine the phylogenetic relationships between species in major taxonomic groups [113, 136-141]. According to Taberlet et al. (1991) [119], the potential of non-coding regions of cpDNA lies at the lower taxonomic levels: these regions, which include introns and intergenic sequences, often show greater nucleotide variability than the coding regions, which predisposes them to use in population studies of *Glycine* and other plants [139, 142].

As a result, many studies on the phylogenetic utility of non-coding regions have been published [110]. Examples include *trnH-psbA* and *trnS-trnG* [150]; *rps11-rpl36* and *rpl16-rps3* [148]; and *trnT-trnL* [113].

In cpDNA analysis of many plants, highly conserved regions that flank areas of high variability are used. The more conserved these flanking regions are, the greater the chance that primers designed for the PCR reaction will anneal across a broad taxonomic group [96, 113]. The region between the *trnT* (UGU) and *trnF* (GAA) genes lies in the large single-copy region and is suitable because the *trn* genes are conserved and the region contains several hundred base pairs of non-coding sequence. The intergenic spacer between *trnT* (UGU) and the *trnL* (UAA) 5' exon ranged from 298 bp to about 700 bp in the species studied by Taberlet et al. [119]. In the plant genomes completely sequenced by Sugiura, the length of this region differs: it amounts to 770 bp in rice and 710 bp in tobacco. In *Marchantia polymorpha* it is 188 bp [151]. This spacer lies between tRNA genes, as does the non-coding sequence between the *trnL* (UAA) 3' exon and *trnF* (GAA). Because of its catalytic properties and its secondary structure, the *trnL* (UAA) intron, which is a group I intron, is less variable and therefore more useful for evolutionary studies at higher taxonomic levels [113]. Moreover, depending on the species, these non-coding regions show a high frequency of insertions or deletions, which makes them potentially useful as genetic markers.
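The idea of placing primers in conserved stretches that flank a variable region can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the three aligned sequences are invented toy data (not real *Glycine* sequences), the function names are hypothetical, and a column is treated as conserved only when all sequences carry the same base and no gap. Real primer design would also consider melting temperature, GC content, and secondary structure.

```python
def conserved_columns(alignment):
    """Return one boolean per alignment column: True when every
    sequence has the same base there and the column has no gap."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)]


def conserved_windows(alignment, min_len):
    """Return half-open (start, end) intervals of consecutive conserved
    columns that are at least min_len long (candidate primer sites)."""
    flags = conserved_columns(alignment)
    windows = []
    start = None
    # A trailing False acts as a sentinel that closes a run at the end.
    for i, ok in enumerate(flags + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                windows.append((start, i))
            start = None
    return windows


# Hypothetical toy alignment: an indel and a substitution in the middle,
# flanked by conserved stretches on both sides.
toy_alignment = [
    "ACGTACGTAA--TTGCAGGCTAGC",
    "ACGTACGTAAGCTTACAGGCTAGC",
    "ACGTACGTAA--TTGCAGGCTAGC",
]

print(conserved_windows(toy_alignment, 8))  # [(0, 10), (15, 24)]
```

In this toy case the two reported windows bracket the columns that carry the indel and the substitution, which mirrors the situation described above: conserved flanks for primer annealing, with the variable non-coding sequence between them.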


**Table 4.** Primers used for amplification of nine non-coding regions of soybean cpDNA.

In most studied species, the *trnL* (UAA) intron ranges in size from 254 to 767 bp. Its smaller fragment, the P6 loop, reaches a length of 10 to 143 bp. The intron is commonly applied in DNA barcoding. Its main limitation lies in its relatively low resolution: 67.3% of the species represented in GenBank can be identified with the whole intron, and only 19.5% with the P6 loop. However, it also has some advantages: the primers are highly conserved and the amplification process is trouble-free. Amplification of the P6 loop can be performed even with very degraded DNA. The intron is well known, and its sequences are used to determine phylogenetic relationships between closely related species or to identify a plant species [152]. The first universal primers for this region were designed more than 20 years ago [119]. However, it is not among the most variable non-coding regions in chloroplast DNA [108]. The *trnL* (UAA) intron is the only group I intron in chloroplast DNA, which means that its secondary structure is highly conserved, with conserved [113] and variable regions [99, 153] alternating along its length. Consequently, comparing the diversity of *trnL* intron sequences makes it possible to design new primers that lie in conserved regions and amplify the short sections contained between them [152].
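How a primer pair placed in conserved regions delimits a short amplified section can be sketched in silico. The Python example below is a simplified illustration: the template and both primers are invented toy sequences (they are not published *trnL* primers), the function names are hypothetical, and only exact matches on a single strand of unambiguous A/C/G/T bases are considered.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_amplicon(template, forward, reverse):
    """Return the amplified fragment (primer sites included), or None
    when either primer has no exact match on the template."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we search for its reverse complement downstream of the
    # forward primer.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Hypothetical toy data: two "conserved" primer sites around a short
# "variable" stretch.
template = "TTGACC" + "GGATAGGTC" + "AAAATTTCCC" + "ACGCATCG" + "AAGG"
forward_primer = "GGATAGGTC"
reverse_primer = "CGATGCGT"  # reverse complement of ACGCATCG

amplicon = in_silico_amplicon(template, forward_primer, reverse_primer)
print(amplicon, len(amplicon))  # GGATAGGTCAAAATTTCCCACGCATCG 27
```

Running the same primer pair against templates from different species would give amplicons of different lengths wherever the enclosed stretch carries insertions or deletions, which is the property that makes short fragments such as the P6 loop usable as markers.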

Thus, in angiosperms, using non-coding regions in research at lower taxonomic levels is a routine practice [108]. A large number of non-coding regions of cpDNA have been located in angiosperms, some of which are highly variable, whereas others show relatively small variability [108]. In studying the chloroplast genome, many researchers have looked for universal primers that would allow amplification of many non-coding regions of cpDNA (Table 4) [111, 113, 148, 150].
