Fig. 4. *Drosophila melanogaster* heat shock protein gene duplicates: The figure shows the duplications found by the algorithm with Hsp23 as query. The genomic region contains, from left to right in the drawing, the identified genes Hsp67Bc, Hsp22, CG4461, Hsp26, the query gene Hsp23, and another gene duplicate, Hsp27. Gene duplications on the reverse strand are marked by an arrow in reverse direction.

### **3.1.2 Duplicated exons in six tandemly arrayed genes including a lost intron and a pseudogene**

The new algorithm is able to reconstruct tandemly arrayed gene duplications containing many exons and gene duplicates. The *Drosophila melanogaster* CG30047 gene includes 12 exons. Five duplicates of this gene could be identified with the algorithm (Fig. 5, top). In the second duplicated gene an intron loss could be identified: exons 11 and 12 of CG30047 are translated as one exon in this duplicate (Fig. 5, bottom). To find such lost introns the option to *search for concatenated exons* was enabled. The third duplicate most probably represents a pseudogene, because exon 11 contains a frame shift and could thus not be found. Sequencing and assembly errors could also explain the frame shift. However, the *Drosophila melanogaster* genome (Adams et al., 2000) is one of the best available and a lot of effort has been spent on the finishing process. Thus, it is more probable that the third duplicate is a pseudogene. Exon 1, which codes for seven amino acids, has low complexity and could therefore only be identified in the second gene duplication by setting the *minimal exon length* parameter to 7 aa.

Fig. 5. *Drosophila melanogaster* CG30047: Five gene duplications were found for the CG30047 gene. In the second duplication the intron between exon 11 and 12 was lost as shown in the alignment. The alignment of exon 11 (CG30047) and the alignment of the corresponding region in the duplicated gene were shortened by amino acids 682 to 801 for representation purposes.

Predicting Tandemly Arrayed Gene Duplicates with Webscipio 71

### **3.1.3 Myosin heavy chain gene duplicates**

Mammals encode two clusters of muscle myosin heavy chain genes, one cluster containing the α- and β-cardiac muscle myosin heavy chain genes (Saez et al., 1987; Weydert et al., 1985), and one cluster containing six skeletal muscle myosin heavy chain genes in the order embryonic, 2a, 2x, 2b, perinatal, and extraocular (Sun et al., 2003; Weydert et al., 1985). These myosin genes consist of 38 exons each. Based on their gene size and number of exons, the genes of the muscle myosin gene cluster should be at the upper limit of complexity for a search for tandem gene duplicates. With the new WebScipio extension all genes of the muscle myosin cluster in *Homo sapiens* could be identified (Fig. 6). For the search the *region size* parameter was set to 300,000 nucleotides and the *minimal score for exons* to 50 %. This example also shows the advantage of the new WebScipio extension compared to the *multiple results* option in Scipio. When searching with the *multiple results* option of Scipio and the 2a gene as starting sequence, mixed genes are found for every additional gene candidate (Fig. 6). Scipio does not know about gene borders and analyses all BLAT hits according to their score. Therefore, Scipio combines the highest scoring hits into gene candidate one (2a), the next highest scoring hits into gene candidate two (2x), and so on. The third gene candidate, for example, mainly consists of the exons of the perinatal muscle myosin heavy chain gene, but the N-terminus of the 2b gene is more similar to the 2a gene than the N-terminus of the perinatal gene is, and therefore the 2b N-terminus is combined with the C-terminus of perinatal.

Fig. 6. *Homo sapiens* muscle myosin heavy chain gene cluster: The skeletal muscle myosin heavy chain cluster consists of the genes embryonic, 2a, 2x, 2b, perinatal, and extraocular, from left (5' end) to right (3' end). The WebScipio search for tandem gene duplicates based on the 2a gene identifies all other genes of the cluster. The Scipio search with the parameter *multiple results* also identifies six gene candidates, but only the search sequence (the 2a gene) is found correctly while the other gene candidates consist of fusions of different parts of the other muscle myosin heavy chain genes.

The Nile tilapia *Oreochromis niloticus* contains another type of muscle myosin heavy chain gene cluster (Fig. 7). Here, two genes (Mhc6 and Mhc13) are encoded on the forward strand, and Mhc7 is encoded on the reverse strand. Nevertheless, WebScipio correctly reconstructed the complete cluster when searching with the Mhc13 gene. When searching with Mhc6 or Mhc7, the small C-terminal exons of the respective other genes could not be identified. For this search the *minimal score for exons* parameter was set to 30 % and the *region size* parameter to 50,000 nucleotides. These examples demonstrate that WebScipio with the new extension is able to correctly identify arrays of very large and complex genes.
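The mixed gene candidates described above for the *multiple results* option in the human cluster can be reproduced with a toy model. The following Python sketch is not the Scipio or WebScipio code. The two-exon query, the hit scores, and all function names are invented for illustration, and it assumes that candidate *k* simply receives the *k*-th best hit of every query exon. This is contrasted with grouping the hits by genomic locus first.

```python
from collections import defaultdict

# Hypothetical BLAT-like hits as (query_exon, locus, score).
# Loci and scores are invented; they are not measured values.
HITS = [
    ("exon1", "2a", 100), ("exon1", "2x", 90),
    ("exon1", "2b", 85), ("exon1", "perinatal", 70),
    ("exon2", "2a", 100), ("exon2", "2x", 92),
    ("exon2", "perinatal", 88), ("exon2", "2b", 80),
]


def rank_based_candidates(hits):
    """Candidate k receives the k-th best hit of every exon (no gene borders)."""
    by_exon = defaultdict(list)
    for exon, locus, score in hits:
        by_exon[exon].append((score, locus))
    # Sort the hits of each exon by descending score.
    ranked = {
        exon: [locus for _, locus in sorted(scored, reverse=True)]
        for exon, scored in by_exon.items()
    }
    n_candidates = min(len(loci) for loci in ranked.values())
    return [
        {exon: ranked[exon][k] for exon in sorted(ranked)}
        for k in range(n_candidates)
    ]


def locus_based_candidates(hits):
    """Group hits by locus first, so every candidate stays within one gene."""
    by_locus = defaultdict(dict)
    for exon, locus, _score in hits:
        by_locus[locus][exon] = locus
    return list(by_locus.values())


def is_chimeric(candidate):
    """A candidate is chimeric if its exons come from more than one locus."""
    return len(set(candidate.values())) > 1
```

With these invented scores the third rank-based candidate combines exon 1 of 2b with exon 2 of perinatal, whereas the locus-based grouping yields no mixed candidates. In a real search the locus would have to be derived from genomic coordinates, which is left out of this sketch.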

Fig. 7. *Oreochromis niloticus* muscle myosin heavy chain gene cluster: The Nile tilapia contains a cluster of three muscle myosin heavy chain genes (Mhc13, Mhc6, and Mhc7) of which Mhc7 is encoded in the opposite direction. The last exon is too divergent to be identified in most cases. Only when searching with the Mhc13 gene are the tandem genes Mhc6 and Mhc7 reconstructed completely.

### **3.1.4 Revealing a pseudogene**

For the *Drosophila melanogaster* CG3397 gene the first exon is split into two exons in the prediction of the gene duplication. For this search the default parameters were used and the option to *search for splitted exons* was enabled. The predicted gene is most probably a pseudogene, because either the predicted intron between the two split exons is too short to be spliced, or the exon translation results in a frame shift if both parts are considered as one potential exon. The details are shown in the alignment (Fig. 8).
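The two-sided argument above can be written as a small decision rule. The following Python sketch is an illustration only and not part of WebScipio. The function name, the coordinate convention, and in particular the minimal intron length of 40 nucleotides are assumptions that do not come from the text.

```python
# Assumed threshold in nucleotides; the text gives no value.
MIN_INTRON_LENGTH = 40


def classify_split_exon(part1_end, part2_start, min_intron=MIN_INTRON_LENGTH):
    """Classify the gap between the two parts of a split exon.

    Coordinates are 0-based and half-open: part 1 ends at part1_end
    (exclusive) and part 2 starts at part2_start (inclusive).
    """
    gap = part2_start - part1_end
    if gap < 0:
        raise ValueError("exon parts overlap")
    if gap >= min_intron:
        # Long enough to be spliced out as a regular intron.
        return "plausible intron"
    if gap % 3 == 0:
        # Too short for an intron, but reading through keeps the frame.
        return "in-frame insertion"
    # Too short to be spliced and reading through shifts the frame.
    return "probable pseudogene"
```

For example, `classify_split_exon(100, 110)` describes a gap of 10 nucleotides, which is below the assumed intron threshold and not a multiple of three, so both interpretations fail and the candidate is reported as a probable pseudogene.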

## **3.2 Examples of non-tandemly arrayed gene duplicates**

### **3.2.1 Duplicated gene regions**

Tandemly arrayed genes evolve through unequal recombination. In this process not only single genes might be duplicated but also small genomic regions containing several genes. The result would be a tandemly arrayed group of genes. Because WebScipio searches for each gene separately, it cannot distinguish a group of duplicated genes from a tandem array of single genes. An example of duplicated genomic regions is the region in *Drosophila melanogaster* containing genes coding for histones (Fig. 9). The new algorithm identified many duplicates for each of the His1, His2A, His2B, His3, and His4 genes in the *Drosophila* genome. The genes CG33825 (His1), CG33826 (His2A), CG33894 (His2B), CG33827 (His3), and CG33893 (His4) were used as queries. The His2B and His4 genes are on the reverse strand in comparison to the other genes. The genes are very similar (some code for the same protein sequence), resulting in alignment scores between 99 % and 100 %. Only two more divergent gene duplicates were found for the His2A gene. The first two gene duplicates of His2A have alignment scores of 79 %.
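Overlaying the separate search results at the same scale, as done for the histone genes, can be mimicked by merging the hit positions of all queries and looking for a repeating gene order. The following Python sketch is a hypothetical post-processing step and not a WebScipio feature. The function names and coordinates are invented, and the gene order inside the unit simply follows the order in which the genes are listed in the text, not their real genomic order.

```python
def gene_order(hits_by_gene):
    """Merge per-gene search results into one gene list ordered by position."""
    merged = [
        (start, gene)
        for gene, starts in hits_by_gene.items()
        for start in starts
    ]
    return [gene for _, gene in sorted(merged)]


def repeating_unit(order):
    """Return the shortest unit whose repetition gives the order, else None."""
    n = len(order)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and order == order[:size] * (n // size):
            return order[:size]
    return None


# Invented start coordinates for three copies of a five-gene region.
HISTONE_HITS = {
    "His1": [0, 5000, 10000],
    "His2A": [1000, 6000, 11000],
    "His2B": [2000, 7000, 12000],
    "His3": [3000, 8000, 13000],
    "His4": [4000, 9000, 14000],
}
```

A repeating unit with more than one distinct gene, as returned by `repeating_unit(gene_order(HISTONE_HITS))`, points to a duplicated genomic region, while a unit of a single gene corresponds to a tandem array of that gene alone. Real data would additionally need tolerance for missing or divergent copies.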

Fig. 8. *Drosophila melanogaster* CG3397 gene: A gene duplication could be identified downstream of the CG3397 gene, which, however, most probably is a pseudogene.

Fig. 9. *Drosophila melanogaster* histones: The results of the separate searches for gene duplicates of the histones His1, His2A, His2B, His3, and His4 are shown. Based on the results of the search for each single gene it is not possible to distinguish between a gene duplication and a genomic region duplication. Showing the results of all searches at the same scale reveals that not single genes but a genomic region containing all five histone genes has been duplicated several times.

### **3.2.2 Trans-spliced genes**

Tandem gene duplicates and *trans*-spliced genes could evolve through the same gene duplication process, except that only part of the gene is duplicated instead of the complete gene. The exon-intron structures of tandem gene duplicates and *trans*-spliced genes look very similar, which complicates their differentiation during gene identification. If, for example, the constitutive part of the *trans*-spliced gene consists of only one exon while the *trans*-spliced part consists of groups of similar alternative exons, the correct reconstruction of the *trans*-spliced gene would not look different from a partial reconstruction of a cluster of duplicated genes for which the first (or last) exons were not found because of low similarity. The gene CG1637 of *Drosophila* is a *trans*-spliced gene (McManus et al., 2010). The WebScipio algorithm predicts tandemly arrayed genes for isoforms A and B of CG1637, although the first exons of the potential tandem gene candidates were not found (Fig. 10). Close inspection of the three isoforms shows that the predicted exons do not belong to duplicated genes, but to *trans*-spliced variants of the same gene. Another type of problem is demonstrated by the dynein intermediate chain gene of *Drosophila melanogaster*. Here, the dynein intermediate chain gene is annotated as four separate genes (Sdic1, Sdic2, Sdic3 and Sdic4) in Flybase (version of June 24th, 2011). The problem is, however, that the real first two exons of the gene are not annotated in Flybase.

Fig. 10. *Drosophila melanogaster* CG1737 gene and *Drosophila melanogaster* dynein intermediate chain: The algorithm identified duplicated exons in the *trans*-spliced CG1737 and dynein intermediate chain genes. The search was done with default parameters and the *search for concatenated exons* and *search for splitted exons* options were enabled. To reveal the last and most divergent exon the *region size* parameter was set to 35,000 nucleotides and the *allowed length difference* parameter to 30 amino acids for the dynein intermediate chain gene.

The sequence encoded by the true first exons is conserved throughout all major branches of the eukaryotic tree of life that express a cytoplasmic dynein, in chromalveolates, Excavata, and Opisthokonta. In addition, this N-terminal part of the dynein intermediate chain is of high functional importance because it connects dynein to dynactin by interacting with the dynactin p150 gene. Based on these facts and the found exon order of the genomic region, we expect the gene to be *trans*-spliced (Fig. 10, bottom).

Babushok, D. V., Ostertag, E. M. & Kazazian, H. H., Jr. (2007). Current topics in genome evolution: molecular mechanisms of new gene formation, *Cell Mol Life Sci*, Vol.64, No.5, pp. 542-554

Bertrand, D., Lajoie, M. & El-Mabrouk, N. (2008). Inferring ancestral gene orders for a family of tandemly arrayed genes, *J Comput Biol*, Vol.15, No.8, pp. 1063-1077

Doring, A., Weese, D., Rausch, T. & Reinert, K. (2008). SeqAn an efficient, generic C++ library for sequence analysis, *BMC Bioinformatics*, Vol.9, pp. 11

Elemento, O., Gascuel, O. & Lefranc, M. P. (2002). Reconstructing the duplication history of tandemly repeated genes, *Mol Biol Evol*, Vol.19, No.3, pp. 278-288

Garcia-Fernandez, J. (2005). The genesis and evolution of homeobox gene clusters, *Nat Rev Genet*, Vol.6, No.12, pp. 881-892

Goto, N., Prins, P., Nakao, M., Bonnal, R., Aerts, J. & Katayama, T. (2010). BioRuby: Bioinformatics software for the Ruby programming language, *Bioinformatics*

Hahn, M. W. (2009). Distinguishing among evolutionary models for the maintenance of gene duplicates, *J Hered*, Vol.100, No.5, pp. 605-617

Keller, O., Odronitz, F., Stanke, M., Kollmar, M. & Waack, S. (2008). Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species, *BMC Bioinformatics*, Vol.9, pp. 278

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool, *Genome Res*, Vol.12, No.4, pp. 656-664

Li, W. H., Yang, J. & Gu, X. (2005). Expression divergence between duplicate genes, *Trends Genet*, Vol.21, No.11, pp. 602-607

Long, M., Betran, E., Thornton, K. & Wang, W. (2003). The origin of new genes: glimpses from the young and old, *Nat Rev Genet*, Vol.4, No.11, pp. 865-875

Massingham, T., Davies, L. J. & Lio, P. (2001). Analysing gene function after duplication, *Bioessays*, Vol.23, No.10, pp. 873-876

McManus, C. J., Duff, M. O., Eipper-Mains, J. & Graveley, B. R. (2010). Global analysis of trans-splicing in Drosophila, *Proc Natl Acad Sci U S A*, Vol.107, No.29, pp. 12975-12979

Nei, M. & Roychoudhury, A. K. (1973). Probability of fixation and mean fixation time of an overdominant mutation, *Genetics*, Vol.74, No.2, pp. 371-380

Odronitz, F., Pillmann, H., Keller, O., Waack, S. & Kollmar, M. (2008). WebScipio: an online tool for the determination of gene structures using protein sequences, *BMC Genomics*, Vol.9, pp. 422

Ohno, S. (1970). Evolution by Gene Duplication, Berlin, *Springer*

Prototype JavaScript framework: Easy Ajax and DOM manipulation for dynamic web applications, (2011). Available from http://www.prototypejs.org

purzelrakete's workling at master - GitHub, (2011). Available from http://github.com/purzelrakete/workling

Quijano, C., Tomancak, P., Lopez-Marti, J., Suyama, M., Bork, P., Milan, M., Torrents, D. & Manzanares, M. (2008). Selective maintenance of Drosophila tandemly arranged duplicated genes during evolution, *Genome Biol*, Vol.9, No.12, pp. R176

Ruby on Rails, (2011). Available from http://rubyonrails.org

Ruby Programming Language, (2011). Available from http://www.ruby-lang.org/

The Official YAML Web Site, (2011). Available from http://www.yaml.org/
