**3. Results and discussion**

WebScipio uses the command-line tool Scipio to reconstruct the gene structures of given protein sequences based on the available eukaryotic genome assemblies. Scipio was developed for the case that the protein sequences and the target genome sequence are from the same organism. Nevertheless, Scipio allows several mismatches that might result from sequencing and assembly errors, such as missing or additional bases, which lead to frame shifts, or in-frame stop codons, which would lead to premature gene stops. Mismatches might also result from differences between the source of the protein sequence, which might have been obtained from cDNA libraries of a certain strain, and the specific sequenced strain of the species. To accomplish this task, Scipio relies on BLAT, which is one of the fastest tools available for the alignment of highly similar protein or DNA sequences. As Scipio tolerates a certain number of mismatches between query and target sequence, it can also be used successfully for cross-species gene reconstructions and predictions (Odronitz et al., 2008). Because Scipio relies on BLAT, the success of a cross-species search depends on the difference between the query and target gene. If genes are highly conserved in evolution, Scipio is able to correctly reconstruct genes in species that diverged hundreds of millions of years ago. If genes evolve fast, Scipio can predict genes only in closely related organisms. This behaviour can also be used to predict gene duplicates in the same organism, and is implemented as the *multiple results* parameter in the Scipio options. Again, because Scipio relies on BLAT, only those duplicates that are very similar will be identified. An advantage of this option is that Scipio is able to find dispersed as well as tandem duplicates.

Predicting Tandemly Arrayed Gene Duplicates with Webscipio 65

Fig. 2. WebScipio result view of the search for tandem gene duplications of the *Drosophila melanogaster* CG14502 gene (Flybase sequence accession FBpp0085935): Exons are illustrated as coloured rectangles, introns as narrow grey rectangles, and gaps as narrow red rectangles. Gaps indicate missing exons of the tandem gene duplicates. The search used the default parameters, except that the *minimal score for exons* parameter was set to 5 % to find some exon duplicates of the first exon.

In an analysis of the origin of new genes in the *Drosophila* species complex (Zhou et al., 2008) it has been shown that the majority of the constrained functional new genes are dispersed duplicates. In contrast, tandem duplications were found to be young events and to lead to lower survival rates. Thus, tandem duplicates are often pseudogenes, most probably because only a few mutations introducing frame shifts or in-frame stop codons are needed to destroy the transcription and expression of the new gene. If duplicates are kept in the genome, they acquire new functions through neofunctionalization and subfunctionalization by accumulation of many substitutions (Ohno, 1970). Those genes are too divergent to be identified by the *multiple results* option of Scipio. However, despite accumulating many substitutions, tandem duplicates very often retain the gene structure of the original gene, including intron splice sites and reading frames of exons. Occasionally further introns might be introduced, or previously existing introns lost, because these changes would not destroy transcription and translation. To use this knowledge in tandem gene duplicate identification, we developed an algorithm that searches for duplicates of a query sequence based on the restrictions imposed by its gene structure. Every piece of DNA in the up- and downstream region of the original exon that has the same splice sites and shares sequence homology with the original exon, when translated in the same reading frame, is considered a candidate for an exon of a duplicated gene. If introns have been lost or gained in the duplicated genes, the splice site restrictions apply to the outer borders of the fused or split exons. WebScipio is able to correctly reconstruct the gene structure for a given protein sequence and is thus well suited as a starting point for searches for candidate exons of duplicated genes.
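The candidate-exon criterion described above can be illustrated with a minimal Python sketch. It is not WebScipio's implementation: the function names, the canonical `AG`/`GT` intron ends, the restriction to the forward strand and to windows of exactly the original exon length, and the gap-free identity score are all simplifying assumptions made for illustration only.

```python
from typing import List, NamedTuple

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


class Candidate(NamedTuple):
    start: int        # 0-based start of the candidate exon in the region
    peptide: str      # translation in the reading frame of the original exon
    identity: float   # fraction of identical positions (gap-free comparison)


def translate(dna: str, phase: int = 0) -> str:
    """Translate dna after skipping `phase` leading bases; unknown codons become X."""
    coding = dna[phase:]
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X")
        for i in range(0, len(coding) - 2, 3)
    )


def identity(a: str, b: str) -> float:
    """Position-wise identity without gaps (a stand-in for a real alignment score)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def find_candidate_exons(region: str, exon_length: int, phase: int,
                         exon_peptide: str, min_identity: float,
                         acceptor: str = "AG", donor: str = "GT") -> List[Candidate]:
    """Scan the forward strand for windows flanked by the assumed splice sites."""
    region = region.upper()
    hits = []
    last_start = len(region) - exon_length - len(donor)
    for start in range(len(acceptor), last_start + 1):
        end = start + exon_length
        if region[start - len(acceptor):start] != acceptor:
            continue  # no intron end directly upstream
        if region[end:end + len(donor)] != donor:
            continue  # no intron start directly downstream
        peptide = translate(region[start:end], phase)
        score = identity(peptide, exon_peptide)
        if score >= min_identity:
            hits.append(Candidate(start, peptide, score))
    return hits


# Toy example: an exon encoding MKTAYIAK and a diverged copy (Y -> F) in the region.
duplicate = "ATGAAAACTGCTTTCATTGCTAAA"
region = "CCCCAG" + duplicate + "GTCCCC"
hits = find_candidate_exons(region, exon_length=24, phase=0,
                            exon_peptide="MKTAYIAK", min_identity=0.5)
```

A real search would also have to cover the reverse strand, tolerate length differences and alignment gaps, and handle fused or split exons, as discussed in the text.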

To search for tandem gene duplicates, an extension to WebScipio was implemented that provides several parameters to adjust the search to user- or genome-specific needs. In most cases, however, the standard parameters will provide reasonable and interpretable results. As soon as the search is done, WebScipio shows an overview of the results as small gene structure pictures (Quick View), which reveal the exon regions of the tandem genes found (Fig. 2). For convenient analysis, the genomic region comprising the gene structure of the query sequence and the exons of the predicted tandem genes is shown in a combined graph and provided as one YAML file. The exons of the original gene are dark coloured, and the corresponding predicted exons have the same but lighter colour. The darkness of the colour relates to the similarity of the predicted exon to the original one. The same colour scheme is used to highlight the various exons in the Alignment view of the genomic regions (Fig. 3). The Alignment view shows the nucleotide sequence of the gene ordered in exons and introns. For every exon, the genomic DNA and the corresponding translation are shown, as well as the alignment of the query sequence to the translation.

To demonstrate the application, quality, and limitations of the new algorithm, we provide some example searches in the following sections. Tandemly arrayed gene duplicates have several characteristics that need to be considered. Gene duplications can be found on both the forward and the reverse strand. The duplicated genes might contain fused exons or might contain additional introns. In the case of retroposed genes, which are derived from the reverse transcription and insertion of processed genes, gene duplications do not have any introns. Although gene duplications are more often found for small genes consisting of one or only a few exons, gene duplicates can also be identified for genes consisting of dozens of exons spanning large genomic regions. Because tandem gene duplicates are defined by being located next to each other in the genome, intergenic regions are expected to be short. This is also the reason why the parameter that bounds the search in the up- and downstream regions of the original gene limits this region to 300,000 nucleotides. However, WebScipio cannot exclude that there may be additional genes in between gene duplicates. An example of such a scenario would be the duplication, or multiple duplications, of small genomic regions that encode several genes. In most cases we considered examples from the fruit fly *Drosophila melanogaster* and sequences from Flybase (Tweedie et al., 2009), because the corresponding genome is of high quality and its annotation is already at a very advanced stage. Fragmented genomes, like draft genomes for which only short contigs are available, or chromosome assemblies containing many gaps, are useful to screen for interesting candidates but do not provide the reliability needed to test the algorithm's quality and limitations. An advanced annotation provides the advantage that the genomic locations of most genes have already been identified. Thus the gene order is already established, although there might still be errors in the annotation of single exons.

Fig. 3. Alignment view of the first two exons of the third gene duplicate of the *Drosophila melanogaster* CG14502 gene (see also Fig. 2): Each exon is named by the tandem gene number and the exon number. In addition, the tandem gene score and the exon score are given for each exon as percentages. The alignment indicates the positions of the sequences in the genome and the protein. The first line of the alignment represents the nucleotide sequence of the gene and the second line the translation of this sequence. The third line shows how the amino acids of this translation match the amino acids coded by the original exon, which are shown in the fourth line. Mismatches are represented by an X in dark red or, if amino acids are chemically similar, in light red. Gaps in the alignment are shown as hyphens in green. The Duplicated Exon 3.3 alignment has been closed for representation purposes.

**3.1 Examples of tandemly arrayed gene duplicates**

**3.1.1 Gene duplicates on both the forward and the reverse strand**

The WebScipio tandem gene duplication extension has been developed to find tandem gene duplications on the forward as well as on the reverse strand in relation to the query gene. The example in Fig. 4 shows five gene duplicates of the *Drosophila melanogaster* heat shock protein 23 gene (Hsp23), which consists of one exon. The first duplicate (Hsp67Bc) and the fourth duplicate (Hsp26) in the genomic region are on the reverse strand; the other duplicates, Hsp22, CG4461, and Hsp27, are in the same reading direction as Hsp23. This search was performed with default parameters, except that the *allowed length difference for exons* parameter was increased to 30 amino acids. The most divergent gene duplicate, Hsp67Ba (Table 1), which is encoded in the genomic region between Hsp26 and Hsp23, was not found. This example shows that, although the sequence identity between the duplicates and the Hsp23 search sequence is very low (Table 1), five duplicates could be identified. The length difference between Hsp23 and Hsp67Ba was too large, so candidates of the length of Hsp67Ba were not included in the search with the given search parameters.

| | Hsp67Bc | Hsp22 | CG4461 | Hsp26 | Hsp67Ba | Hsp23 | Hsp27 |
|---|---|---|---|---|---|---|---|
| Length [aa] | 199 | 174 | 200 | 208 | 445 | 186 | 213 |
| Identity to Hsp23 | 0.29 | 0.31 | 0.26 | 0.49 | 0.15 | 1.00 | 0.41 |
| Strand | rev | for | for | rev | rev | for | for |

Table 1. Comparison of the length, similarity, and reading direction of the genes of the *Drosophila melanogaster* heat shock protein cluster.
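The effect of the *allowed length difference for exons* parameter in the Hsp23 example of Section 3.1.1 can be checked with simple arithmetic on the lengths listed in Table 1. The following Python sketch assumes that the parameter acts as a plain absolute difference, in amino acids, between the query exon and a candidate; how WebScipio applies it internally is not stated here, and the function name is hypothetical.

```python
# Protein lengths in amino acids, taken from Table 1.
QUERY_NAME = "Hsp23"
QUERY_LENGTH_AA = 186
CLUSTER_LENGTHS_AA = {
    "Hsp67Bc": 199,
    "Hsp22": 174,
    "CG4461": 200,
    "Hsp26": 208,
    "Hsp67Ba": 445,
    "Hsp27": 213,
}


def within_length_difference(query_len: int, candidate_len: int,
                             allowed_difference: int) -> bool:
    """Assumed rule: keep a candidate if its length differs by at most the allowed value."""
    return abs(candidate_len - query_len) <= allowed_difference


# Value used for the search described in Section 3.1.1.
ALLOWED_DIFFERENCE_AA = 30

passing = [
    name for name, length in CLUSTER_LENGTHS_AA.items()
    if within_length_difference(QUERY_LENGTH_AA, length, ALLOWED_DIFFERENCE_AA)
]
excluded = [name for name in CLUSTER_LENGTHS_AA if name not in passing]

for name, length in CLUSTER_LENGTHS_AA.items():
    print(f"{name}: |{length} - {QUERY_LENGTH_AA}| = {abs(length - QUERY_LENGTH_AA)} aa")
```

Under this assumption, the five duplicates that were found differ from Hsp23 by 12 to 27 amino acids and pass the 30 amino acid limit, whereas Hsp67Ba differs by 259 amino acids and is excluded, which is consistent with the result reported for this search.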
