## **1. Introduction**


The publication of the first fully sequenced genomes represented a landmark in the biological sciences. The comparison of genomes from different organisms provides us with unprecedented opportunities to address many long-standing evolutionary questions in a more comprehensive way.

### **1.1 Lineage-specific genes**

The availability of several genomes from related organisms permits the identification of newly evolved genes in different lineages or species, the study of their mechanisms of formation and the investigation of their role in adapting to new environments or physiological conditions (Domazet-Loso & Tautz, 2003; Guo et al., 2007; Khalturin et al., 2009; Kuo & Kissinger, 2008; Siepel, 2009; Toll-Riera et al., 2009a; Zhou et al., 2008). Recently formed genes give us the opportunity to study the action of natural selection in recent times and to investigate the processes associated with gene creation (Zhou & Wang, 2008).

The number of species-specific genes, or orphan genes, is substantial. They represent around 14% of the genes in 60 fully sequenced microbial genomes (Siew & Fischer, 2003) and between 20% and 30% in *Drosophila* species (Domazet-Loso & Tautz, 2003; Drosophila 12 Genomes Consortium, 2007). Genes restricted to particular lineages include vomeronasal receptors and casein milk proteins in mammals, which are known to be involved in specific physiological adaptations in this lineage (International Chicken Genome Sequencing Consortium, 2004). Additionally, several lineage-specific genes have been found to be involved in defence against pathogens, such as dermcidin in primates (Toll-Riera et al., 2009a) and surface antigens in apicomplexan parasites (Kuo & Kissinger, 2008). Interestingly, rice orphan genes are more often expressed under environmental pressure (injury and hormone treatment) than non-orphan genes, suggesting that novel genes may help in adaptation to changing conditions (Guo et al., 2007).

Many newly evolved genes are derived from partial or complete duplication of pre-existing genes (Long et al., 2003; Marques et al., 2005; Toll-Riera et al., 2009a; Zhou et al., 2008). Alternative processes of gene formation include exaptation from mobile elements (Nekrutenko & Li, 2001; Toll-Riera et al., 2009b), gene fusion or fission (Parra et al., 2006) and *de novo* gene formation from non-coding sequences (Cai et al., 2008; Heinen et al., 2009; Knowles & McLysaght, 2009; Levine et al., 2006; Toll-Riera et al., 2009a). A genome-wide study in *Drosophila melanogaster* has reported that gene duplication is the most common mechanism for the formation of novel genes in this species (Zhou et al., 2008).

### **1.2 Gene duplication**

In the early 1930s, Haldane (Haldane, 1932) and Muller (Muller, 1935) were the first to propose gene duplication as a mechanism for the generation of new genes. Later, in the 1970s, Ohno published an influential book about the role of gene duplication in evolution (Ohno, 1970), in which he emphasised the importance of gene duplication in generating protein functional diversity. With the availability of complete genome sequences it has become possible to estimate genomic rates of gene duplication (Lynch & Conery, 2000), analyse the pattern of evolution of the two duplicated copies, and identify lineage-specific gene family expansions. Expanded gene families that have been analysed in detail include olfactory receptors in mouse (Mouse Genome Sequencing Consortium, 2002) and human (Gilad et al., 2005), and KRAB-associated zinc-finger genes in primates (Castresana et al., 2004). Genomic studies have shown that gene duplication is associated with increased coding sequence evolutionary rates (Lynch & Conery, 2000; Scannell & Wolfe, 2008), higher tissue expression divergence (Gu et al., 2002; Makova & Li, 2003), and higher regulatory sequence divergence (Farre & Alba, 2010).

The molecular mechanisms that have been proposed to be involved in duplication are non-allelic homologous recombination, transposon-mediated transposition and illegitimate recombination. The first two mechanisms imply the presence of sequence homology (Zhou et al., 2008). Yang and colleagues (Yang et al., 2008) found an excess of repetitive sequences at the breakpoints of the duplicated regions of a group of *Drosophila* lineage-specific young duplicates, suggesting the action of non-allelic homologous recombination. Another study in *Drosophila* found that dispersed duplicates have mainly arisen through non-allelic homologous recombination, while tandem duplicates most often arose through illegitimate recombination (Zhou et al., 2008). It has also been hypothesized that segmental duplications may arise from the recombination of Alu repeat sequences (Bailey et al., 2003).

Duplicated genes appear at a very high rate. It has been estimated that, on average, 0.01 duplicates arise per gene per million years (Lynch & Conery, 2000). The most frequent fate following gene duplication is believed to be the silencing of one of the duplicated copies due to the accumulation of degenerative mutations, a process that may take approximately 4 million years to complete (Lynch & Conery, 2000). However, sometimes both copies survive. The duplicated copy can acquire beneficial mutations and consequently gain a novel function with respect to the parental gene (neofunctionalisation), while the parental copy preserves its original function (Ohno, 1970). The duplicated copy may also be retained due to the split of the original function between the two gene copies (subfunctionalisation) (Hughes, 1994). Finally, if an increase in dosage of a particular gene is beneficial, the new copy may become fixed by positive selection maintaining the same gene structure and function as the parental gene (Kondrashov & Koonin, 2004).
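To give a sense of scale for the rate quoted above, the following minimal sketch multiplies the per-gene duplication rate by a gene count and a time span. The function name and the gene count of 20,000 are illustrative assumptions, not values from the cited study, and the calculation ignores the subsequent loss of duplicates by silencing.

```python
def expected_new_duplicates(n_genes: int,
                            rate_per_gene_per_myr: float = 0.01,
                            myr: float = 1.0) -> float:
    """Expected number of gene duplicates arising in a genome.

    rate_per_gene_per_myr defaults to the 0.01 estimate quoted in the text;
    n_genes and myr are supplied by the caller (hypothetical inputs).
    """
    if n_genes < 0 or rate_per_gene_per_myr < 0 or myr < 0:
        raise ValueError("inputs must be non-negative")
    return n_genes * rate_per_gene_per_myr * myr


# Hypothetical genome with 20,000 genes observed over one million years.
example = expected_new_duplicates(20_000)
```

Under these assumptions, a genome of that size would be expected to gain on the order of a couple of hundred new duplicates per million years, most of which would later be silenced.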

Duplicated genes may confer adaptive advantages. For example, trichromatic colour vision in Old World Monkeys is associated with a pigment gene duplication that occurred after the separation of New World Monkeys, and which gave rise to differentiated red and green pigments (Nathans et al., 1986). Zhang and colleagues (Zhang et al., 1998) reported another example of the action of positive selection after gene duplication. The eosinophil cationic protein (ECP) and eosinophil-derived neurotoxin (EDN) genes are present in Old World Monkeys and hominoids, and probably originated by tandem gene duplication after the divergence of New World Monkeys. EDN is an antiviral agent (Domachowske & Rosenberg, 1997) and ECP is a potent toxin for bacteria and parasites (Rosenberg & Dyer, 1995). The authors detected a non-random accumulation of arginine substitutions in ECP, which may contribute to the generation of pores in pathogens' membranes. Another example is pancreatic ribonuclease 1B (RNASE1B), which originated through duplication of RNASE1, an enzyme used to digest bacteria in the small intestine, in the douc langur (*Pygathrix nemaeus*) around 2-4 million years ago (Zhang et al., 2002). Douc langurs are folivorous monkeys, in which leaves are digested through fermentation by symbiotic bacteria residing in the foregut. The newly duplicated copy, RNASE1B, has evolved very rapidly (ratio of non-synonymous to synonymous nucleotide substitution rates of 4.03), in contrast to the paralogous copy, RNASE1, which has not undergone change. These results indicate a burst of positive selection acting on the duplicated copy. Moreover, most of the substitutions imply the gain of negatively charged residues, lowering the optimal pH for RNASE1B, which could be related to an increase in digestive efficiency, given the lower pH found in the small intestine of douc langurs.
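The RNASE1B value of 4.03 is read here as a ratio of non-synonymous to synonymous substitution rates, a quantity conventionally interpreted against a neutral expectation of 1. The sketch below encodes that rule of thumb; the function name and the tolerance band around 1 are illustrative assumptions, and a real analysis would rely on a statistical test rather than a fixed cutoff.

```python
def classify_rate_ratio(omega: float, tolerance: float = 0.1) -> str:
    """Rule-of-thumb reading of a non-synonymous/synonymous rate ratio.

    omega > 1 suggests positive selection, omega < 1 purifying selection,
    and omega close to 1 is compatible with neutral evolution.
    The tolerance band is an arbitrary illustrative choice.
    """
    if omega < 0:
        raise ValueError("a rate ratio cannot be negative")
    if omega > 1 + tolerance:
        return "positive selection"
    if omega < 1 - tolerance:
        return "purifying selection"
    return "approximately neutral"


# The ratio reported for RNASE1B in the text.
rnase1b_reading = classify_rate_ratio(4.03)
```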

### **1.3 Partial gene duplication**

Not all duplicated genes are identical to their parental copies at birth. In fact, it has been reported that in *C. elegans* only about 40% of the new duplicates arise from complete gene duplications, the remainder representing cases of partial gene duplication (Katju & Lynch, 2003). These partially duplicated genes may recruit sequences from their genomic neighbourhood or from other genes (Katju & Lynch, 2006). In the first case, adjacent non-coding sequences are co-opted for a coding function. Katju and Lynch (Katju & Lynch, 2006) found that about half of the partially duplicated genes did not recruit any surrounding sequences but accumulated mutations, for example in initiation or termination codons, that altered the coding sequence. In *Drosophila melanogaster*, around 30% of the newly formed genes recruited various genomic sequences or formed chimeric gene structures (Zhou et al., 2008). Partially duplicated and chimeric genes are expected to adopt new functions immediately, which may increase their probability of being retained (Patthy, 1999; Zhou et al., 2008). An example of a gene that has arisen by partial duplication is the *Hun* gene in *Drosophila*, located on the X chromosome. *Hun* arose from a partial duplication of the *Bällchen* gene, which is on chromosome 3R. *Hun* lacks 3' coding sequence with respect to *Bällchen*, but has gained 33 amino acids from a nearby intergenic sequence. Further, while *Bällchen* is expressed ubiquitously, *Hun* shows testis-specific expression (Arguello et al., 2006).

The sequence similarity that exists between completely duplicated gene copies and parental gene copies is often sufficient to detect homologues in a whole range of organisms. However, this is often not the case for partially duplicated genes, especially if the sequence common to both duplicates is short and the rate of divergence of the novel gene duplicate is abnormally high. As a result, many partially duplicated genes are identified as orphan or lineage-specific genes, that is, genes that do not yield any significant hits in database protein searches of more distant organisms (Chen et al., 2010; Domazet-Loso & Tautz, 2003; Toll-Riera et al., 2009a). In a recent study that showed that newly formed genes in *Drosophila melanogaster* are as likely to perform essential functions as older genes, it was found that 28 out of the 50 new genes that had arisen through gene duplication corresponded to partial duplications (Chen et al., 2010). These young genes were found to evolve very rapidly, showing a median of 47.3% divergence, at the amino-acid level, from their parents. In an analysis of the mechanisms of formation of primate-specific genes, we observed that about 24% of the newly formed genes had originated through gene duplication, frequently involving partial gene duplication and the recruitment of additional sequences (Toll-Riera et al., 2009a). One example is human XAGE-1, a cancer/testis-associated gene that has partial homology to human XAGE-2, a gene that is well conserved in other mammals. The similarity is limited to the C-terminal half of the orphan XAGE-1 protein. We showed that, in the conserved region, the rate of amino acid sequence evolution of XAGE-1 was double that of XAGE-2, suggesting that the recruitment of additional sequences in XAGE-1 resulted in a marked asymmetry in the evolutionary rates of the two copies.

Partial gene duplication is likely to be very important for the formation of novel gene structures and the evolution of new protein functions, but studies focusing on this type of gene duplication are still scarce. To shed new light on this issue, we decided to analyse the evolutionary patterns of several primate-specific genes (orphan genes) formed, at least partially, by gene duplication. The results show that increased evolutionary rates in the partially duplicated copy are the norm, reinforcing the role of partial gene duplication in the formation of novel genes with distinct functions.

## **2. Results**

Here we use a similar approach to that employed in Toll-Riera et al. (Toll-Riera et al., 2009a) to identify a set of primate-specific genes that show significant similarity to human genes (parental genes) that are well conserved in non-primate species. We investigate the differences in the rate of evolution of the novel and parental genes and discuss the role of partial duplication in increasing the protein functional repertoire.

### **2.1 Identification of primate lineage-specific genes formed by gene duplication**

We identified a set of genes present in human and macaque but absent in 13 non-primate genomes (*Mus musculus*, *Rattus norvegicus*, *Bos taurus*, *Canis familiaris*, *Gallus gallus*, *Xenopus tropicalis*, *Danio rerio*, *Takifugu rubripes*, *Caenorhabditis elegans*, *Drosophila melanogaster*, *Saccharomyces cerevisiae*, *Schizosaccharomyces pombe*, and *Arabidopsis thaliana*). The existence of a homologue in a specific genome was determined by the presence of a BLASTP (Altschul et al., 1997) hit with an expectation value (E-value) smaller than $10^{-4}$, as previously described (Alba & Castresana, 2005). Orphan genes were defined as those for which we could not detect any homologues in any of the species mentioned above. As they were, by definition, present in human and macaque, our collection of orphan genes corresponded to primate-specific genes, presumably formed after the split of the rodent and primate branches and before the speciation of the human and macaque lineages. Once we had this set of orphan genes, we investigated which ones could have arisen through gene duplication by performing BLASTP searches against all human proteins, using a relaxed E-value (E<0.5). We kept those cases for which we could identify human paralogues that were not primate-specific. In such cases, the closest hit in human was considered the putative human parental gene, and the closest non-primate orthologue of the parental gene was taken as the outgroup gene (Figure 1). Protein sequences were aligned with T-Coffee (Notredame et al., 2000), and the alignments between primate-specific genes and parental genes were carefully examined to discard any spurious associations. We also removed any regions that were completely divergent (non-alignable) between the orphan and parental genes.
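The two filtering steps described above can be sketched as follows. This is a minimal illustration that assumes BLASTP results have already been parsed into tuples; the function names, the tuple layout and the toy gene names are hypothetical, and "closest hit" is interpreted here as the hit with the lowest E-value. Only the two E-value thresholds come from the text, and the manual inspection of alignments is not modelled.

```python
HOMOLOGUE_EVALUE = 1e-4   # threshold for calling a homologue in a non-primate genome
PARALOGUE_EVALUE = 0.5    # relaxed threshold for the search against human proteins


def find_orphans(primate_genes, outgroup_hits):
    """Return genes with no significant hit in any non-primate genome.

    outgroup_hits: iterable of (query_gene, species, evalue) tuples.
    """
    with_homologue = {query for query, _species, evalue in outgroup_hits
                      if evalue < HOMOLOGUE_EVALUE}
    return set(primate_genes) - with_homologue


def assign_parents(orphans, human_hits):
    """Map each orphan to its closest human hit that is not itself an orphan.

    human_hits: iterable of (query_gene, subject_gene, evalue) tuples.
    Orphans without a qualifying hit are omitted from the result.
    """
    best = {}
    for query, subject, evalue in human_hits:
        if query not in orphans or subject in orphans or subject == query:
            continue  # keep only orphan queries hitting non-primate-specific genes
        if evalue >= PARALOGUE_EVALUE:
            continue  # hit too weak even for the relaxed threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _evalue) in best.items()}


# Toy example with made-up gene names and E-values.
genes = ["geneA", "geneB", "geneC"]
outgroup = [("geneA", "Mus musculus", 1e-30),
            ("geneB", "Mus musculus", 0.01),
            ("geneC", "Danio rerio", 2.0)]
human = [("geneB", "geneA", 0.02),
         ("geneB", "geneC", 1e-5),
         ("geneC", "geneA", 0.9)]
orphans = find_orphans(genes, outgroup)
parents = assign_parents(orphans, human)
```

In the toy data, geneB and geneC are orphans, but only geneB has a non-orphan human hit below the relaxed threshold, so only geneB would be retained as a candidate duplicated orphan.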

Fig. 1. Tree topology corresponding to gene families containing duplicated orphan genes. The orphan and parental genes are from human, the outgroup gene from a non-primate species.

The final set consisted of 14 orphan genes. Table 1 shows the orphan, parental and outgroup gene names, protein identifiers, and the percent of parental protein that could be reliably aligned with the orphan protein, corresponding to the portion of the protein that had duplicated. Of the 14 orphan genes, 4 represented single copies and the rest belonged to orphan gene families. In only one case, dermcidin, sequence similarity supported a complete gene duplication event.
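One plausible way to compute the percentage of the parental protein aligned with the orphan protein is to count alignment columns in which both sequences carry a residue and divide by the ungapped parental length. The sketch below assumes a pairwise alignment given as two equal-length gapped strings; the function name and the toy sequences are hypothetical, and it does not reproduce the manual removal of unreliable regions described in the text.

```python
def parental_coverage(parental_aln: str, orphan_aln: str, gap: str = "-") -> float:
    """Percentage of parental residues aligned to an orphan residue.

    Both arguments are rows of the same alignment, so they must have equal length.
    """
    if len(parental_aln) != len(orphan_aln):
        raise ValueError("aligned sequences must have the same length")
    parental_len = sum(1 for residue in parental_aln if residue != gap)
    if parental_len == 0:
        raise ValueError("parental sequence contains no residues")
    aligned = sum(1 for p, o in zip(parental_aln, orphan_aln)
                  if p != gap and o != gap)
    return 100.0 * aligned / parental_len


# Toy alignment: 5 of the 8 parental residues face an orphan residue.
coverage = parental_coverage("MKTAYIAK--", "---AYIAKQL")
```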

We used the protein multiple alignments to estimate the number of amino acid substitutions per site (K) in the orphan, parental and outgroup branches. We used PROML, a maximum likelihood based method in the Phylip package for this purpose (Felsenstein, 2005). The results of these computations are discussed below.

### **2.2 Dermcidin and lacritin**

The first example of an orphan gene that arose through gene duplication is dermcidin. This gene encodes a short protein of 110 amino acids. The corresponding parental gene is lacritin, which has orthologues in other mammals and is located on chromosome 12, adjacent to the dermcidin gene. The two genes have a similar exonic structure, and although they are highly divergent, sequence similarity between the two is still detectable (Wang et al., 2006). Dermcidin is secreted in sweat glands, where it has antimicrobial activity (Schittek et al., 2001), and may also be involved in neural survival and cancer (Porter et al., 2003), whereas lacritin is expressed in the lacrimal glands (Ma et al., 2008).

Figure 2 shows the alignment of the complete protein sequences of human dermcidin, human lacritin and cat lacritin. The number of amino acid substitutions per site in the orphan branch was 1.026, about double the number of amino acid substitutions per site in the parental and outgroup branches (0.434 and 0.505, respectively).
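The asymmetry between branches can be made explicit by dividing the orphan branch length by each of the other two. The short sketch below does this with the three values quoted above; the function name and dictionary keys are illustrative choices.

```python
def rate_ratios(k_orphan: float, k_parental: float, k_outgroup: float) -> dict:
    """Ratio of the orphan branch length to the parental and outgroup branch lengths.

    All inputs are numbers of amino acid substitutions per site (K).
    """
    if k_parental <= 0 or k_outgroup <= 0:
        raise ValueError("reference branch lengths must be positive")
    return {
        "orphan/parental": k_orphan / k_parental,
        "orphan/outgroup": k_orphan / k_outgroup,
    }


# Values reported for dermcidin, human lacritin and cat lacritin.
dermcidin_ratios = rate_ratios(1.026, 0.434, 0.505)
```

With these inputs the orphan branch is roughly 2.4 times as long as the parental branch and roughly 2.0 times as long as the outgroup branch, consistent with the "about double" description.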


Fig. 2. Alignment of dermcidin (orphan), human lacritin (parental) and cat lacritin (outgroup) proteins. Identical residues are in green, similar residues in yellow.



| Orphan | Parental name | Parental protein | Outgroup | % |
|---|---|---|---|---|
| Dermcidin | Lacritin | ENSP00000257867 | ENSFCAP00000009317 (cat) | 100% |
| FAM9A | Synaptonemal complex protein 3 | ENSP00000266743 | ENSMUSP00000020252 (mouse) | 87.29% |
| FAM9B | idem | idem | idem | idem |
| FAM9C | idem | idem | idem | idem |
| AL023807.2 | AL365202.1 | ENSP00000382846 | ENSCINP00000011125 (vase tunicate) | 64.42% |
| XAGE-1A | XAGE-2 | ENSP00000333775 | XP\_001249434.1 (cow) | 36.03% |
| XAGE-1B | idem | idem | idem | idem |
| XAGE-1C | idem | idem | idem | idem |
| XAGE-1D | idem | idem | idem | idem |
| XAGE-1E | idem | idem | idem | idem |
| NPIP-like 1 | Acyl-CoA synthetase medium-chain family member 1 | ENSP00000428098 | ENSMUSP00000036140 (mouse) | 18.07% |
| C2orf27A | Ral guanine nucleotide dissociation stimulator-like 4 | ENSP00000290691 | ENSCAFP00000031102 (dog) | 12.05% |
| C2orf27B | idem | idem | idem | idem |
| AL133216.1 | Arsenite-resistance protein 2 | ENSP00000314491 | ENSMUSP00000043123 (mouse) | 9.36% |

Table 1. List of primate-specific genes that have arisen by gene duplication. Protein identifiers are from Ensembl (ENSP) or GenBank (XP). % refers to the percentage of the parental protein that showed homology to the orphan protein.

### **2.3 Partially duplicated orphan genes**

The remaining primate-specific genes that have arisen through gene duplication corresponded to partial duplications of the parental gene (Table 1). They included 3 individual genes (AL023807.2, NPIP-like 1 and AL133216.1) and 3 gene families (FAM9, XAGE-1 and C2orf27). The percentage of the parental protein sequence that could be identified as homologous in the orphan protein ranged from 9.4 to 87.3% (Table 1). With the exception of NPIP-like 1, the orphan gene is located on a different chromosome from the parental gene, although the presence of introns in all orphan genes suggests that they were not retrotransposed copies. We aligned the conserved regions of orphan, parental and outgroup proteins (Figure 3). These alignments were used to estimate the number of amino acid substitutions per site in the orphan, parental and outgroup branches. We also investigated the presence of any known protein domains in the region conserved between parental and orphan proteins, using the Pfam web server (Finn et al., 2010).



The FAM9 family (family with sequence similarity 9) is composed of three genes: FAM9A, FAM9B and FAM9C. All three are predicted to have a Cor1/Xlr/Xmr domain, which is associated with meiotic prophase chromosomes, in the region of similarity to the parental gene (E-values ranging from 0.049 to 4.4e-13). The parental gene, synaptonemal complex protein 3 (SYCP3), is involved in the assembly of the synaptonemal complex during meiosis (Martinez-Garay et al., 2002), but the exact physiological functions of the FAM9 proteins remain unknown.

The largest orphan gene family is XAGE-1, which has 5 members with identical amino acid sequences that are contiguous on the X chromosome. The region conserved between the XAGE-1s and XAGE-2 includes the GAGE domain. The function of GAGE (G antigen) and XAGE (X antigen) domains is unknown, but XAGE and GAGE proteins have been implicated in several human cancers (Zendman et al., 2002).

The two genes belonging to the C2orf27 family are contiguous in the genome, though C2orf27A is located on the forward strand of chromosome 2 whereas C2orf27B is located on the reverse strand. Their function is unknown, but they derive from a protein annotated as Ral guanine nucleotide dissociation stimulator-like 4. The parental protein contains the RasGEF domain, which is a guanine nucleotide exchange factor for Ras-like small GTPases. The duplicated region overlaps minimally with this domain (14 amino acids).

NPIP-like 1 belongs to the nuclear pore complex-interacting protein (NPIP) family. The parental protein contains two AMP-binding domains that are at the N-terminal region of the protein, not the area conserved in the orphan protein, which is the C-terminal part. The NPIP family (Nuclear Pore Interacting Protein), also named *morpheus*, is located on a duplicated segment of chromosome 16. It has been suggested to have experienced a burst of positive selection during the emergence of *Homininae* (Johnson et al., 2001).

Finally, AL133216.1 and AL023807.2 are two primate-specific genes of unknown function, with putative coding sequences of 151 and 121 amino acids, respectively. The parental copy of AL133216.1 modulates arsenic sensitivity and is involved in cell cycle progression and in RNA-mediated gene silencing by microRNA (Gruber et al., 2009). It also contains an arsenite-resistance protein 2 domain (Pfam hit E-value = 3.1e-18). The orphan copy does not contain this domain even though it is located in the conserved region, suggesting that this region has lost its ancestral function in the orphan protein.

Fig. 3. Multiple alignments of the conserved regions between orphan, parental and outgroup proteins. For the XAGE-1 family only XAGE-1A is shown, as the other orphan sequences were identical at the amino acid level. The same is true for the C2orf27 family. Identical residues are in green, similar residues in yellow. See Table 1 for more details.

Table 2 shows the estimated amino acid substitution rates in the orphan, parental and outgroup branches. In the case of identical copies (for example C2orf27A and C2orf27B) only one is taken as representative. In the case of divergent copies (the FAM9 family) the amino acid substitution rates are summed up for all branches from the ancestor to the derived node (see Figure 4). In all cases the duplicated protein is evolving much faster than the parental gene, and in some cases, such as the FAM9 and NPIP-like 1 proteins, more than six times faster. These results indicate that orphan proteins are evolving under much more relaxed constraints, and/or adapting to a new function with respect to their parental copies.
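The comparison described above can be made explicit with a short calculation. The following Python sketch divides the orphan-branch rate by the parental-branch rate for each entry, using the K values reported in Table 2. The names `TABLE2_K`, `rate_ratio` and `orphan_to_parental_ratios` are illustrative choices for this example and are not part of the original analysis.

```python
"""Orphan-to-parental rate ratios from the K values reported in Table 2."""

# K = estimated number of amino acid substitutions per site (Table 2).
# Tuple order: (orphan branch, parental branch, outgroup branch).
TABLE2_K = {
    "Dermcidin": (1.02595, 0.43458, 0.5055),
    "FAM9A": (1.28971, 0.17014, 0.15423),
    "FAM9B": (1.13565, 0.17014, 0.15423),
    "FAM9C": (1.15328, 0.17014, 0.15423),
    "AL023807.2": (0.19840, 0.12203, 0.23096),
    "XAGE-1A": (0.52961, 0.17820, 0.94188),
    "NPIP-like 1": (0.40089, 0.02020, 0.11929),
    "C2orf27B": (0.55865, 0.21080, 0.68342),
    "AL133216.1": (1.37580, 0.00010, 0.00010),
}


def rate_ratio(orphan_k, parental_k):
    """Return how many times faster the orphan branch evolves than the parental one."""
    if parental_k <= 0:
        raise ValueError("parental K must be positive to form a ratio")
    return orphan_k / parental_k


def orphan_to_parental_ratios(table):
    """Map each orphan name to its orphan/parental K ratio."""
    return {name: rate_ratio(k[0], k[1]) for name, k in table.items()}


if __name__ == "__main__":
    ratios = orphan_to_parental_ratios(TABLE2_K)
    # Print from the smallest to the largest ratio.
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name:<12} {ratio:10.2f}")
```

Every ratio is greater than 1, and the FAM9 and NPIP-like 1 ratios exceed 6, in line with the statement above. The ratio for AL133216.1 is dominated by its very small parental value (0.00010) and should be read with caution.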


| Orphan name | Orphan protein | K (orphan) | K (parental) | K (outgroup) |
|---|---|---|---|---|
| Dermcidin | ENSP00000293371 | 1.02595 | 0.43458 | 0.5055 |
| FAM9A | ENSP00000370391 | 1.28971 | 0.17014 | 0.15423 |
| FAM9B | ENSP00000318716 | 1.13565 | 0.17014 | 0.15423 |
| FAM9C | ENSP00000369999 | 1.15328 | 0.17014 | 0.15423 |
| AL023807.2 | ENSP00000381423 | 0.19840 | 0.12203 | 0.23096 |
| XAGE-1A | ENSP00000382698 | 0.52961 | 0.17820 | 0.94188 |
| NPIP-like 1 | ENSP00000350444 | 0.40089 | 0.02020 | 0.11929 |
| C2orf27B | ENSP00000304065 | 0.55865 | 0.21080 | 0.68342 |
| AL133216.1 | ENSP00000382606 | 1.37580 | 0.00010 | 0.00010 |

Table 2. Estimated number of amino acid substitutions per site (K) for orphan, parental and outgroup branches. Orphan protein identifiers are from Ensembl. See Table 1 for more details.

### **2.4 Role of low-complexity sequences**

Low complexity regions (LCRs) are sequences in which one or a few residues are highly overrepresented. Several studies have shown that duplicated gene copies can gain new functions through the acquisition of LCRs (Fondon & Garner, 2004; Salichs et al., 2009). It has also been shown that young proteins contain more LCRs than old proteins (Alba & Castresana, 2005). Therefore, we checked for the presence of LCRs in our set of orphan proteins using the SEG algorithm with default parameters (Wootton & Federhen, 1996).

We found that the FAM9A protein contained a very conspicuous low-complexity sequence. Figure 4 shows the detailed phylogenetic tree of the FAM9 gene family (including the parental and outgroup SYCP3 genes). The ancestral FAM9 evolved very rapidly and eventually underwent two duplication events, leading to FAM9A, FAM9B and FAM9C. The multiple alignment of the region surrounding the LCR in FAM9A shows how, from a small region containing several acidic residues in SYCP3, a larger acidic region was formed in the common FAM9 ancestor, which finally expanded to a 75 amino acid stretch in FAM9A containing a long glutamic acid repeat, as well as poly-alanine and poly-glycine repeats.

As is the case for the SYCP3 proteins, all three human FAM9 proteins show testis-specific expression. However, their cellular localization differs: FAM9B and FAM9C are localized in the nucleus, with low protein levels detectable in the cytoplasm, whereas FAM9A is present at high levels in the nucleolus (Martinez-Garay et al., 2002). The distinct location of FAM9A may be due to the long glutamic acid repeat, as acidic clusters have been shown to mediate protein nucleolar retention (Ochs et al., 1996; Shu-Nu et al., 2000; Ueki et al., 1998). In FAM9A, the low-complexity sequence is located within the Cor1/Xlr/Xmr conserved region, perhaps interfering with its function. In fact, FAM9A shows higher sequence divergence from the common ancestor than FAM9B.

Fig. 4. Phylogenetic tree of the FAM9 gene family. Branch lengths correspond to the estimated number of amino acid substitutions per site, using the alignment in Fig. 3. The protein alignment shown corresponds to exon 5 in FAM9B and FAM9C and to exon 6 in FAM9A, human SYCP3 and mouse SYCP3. The expanded low-complexity region in FAM9A is depicted above the alignment.

Orphan genes are in general poorly annotated and their function is unknown in most cases (Kuo & Kissinger, 2008). The fact that organisms lived perfectly well without them until they appeared in recent times has led scientists to think that orphan genes were, for the most part, dispensable. However, a recent study by Chen and colleagues (Chen et al., 2010) has challenged this viewpoint. The authors identified new young genes in *Drosophila melanogaster* (around 34 million years old) and designed RNA interference lines to knock each of them out (KO). Surprisingly, they found that 30% of these young gene KOs were lethal, as *Drosophila* could not survive without them. These young genes had mainly arisen through duplication and showed higher evolutionary rates than the parental gene, indicating the action of positive selection or relaxation of functional constraints. The authors hypothesized that new genes are quickly integrated into existing pathways, and hence many of them soon become essential for the viability of the organism.

Capra and colleagues (Capra et al., 2010) compared the evolutionary patterns of genes that arose by duplication with those that did not (named novel genes). They argued that the evolutionary pressures should be different in each case as, contrary to novel genes, duplicated genes were functionally and structurally well formed from birth. They showed that although duplicated genes are initially more integrated into cellular networks, both types of new genes gain functions and interactions with time, though novel genes do so more rapidly than duplicated genes. Additionally, novel genes also increase in length through the incorporation of transposable elements or surrounding sequences. This increase in length could be related to the rapid gain of function and interactions experienced by novel genes. They also found that genes tended to interact with genes similar in age and mode of origin. Thus, the mechanism by which a gene originates seems to significantly affect its subsequent evolution.

Several studies have demonstrated that duplicated genes show increased protein evolutionary rates with respect to non-duplicated genes in the same lineage (Castillo-Davis et al., 2004; Cusack & Wolfe, 2007; Kondrashov et al., 2002; Lynch & Conery, 2000; Nembaware et al., 2002; Scannell & Wolfe, 2008; Van de Peer et al., 2001). Here we identified a very strong asymmetry in the rates of evolution of the newly evolved copy (orphan) and the well-conserved copy (parental), the former evolving much faster than the latter. Surprisingly, the parental protein copy did not evolve consistently faster than the outgroup protein (not duplicated), highlighting the fact that we are dealing with a special type of gene duplication in which the copy containing the partially duplicated segment rapidly departs from the ancestral family, which remains essentially unaffected.

Increased evolutionary rates may reflect either relaxation of purifying selection, positive selection, or the combined effects of both these forces. The orphan genes under study predated the split of the human and macaque lineages, which occurred approximately 25 million years ago. If relaxed selection were the only factor behind their increased rates, the genes should by now have become pseudogenes and not be expressed. However, all genes were expressed at the RNA level in one or several tissues. Therefore we must hypothesize that, at least to some extent, positive selection has influenced the evolution of these genes. We compared the rates of evolution of the protein regions that were conserved between orphan and parental proteins, but what about the unique sequences contained in the orphan proteins? These sequences lacked any similarity to other protein-coding genes, so they may be ancestral non-coding sequences that have been co-opted for a coding function (Long et al., 2003). Genes generated *de novo* from non-coding sequences are among the fastest evolving genes (Levine et al., 2006), and there is no reason to believe that unique sequences
