overlook self-self interactions in the parents/progeny, and propose means of avoiding this bias. In general, they found that subfunctionalization was the prevalent driver of protein interaction network evolution.

Recently, sequencing and mass spectrometry have both achieved levels of throughput that make it possible to survey the transcriptome or proteome directly. While these technologies have considerable promise as a source of expression data, at present fewer data are available from these platforms (but see Harhay et al., 2010). However, the essential idea of the expression profile holds constant, irrespective of the specific sort (and indeed, mixture) of data that is mined.

**7. Models of parental gene function**

Since available sequence data are generally restricted to present-day organisms, it is not possible to directly measure a gene's function before and after duplication. As such, when presented with a pair of paralogs, it is often unclear which gene, if either, has retained the "ancestral function". In this section, a number of proposed techniques for estimating pre-duplication function are described. These techniques fall broadly into two categories: techniques which seek to find an appropriate reference organism elsewhere in nature (i.e. a sister species with a somewhat divergent duplication history), and techniques which attempt to estimate or reconstruct ancestral function from those observed in extant species.

If gene information is available for two closely related species, it is often possible to find a number of examples where a gene duplication event has occurred in only one of the two species. In these cases, paralogs in one genome will correspond to a single ortholog in the other. By comparing the functions of the paralogs to the unduplicated ortholog, it may be possible to infer which of the two paralogs has undergone more dramatic functional changes. Unfortunately, this approach is restricted to genes present in a 2-to-1 fashion, and even in these cases caution must be taken to ensure the duplication event truly post-dates the speciation event.

One interesting variant of this strategy is to use distantly related members of the same gene family from within the same species (Panchin et al., 2010). Since most genes belong to families with several members, ancient and highly diverged gene family members can serve as a proxy for an orthologous outgroup when studying recent duplicates. This approach is useful for calculating rates of sequence evolution between recent duplicates.

Comparisons between lineages which have and have not undergone WGD events can also shed light on the evolution of function post-duplication. For example, Kassahn et al. (2009) used mouse orthologs as reference points for evaluating the post-WGD expression divergence in duplicate genes from five teleost fish species, suggesting that this approach is viable even when the organisms being compared are distant relatives.

Allopolyploids present a unique opportunity for studying gene evolution in the aftermath of widespread duplication. Allopolyploids are hybrids of distinct species, and in many cases the unhybridized lineages have persisted alongside their allopolyploid cousins to the present day. In these cases, the history of gene functional evolution can be inferred by examining gene expression behaviour at four stages: pre-hybridization (the two present day diploid parental strains), day zero hybridization (a cross of the two modern day parentals to create an "F1" allopolyploid), and post-hybridization (the present day allopolyploid). Thus, the functions of both parental genes can be compared to novel and mature hybrids, revealing the immediate effects upon and eventual trajectory of functional evolution.

The utility of this approach can be seen in Chaudhary et al. (2009), where the functional profiles of homeologous genes could be succinctly depicted as two-component pie charts. The dominance of one genome's homeolog over the other can be visualized as an unequal partitioning of the pie, and changes to this partitioning following the transition from diploidy to allopolyploidy mark possible instances of functional specialization.
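The two-component partitioning described above can be illustrated with a minimal sketch. It assumes that each homeolog's expression is available as a single non-negative measurement (for example, a read count) and that the pre-hybridization baseline is supplied as a pair of values for the two parental genes; the function names and inputs are hypothetical and are not taken from Chaudhary et al. (2009).

```python
def homeolog_share(expr_a, expr_b):
    """Fraction of the combined expression contributed by homeolog A.

    This is the size of A's slice in a two-component pie chart.
    """
    total = expr_a + expr_b
    if total <= 0:
        raise ValueError("no expression detected for either homeolog")
    return expr_a / total


def partition_shift(baseline, polyploid):
    """Change in homeolog A's share between two stages.

    `baseline` and `polyploid` are (expr_a, expr_b) pairs, e.g. the diploid
    parental genes and the homeologs in the allopolyploid. A value far from
    zero flags a candidate for functional specialization (illustrative only).
    """
    return homeolog_share(*polyploid) - homeolog_share(*baseline)


# Hypothetical values: equal expression in the parents, A-dominant afterwards.
shift = partition_shift(baseline=(50, 50), polyploid=(80, 20))
```

Computing the shift per tissue or per condition gives one number for each pie-chart comparison, which makes it straightforward to rank genes by how strongly their partitioning changed.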

However, in many cases there are no suitable extant orthologs available to serve as models for ancestral gene function. In these cases, there are a number of algorithms for estimating ancestral gene function based directly on the functions of descendant (and other related) genes. Estimation methods can try to infer both gene regulation and gene sequence/structure from present day data.

Microarray-based gene expression profiles have been used in several efforts to estimate ancestral gene function. In a study of stress response genes in *Arabidopsis*, estimates of ancestral gene function were constructed using BayesTraits (Pagel & Meade, 2006), with the present day response profiles used as primary data. For each extant stress response gene, responses to various stresses were coded based on expression level changes (up-regulation, down-regulation, no response). By adjusting the parameters of the BayesTraits program, the authors were able to select a model for gain/loss of response behaviour. This information, when combined with phylogenetic trees mapping out the sequence relationships for each gene family, allowed estimates of the stress response behaviour of ancestral genes (internal nodes on the tree) (Zou et al., 2009).
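The coding step, in which each expression change is reduced to one of three discrete states, might look like the following sketch. The log2 fold-change input, the threshold of 1.0, and the single-letter state codes are all assumptions made for illustration; they are not the values or the encoding used by Zou et al. (2009), and the sketch does not invoke BayesTraits itself.

```python
def code_response(log2_fold_change, threshold=1.0):
    """Reduce one expression change to a discrete response state.

    Returns "U" (up-regulated), "D" (down-regulated) or "N" (no response).
    The threshold is a hypothetical cut-off, not a published value.
    """
    if log2_fold_change >= threshold:
        return "U"
    if log2_fold_change <= -threshold:
        return "D"
    return "N"


def code_profile(log2_changes, threshold=1.0):
    """Code a gene's responses to several stresses.

    `log2_changes` maps a stress name to the log2 fold change observed for
    one gene under that stress.
    """
    return {stress: code_response(value, threshold)
            for stress, value in log2_changes.items()}


# Hypothetical measurements for a single extant gene.
profile = code_profile({"cold": 2.3, "salt": -1.5, "heat": 0.2})
```

Each stress then becomes one discrete character per gene, which is the form of input that character-evolution programs of this kind generally expect.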

Another microarray-based approach was explored in Doxey et al. (2007). The study examined the beta-(1,3)-glucanase gene family in *Arabidopsis*, using expression profiles constructed from microarray measurements of tissue and stress response patterns. The expression data for all genes in the family were grouped using hierarchical clustering, such that genes with similar (correlated) expression profiles were grouped together. Based on this clustering, genes were assigned labels according to their functional groups, and these labels were then used as primary data for the reconstruction of ancestral states on the gene family phylogenetic tree via parsimony. Using this approach, the expression profile of ancestral, pre-duplication sequences could be estimated from the values reconstructed on the tree.
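To make the parsimony step concrete, the sketch below applies Fitch's algorithm, a standard parsimony method for discrete characters on a rooted binary tree. It is offered as an illustration of the general technique and is not necessarily the procedure used by Doxey et al. (2007). The tree encoding (nested 2-tuples of unique gene names), the gene names, and the cluster labels are hypothetical, and ties are broken alphabetically purely for reproducibility.

```python
def fitch_reconstruct(tree, leaf_states):
    """Assign one most-parsimonious state to every node of a binary tree.

    `tree` is a nested 2-tuple of leaf names, e.g. (("g1", "g2"), "g3").
    `leaf_states` maps each leaf name to its observed label.
    Returns a dict mapping every node (leaf name or tuple) to a state.
    """
    candidate_sets = {}

    def up(node):
        # Bottom-up pass: intersect the child sets, or take their union
        # when the intersection is empty (which implies one state change).
        if isinstance(node, str):
            candidate_sets[node] = {leaf_states[node]}
        else:
            left, right = node
            a, b = up(left), up(right)
            candidate_sets[node] = (a & b) or (a | b)
        return candidate_sets[node]

    assigned = {}

    def down(node, parent_state):
        # Top-down pass: keep the parent's state when it is allowed here,
        # otherwise pick one candidate (alphabetical tie-break, arbitrary).
        options = candidate_sets[node]
        state = parent_state if parent_state in options else min(options)
        assigned[node] = state
        if not isinstance(node, str):
            for child in node:
                down(child, state)

    up(tree)
    down(tree, None)
    return assigned


# Hypothetical family of four genes labelled by expression cluster.
tree = (("g1", "g2"), ("g3", "g4"))
labels = {"g1": "root", "g2": "root", "g3": "root", "g4": "pollen"}
states = fitch_reconstruct(tree, labels)
```

Here `states[tree]` gives the label inferred for the common ancestor of the whole family, and `states[("g3", "g4")]` the label inferred for the ancestor of that duplicate pair. When several reconstructions are equally parsimonious, this sketch reports only one of them.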

This approach of reconstructing gene functions as characters on a gene phylogenetic tree has considerable potential, as it allows all members of a gene family to contribute information about the functional breadth explored by that family. The exact quantity reconstructed on the tree can vary from simple binary tissue presence/absence (Karanth et al., 2009) to the exact expression abundance as measured by a high-throughput assay (Guo et al., 2007; Li et al., 2005; Oakley et al., 2005).

There have also been efforts to reconstruct ancestral gene sequences, with the hope of reconstructing gene function. Working from the extant variety of fluorescent proteins, Field and Matz (2010) modeled the evolution of fluorescence color in the family by estimating and producing gene sequences at the internal nodes of the fluorescent protein family phylogenetic tree. By producing proteins based on the estimated ancestral sequences, the authors were able to estimate the fluorescence colors of evolutionary intermediates in the family.


an apparent neofunctionalization event followed by subfunctionalization in a subsequent duplication.

**8.3 Comparing with a non-duplicated ortholog**

As discussed earlier, one effective technique for estimating the function of the ancestor of a pair of duplicate genes is to refer to a related species where the locus is unduplicated. In this case, the assumption is that the orthologous gene is behaving in the related genome as the parental gene was behaving prior to the duplication event. This point of reference makes it possible to distinguish between models of duplicate retention, lending support to subfunctionalization versus neofunctionalization, for example.

In a study of zebrafish-specific WGD-produced duplicates, Kassahn et al. (2009) used unduplicated mouse orthologs as a reference, despite the considerable distance separating these two organisms. Multiple gene properties were compared between paralogs and their mouse ortholog, including sequence, structure, and expression information. The authors found support for neofunctionalization in a number of duplicates, and found that regulatory changes were far more common than changes to gene products.

In a study of human genes, Panchin et al. (2010) chose to use distantly related gene family members as proxies for ancestors of recent paralogs. They demonstrated that, in many cases, the recent duplicates are evolving asymmetrically, with one duplicate accumulating sequence mutations much faster than its sibling.

Semon and Wolfe (2008) conducted a study comparing the fate of WGD duplicates in *X. laevis*, an allopolyploid, to *X. tropicalis*, a related species that did not undergo any WGD. The authors compared expression patterns across 11 tissue types and related losses of tissue breadth to possible subfunctionalization. In addition, they compared the fate of duplicated genes produced through two different large-scale duplication mechanisms by comparing *X. laevis* to zebrafish, a species with a well studied WGD that did not stem from allopolyploidy. They found that duplicates retained in the *X. laevis* duplication were also frequently retained in duplicate in zebrafish, suggesting common influences on the duplicability of these gene varieties.

Another example of a well-studied allopolyploid, cotton, has been discussed in previous sections (Chaudhary et al., 2009; Flagel et al., 2008; Flagel & Wendel, 2010). One unique observation made possible in this system is the phenomenon of transgressive segregation, where the expression profiles of homeologous genes eventually evolve to resemble neither of the parental strains, suggesting a unique adaptation to the presence of two essentially complete genomes within a single cell (Flagel & Wendel, 2010).

**8.4 Comparing gene product properties**

While not as easily assayed as gene expression, the encoded products of genes (i.e. proteins) can also suggest the gain and loss of functions. As a simple example, the rate of protein sequence evolution can be compared between duplicates by comparing their respective rates of synonymous and non-synonymous mutation. While not necessarily illustrative of the nature of the difference, this method can provide evidence for asymmetrical selection, suggesting one duplicate is acquiring amino acid altering mutations faster than the other (Ganko et al., 2007). Working from a list of 15 of the most asymmetrically diverged WGD-derived protein sequences in *S. cerevisiae*, Turunen et al. (2009) noted substantial indels in addition to changes in important catalytic residues and
