nucleosome positions) has a strong effect on gene expression, which suggests that translocated duplicates may show expression divergence by virtue of chromosomal position alone. In addition, Ren et al. (2005) found that tandem duplicates that shared expression domains tended to have dissimilar sequence-based functions. Shoja et al. (2007) noted that tandem gene duplicates tended to show a relationship between expression divergence and chromosomal distance.

In their work on the possible action of gene conversion on the evolution of duplicated segments in *Drosophila*, Osada and Innan (2008) noted that duplications lying near the edges of duplicated segments showed more sequence divergence, suggesting that sublocation within a duplicated segment is an additional factor to consider in studies of duplicate divergence.

The broad functional category to which a gene belongs can also influence its freedom to explore divergent functions. In an analysis of genes in the rice genome produced through a specific WGD, Yim et al. (2009) found that duplicate genes with divergent functions showed a significant enrichment towards metabolism-related activity. Langille and Clark (2007) showed that "cell physiological process" genes were particularly amenable to duplication via transposition. Perhaps reflecting similar functional pressures, Li et al. (2010) found that subcellular localization also influenced the divergence of expression between duplicate genes.

The mode of retention may also depend on the amount of selective pressure acting on a duplicate's coding sequence. Semon and Wolfe (2008) showed that duplicates undergoing slow rates of sequence evolution seemed particularly prone to regulatory subfunctionalization. This observation is echoed in Arnaiz et al. (2010), who found that duplicate pairs in *Tetraurelia* with divergent expression profiles were unlikely to undergo sequence subfunctionalization. Li et al. (2009) found that the mode of duplication had a substantial effect on the degree of expression divergence between duplicates, based on microarray expression profiles of rice tissues.

Nielsen et al. (2010) suggest that genes under strong selective pressure produce duplicates that are quickly nonfunctionalized, suggesting low tolerance for (poisonous) isoforms of essential products. Thus, a gene's essentiality and, by consequence, age, may both determine the extent to which gene duplicates may be retained.

**6. Tools for measuring gene regulation**

While a gene's regulation is partly determined by its complement of non-coding elements (as well as its genomic location, e.g., proximity to histones/heterochromatin), efforts to predict regulation from sequence alone have met with limited success, owing to non-linear interactions between various regulatory domains (Jarinova et al., 2008). A separate study found that transcription factor binding site turnover was insufficient to explain cis-regulatory evolution across orthologs (Venkataram & Fay, 2010).

Since accurate predictions of gene regulation based on genomic context and peripheral regulatory elements remain elusive, most studies of gene regulation depend on empirical measurements of gene products (i.e. mRNA or protein) as evidence for a gene's expression under given conditions. Tools for quantifying the abundance of specific mRNA and/or protein species, such as PCR and Western blots, are standard laboratory techniques.

Within the past decade, however, a number of high-throughput technologies have become available that allow the localization and abundance of gene products to be measured empirically on a genomic/proteomic scale. At present, the most widely used platform is the microarray, an assay carrying a very large number of probes. Each probe is specific to a known transcript, allowing the potential for complete coverage of all known and predicted genes in a known genome sequence. Custom arrays can also be built from cDNA libraries when working with non-model organisms. Databases replete with microarray data are now publicly available for data mining, allowing a gene's expression (or lack thereof) to be profiled across tissues, timepoints, and stimuli. This aggregate gene behaviour is referred to as an "expression profile", and can serve as an empirical proxy of overall gene function. As more microarray data become available, the quality of this proxy will improve.
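The idea of an expression profile as a comparable vector can be made concrete with a minimal sketch. It assumes that each gene's profile is a list of log-scale intensities measured over the same ordered set of conditions, and it uses one minus the Pearson correlation as a measure of expression divergence between two duplicates. Both the metric and the example values are illustrative assumptions, not a method prescribed by any of the studies cited here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def expression_divergence(profile_a, profile_b):
    """Assumed divergence metric: 0 for identical shape, 2 for opposite shape."""
    return 1.0 - pearson(profile_a, profile_b)


# Hypothetical intensities over the same five conditions (values are made up).
CONDITIONS = ["root", "leaf", "flower", "cold", "drought"]
dup_a = [8.0, 6.0, 4.0, 2.0, 5.0]
dup_b = [16.0, 12.0, 8.0, 4.0, 10.0]  # same shape as dup_a, different scale
dup_c = [2.0, 4.0, 6.0, 8.0, 5.0]     # mirror image of dup_a

conserved = expression_divergence(dup_a, dup_b)
diverged = expression_divergence(dup_a, dup_c)
```

Because correlation ignores overall scale, this metric treats a duplicate that is uniformly more abundant as undiverged; a distance on absolute intensities would be the alternative if dosage differences matter to the question being asked.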

Expression measurement technologies measure gene activation directly, and are agnostic to the regulatory inputs/mechanisms that lead to transcription. In some cases, cis-regulatory regions can undergo substantial changes/shuffling without having much effect on the ultimate transcription behaviour of a gene -- transcription measurement technologies can help distinguish these cases from those that have actually changed a gene's expression phenotype (Comelli & Gonzalez, 2009).

In addition to general purpose (i.e. gene, exon) microarrays, several arrays have been designed to be maximally sensitive to differences between closely related genes. Microarrays use probes that measure targets by hybridizing to nucleotides directly via base complementarity. Studies have previously demonstrated that the nucleotides at the center of the probe have the most influence on binding strength. To exploit this, some researchers have designed microarrays for comparing closely related genes (e.g. homeologs) using probes that place a known distinguishing SNP at the central position (Chaudhary et al., 2009; Flagel & Wendel, 2010; Flagel et al., 2008; Udall et al., 2006). This design should minimize cross-hybridization, though it should be noted that previous studies have found that cross-hybridization is only of concern when target sequences are >90-95% identical (Rajashekar et al., 2007). For duplicate genes that have highly similar sequences, alternative measurement technologies like deep sequencing can be used to obtain unbiased paralog-specific expression profiles.
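The SNP-centered probe design can be sketched as a simple window search. The function below is a hypothetical illustration, not the pipeline used in the cited studies: it assumes two homeolog sequences that are already aligned without gaps, an odd probe length (the default of 25 is an arbitrary illustrative value), and the additional constraint that the central SNP is the only difference inside the window. It returns target-sense windows; the physical probe would be the complement of each.

```python
def snp_centered_probes(seq_a, seq_b, probe_len=25):
    """Find windows where a single SNP between two homeologs sits at the center.

    Returns a list of (snp_position, window_from_a, window_from_b) tuples.
    probe_len=25 is an illustrative default, not taken from the cited studies.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length, without gaps")
    if probe_len % 2 == 0:
        raise ValueError("probe_len must be odd so that a central position exists")
    half = probe_len // 2
    probes = []
    # Only positions with a full flank on both sides can be probe centers.
    for pos in range(half, len(seq_a) - half):
        if seq_a[pos] == seq_b[pos]:
            continue  # not a distinguishing site
        window_a = seq_a[pos - half:pos + half + 1]
        window_b = seq_b[pos - half:pos + half + 1]
        # Assumed design rule: the central SNP is the only mismatch in the window.
        mismatches = sum(1 for a, b in zip(window_a, window_b) if a != b)
        if mismatches == 1:
            probes.append((pos, window_a, window_b))
    return probes


# Toy homeolog pair differing at a single site (index 6).
homeolog_a = "ACGTACGTACGTA"
homeolog_b = "ACGTACATACGTA"
pairs = snp_centered_probes(homeolog_a, homeolog_b, probe_len=5)
```

Each returned pair gives one probe per homeolog, so the two signal intensities at a given SNP can be compared directly.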

Quantitative proteomics techniques such as iTRAQ (Burkhart et al., 2011) or 2D differential in-gel electrophoresis provide a similarly high-throughput platform for the quantitation of protein abundance. The data differ from microarray data in two respects: the identities of quantified proteins are often not known in advance, and the coverage of the proteome is incomplete and sensitive to experimental parameters. However, protein abundances may be a more accurate reflection of gene action, as proteins are the active products of genes in most cases and mRNA abundance does not always correlate with protein abundance.

Gibson and Goldberg (2009) conducted a study on yeast duplicates using a novel metric of functional differentiation: the number and type of protein interactions. The authors used both affinity-precipitation mass spectrometry and yeast two-hybrid assays to construct networks of protein interactions, and then sought to test whether the patterns of functional differentiation better fit models of subfunctionalization or neofunctionalization. Their work expands on previous studies that describe the functional evolution of the genome/proteome in terms of the growth of (novel) protein interactions. They illustrate how existing methods overlook self-self interactions in the parents/progeny, and propose means of avoiding this bias. In general, they found that subfunctionalization was the prevalent driver of protein interaction network evolution.

Recently, sequencing and mass spectrometry have both achieved levels of throughput that make it possible to survey the transcriptome or proteome directly. While these technologies have considerable promise as a source for expression data, at present fewer data are available from these platforms (but see Harhay et al., 2010). However, the essential idea of the expression profile holds constant, irrespective of the specific sort (and indeed, mixture) of data that is mined.

create an "F1" allopolyploid), and post-hybridization (the present day allopolyploid). Thus, the functions of both parental genes can be compared to novel and mature hybrids, revealing the immediate effects upon and eventual trajectory of functional evolution.

The utility of this approach can be seen in Chaudhary et al. (2009), where the functional profiles of homeologous genes could be succinctly depicted as two-component pie charts. The dominance of one genome's homeolog over another can be visualized as an unequal partitioning in the pie, and changes to this partitioning following the transition from diploidy to allopolyploidy mark possible instances of functional specialization.

However, in many cases there are no suitable extant orthologs available to serve as models for ancestral gene function. In these cases, there are a number of algorithms for estimating ancestral gene function based directly on the functions of descendant (and other related) genes. Estimation methods can try to infer both gene regulation and gene sequence/structure from present day data.

Microarray-based gene expression profiles have been used in several efforts to estimate ancestral gene function. In a study of stress response genes in *Arabidopsis*, estimates of ancestral gene function were constructed using BayesTraits (Pagel & Meade, 2006), with the present day response profiles used as primary data. For each extant stress response gene, responses to various stresses were coded based on expression level changes (up-regulation, down-regulation, no response). By adjusting the parameters of the BayesTraits program, the authors were able to select a model for gain/loss of response behaviour. This information, when combined with phylogenetic trees mapping out the sequence relationships for each gene family, allowed estimates of the stress response behaviour of ancestral genes (internal nodes on the tree) (Zou et al., 2009).

Another microarray-based approach was explored in Doxey et al. (2007). The study examined the beta-(1,3)-glucanase gene family in *Arabidopsis*, using expression profiles constructed from microarray measurements on tissue and stress response patterns. The expression data for all genes in the family were grouped using hierarchical clustering, such that genes with similar (correlated) expression profiles were grouped together. Based on this clustering, genes were assigned labels according to their functional groups, and these labels were then used as primary data for the reconstruction of ancestral states on the gene family phylogenetic tree via parsimony. Using this approach, the expression profile of ancestral, pre-duplication sequences could be estimated from the values reconstructed on the tree.

This approach of reconstructing gene functions as characters on a gene phylogenetic tree has a lot of potential, as it allows all members of a gene family to contribute information about the functional breadth explored in a gene family. The exact quantity reconstructed on the tree can vary from simple binary tissue presence/absence (Karanth et al., 2009) to the exact expression abundance as measured by a high-throughput assay (Guo et al., 2007; Li et al., 2005; Oakley et al., 2005).

There have also been efforts to reconstruct ancestral gene sequences, with the hope of reconstructing gene function. Working from the extant variety of fluorescent proteins, Field and Matz (2010) modeled the evolution of fluorescence color in the family by estimating and producing gene sequences at the internal nodes of the fluorescent protein family phylogenetic tree. By producing proteins based on the estimated ancestral sequences, the authors were able to estimate the fluorescence colors of evolutionary intermediates in the family.
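The parsimony-based reconstruction of discrete expression labels on a gene family tree, as described for the Doxey et al. (2007) study, can be illustrated with a minimal sketch of the bottom-up pass of Fitch parsimony. The choice of the Fitch algorithm is an assumption made here for illustration; the source states only that parsimony was used. The tree encoding (nested two-element tuples with gene names at the leaves), the gene names, and the cluster labels are all hypothetical.

```python
def fitch_up(node, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree.

    node: a leaf name (str) or a (left, right) tuple of subtrees.
    leaf_states: dict mapping each leaf name to its discrete label.
    Returns (candidate_states, minimum_changes) for the subtree rooted at node.
    """
    if isinstance(node, str):
        return {leaf_states[node]}, 0
    left, right = node
    left_set, left_cost = fitch_up(left, leaf_states)
    right_set, right_cost = fitch_up(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed at this node.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one state change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical three-gene family; labels stand in for expression clusters.
family_tree = (("gene1", "gene2"), "gene3")
cluster_labels = {"gene1": "cluster_A", "gene2": "cluster_B", "gene3": "cluster_A"}

root_states, changes = fitch_up(family_tree, cluster_labels)
```

This pass yields only the set of equally parsimonious states at the root and the minimum number of changes; assigning a single state to every internal node would require the corresponding top-down pass, which is omitted here.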
