### **1. Introduction**

382 Gene Duplication


Members of multiple gene families in higher organisms allow for more refined cellular signaling networks and structural organization toward more stable physiological homeostasis. Gene duplication is one of the most powerful ways of providing an opportunity to create a novel gene(s) because a novel function might be acquired without the loss of the original gene function (Ohno, 1970). Gene duplication can result from unequal crossing over by recombination, retroposition of cDNA, or whole-genome duplication. Furthermore, a replication-based mechanism of change in gene copy number has been proposed recently (Hastings et al., 2009). Gene duplication generated by retroposition is frequently accompanied by deleterious effects because the insertion of cDNA into the genome is nearly random and can unlink the original gene location, resulting in an alteration of the original vital functions of the target genes. Thus retroelements such as transposable elements and endogenous retroviruses have been thought of as "selfish". On the other hand, gene duplication caused by unequal crossing over generally results in tandem alignment, which less frequently disrupts the functions of other genes. Recent genome-wide studies have demonstrated that retroelements can contribute to the creation of individual novel genes and to the modulation of gene expression, which allows for the dynamic diversity of biological systems, such as placental evolution (Rawn & Cross, 2008). It is now recognized that tandem duplication and retroposition are among the key factors that initiate the creation of novel gene family members (Brosius, 2005; Sorek, 2007; Kaessmann, 2010). By these mechanisms, species-specific gene duplication can lead to species-specific gene functions, which might contribute to species-specific phenotypes (Zhang, 2003). For example, many genes derived from retroelements are expressed in mammalian placentas, and species-specific gene duplication has occurred multiple times during placental evolution (Rawn & Cross, 2008).

If a combination of tandem gene duplication and retroposition of cDNA occurs, there is a good possibility for the creation of a novel gene(s) because a novel function could be acquired with the guarantee that the original gene functions will be retained. This type of evolutionary process has been described, e.g. in the *Jingwei* gene of *Drosophila*, where segmental duplication of a certain gene followed by retroposition of alcohol dehydrogenase (*Adh*) cDNA into one of the copied genes created a new Adh with an altered substrate specificity (Zhang et al., 2004). Furthermore, since the insertion of retrotransposons can speed up the natural mutation process tremendously (Makalowski, 2003), the combined process of tandem duplication followed by retrotransposon insertion has a greater potential to generate a novel gene(s) (Fig. 1). Indeed, we have identified an example of this type, the p97Bcnt protein (Nobukuni et al., 1997). The *p97Bcnt/Cfdp2* gene was created in a common ancestor of ruminants by a partial duplication of the ancestral *Bcnt*/*Cfdp1* gene followed by insertion of an order-specific retrotransposon, Bov-B LINE (Iwashita et al., 2003, 2006). As a result, the paralog recruited an apurinic/apyrimidinic (AP)-endonuclease domain in the middle of the protein. In this article, we summarize the gene organization and protein structures of three Bcnt family members, and describe their biochemical characteristics. We also argue that the process of tandem duplication followed by retroelement insertion generates a high potential for creating novel genes for expanding signaling networks.

<sup>1</sup>Although the vertebrate *Bcnt* (Bucentaur) gene is officially called *Cfdp1* (craniofacial developmental protein 1), its biological function remains unclear. So far, solid evidence that the gene is involved in craniofacial development has not been provided except for its unique expression during mouse tooth development (Diekwisch et al., 1999). The authors are concerned that a "wrong" naming may have caused confusion concerning the function of the *Bcnt*/*Cfdp1* gene. Thus we use the names *Bcnt*/*Cfdp1*, *p97Bcnt/Cfdp2* and *p97Bcnt-2* in this article.

Bucentaur (Bcnt) Gene Family: Gene Duplication and Retrotransposon Insertion 385

Fig. 1. Mechanism of novel gene creation by a combination of gene duplication and retrotransposition

If segmental duplication followed by retrotransposon insertion occurs, it provides a good opportunity to generate a paralogous gene, because a novel function could be acquired under the guarantee that the original gene function will remain. The schematic is a modification of the original (Makalowski, 2003)

### **2. Establishment of three** *Bcnt* **members**

### **2.1 A bovine-specific retrotransposon, Bov-B LINE**

Autonomous non-long-terminal repeat (non-LTR) retrotransposons, also termed Long-Interspersed Nuclear Elements (LINEs), have been identified in almost all eukaryotic organisms. Based on their structures and type of endonuclease, non-LTR retrotransposons are classified into two subtypes. The major subtype encodes an endonuclease with homology to AP-endonuclease (APE), thus termed APE-type non-LTR retrotransposons. These APE-type elements are now divided into four groups and eleven clades (Zingler et al., 2005). The RTE clade is one of the most widespread and shortest APE-type non-LTR retrotransposons, which are truncated forms of L1 (human LINE 1) lacking both the 5′ and 3′ regions (Fig. 2).


Fig. 2. The structural relationships among the open reading frames of retrotransposable L1 and RTE elements and three Bcnt-related proteins

The L1 and RTE elements have apurinic/apyrimidinic (AP)-endonuclease domains (yellow boxes) and reverse transcriptase domains (green boxes). The L1 element has another restriction enzyme-like endonuclease domain in the C-terminal region (dark blue bar). The square to the left of RTE indicates the ambiguity of the 5′ region. The assignment of the domains in L1 and RTE follows Malik & Eickbush (1998). The numbers above the rectangles of the three Bcnt-related proteins, Bcnt/Cfdp1, p97Bcnt/Cfdp2 and p97Bcnt2, indicate amino acid residue numbers. The latter two contain a region derived from the AP-endonuclease domain of RTE (termed the RTE domain) in the middle of their molecules. As described below, the three proteins have common acidic N-terminal regions (grey boxes) and intramolecular repeat (IR) units consisting of 40 amino acids each (orange boxes). The blue box at the C-terminus of ancestral Bcnt/Cfdp1 indicates a conserved 82-amino acid region (Bcnt-C).

Retrotransposons spread through vertical transmission, but occasionally through horizontal transmission (Gentles et al., 2007). Bov-B LINEs are order-specific RTEs that are found specifically in ruminants, where they were initially reported as a bovine *Alu*-like dimer-driven family; they potentially encode both an AP-endonuclease domain and a reverse transcriptase domain accompanied by a short interspersed repetitive element (SINE) cassette (Szemraj et al., 1995). It has been suggested that Bov-B LINEs were transferred horizontally from Squamata into an ancestral ruminant and expanded in all ruminants (Zupunski et al., 2001). *p97Bcnt*/*Cfdp2* recruited the AP-endonuclease domain of Bov-B LINE during the creation process in an ancient ruminant.

### **2.2 Discovery of a novel protein, p97Bcnt/Cfdp2**

The p97Bcnt/Cfdp2 protein was discovered in bovine brain during screening for hybridomas producing monoclonal antibodies (mAbs). In the course of a study on Ras GTPase-activating proteins (GAPs, RAS p21 protein activators, Rasa), we had attempted to generate mAbs to distinguish each GAP from among their family members (Kobayashi et al., 1993; Iwashita & Song, 2008). We used a glutathione-S-transferase (GST) fusion protein of rat Rasa2 (GAP1m) as an immunoantigen and screened for hybridomas by western blotting using bovine brain extract. We isolated five independent clones, all of which showed a single broad band with an apparent molecular mass of 97 kDa, exactly the expected size of rat Rasa2. At one time we thought we had obtained appropriate antibodies, but the target protein was entirely different from Rasa, as described below. Although we screened a bovine brain cDNA expression library by western blotting with the obtained mAbs, we could not clone the target molecule. Instead, a 97 kDa protein was isolated from bovine brain extract by affinity chromatography with the antibodies, and the amino acid sequences of its protease-digested fragments were determined. We used redundant primers designed based on the determined peptide sequences as DNA probes, and cloned the target molecule by both "rapid amplification of cDNA ends" (RACE) and screening of a bovine brain cDNA library (Nobukuni et al., 1997). The obtained clone, which had an open reading frame of 592 amino acids, was named Bcnt after bucentaur, a Greek mythical creature that is half man and half ox, implying a strange protein from bovine brain. The identified protein, named p97Bcnt (Bcnt with a molecular mass of 97 kDa), consists of an acidic N-terminal region, a retrotransposon-derived 325-amino acid region (termed the RTE domain), and two 40-amino acid intrarepeat (IR) units. The RTE domain is 72% identical to an order-specific retrotransposon, Bov-B LINE (GenBank accession number AF332697). The relationship between p97Bcnt and its estimated epitope in the mAbs, which enabled us to identify the protein, is summarized in Fig. 3. It provides a reasonable explanation as to why the unique protein was isolated by mAbs generated with a GST-fusion protein of rat Rasa2 as an immunoantigen. The estimated epitope of five independent mAbs maps to a single site in the N-terminal region of p97Bcnt/Cfdp2, and the antibodies recognize neither human BCNT/CFDP1 (Nobukuni et al., 1997) nor bovine Bcnt/Cfdp1 (Iwashita et al., 2003).

The junction region of the fusion protein between GST and truncated Rasa2 encodes a unique amino acid sequence generated by the extra nucleotides of the multiple cloning sites and a nucleotide linker for plasmid construction. Since Rasa is a highly conserved protein in mammals, the junction region might present strong antigenicity. Generally, it is hard to clone a target molecule by direct DNA screening when interspersed repetitive sequences are involved. Therefore, we first isolated a 97 kDa protein that was recognized by the accidentally generated mAbs, determined its amino acid sequence, and then screened a cDNA library with the designed oligonucleotide probes. This led to the identification of a unique protein, p97Bcnt/Cfdp2.

### **2.3 Identification of three Bcnt-related proteins**

Immediately after the identification of *p97Bcnt/Cfdp2*, we isolated its human and mouse counterparts, and examined their differences from *p97Bcnt/Cfdp2* at both the cDNA and genome levels (Nobukuni et al., 1997; Takahashi et al., 1998). The counterparts, called (ancestral) *Bcnt/Cfdp1*, have homologous acidic N-terminal regions and one IR unit of 40 amino acids, but lack the RTE domain. Instead they contain a highly conserved 82-amino acid region at the C-terminus that is not present in p97Bcnt/Cfdp2 (Fig. 2), as will be described below. Subsequently, we found that ruminants have both ancestral *Bcnt/Cfdp1* and *p97Bcnt/Cfdp2*, while other animals have only *Bcnt/Cfdp1*. The pairwise sequence alignment of bovine and human genome DNA revealed that the region encompassing the gene was duplicated in two rounds in bovines (Iwashita et al., 2003). Although automated computational annotation predicted another homolog of *p97Bcnt* (LOC514131) in the bovine genome, its 5′ UTR was different from the full-length cDNA that we isolated. Then we identified another paralog, termed *p97Bcnt-2*, in the adjacent region (Iwashita et al., 2009). The gene product, p97Bcnt2, is highly homologous to p97Bcnt/Cfdp2, comprising an acidic N-terminal region, a 324-amino acid RTE domain, and three IR units instead of the two in p97Bcnt/Cfdp2 in the C-terminal region (Fig. 2).

Fig. 3. Epitopes of the monoclonal antibodies that enabled the identification of the p97Bcnt/Cfdp2 protein

A plasmid of a fusion protein of truncated rat Rasa2 (from Ile65 to Ser847) and glutathione S-transferase was constructed using a linker (by Dr. S. Hattori) and expressed in *Escherichia coli*, and the protein was purified by glutathione-affinity column chromatography. mAbs against the fusion protein were isolated according to a conventional method. Epitope mapping was carried out using the full-length cDNA of the targeted molecule, hereafter *p97Bcnt*. Fragments of ~300 base pairs in size were expressed in a protein expression vector and screened with the obtained mAbs. Seven positive clones were isolated from among ~9 × 10<sup>3</sup> bacterial colonies, and the sequence common to all clones was determined as the possible epitope for anti-p97Bcnt antibodies (13 amino acids, RKQGRLSLDQEEE, represented by the red bar in the upper part) (Nobukuni et al., 1997). Amino acid sequences corresponding to the epitope region of rat Rasa2, bovine p97Bcnt/Cfdp2, human BCNT/CFDP1, bovine Bcnt/Cfdp1, and bovine p97Bcnt2 are aligned, and amino acid residues identical to those in the expected epitope are indicated in red.

### **3. Tandem alignment of three** *Bcnt* **gene family members**

The draft bovine genome sequence was published in 2009 (The Bovine Genome Sequencing and Analysis Consortium, 2009). The initial analysis estimated that the bovine genome contains about 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species. It has been shown that 3.1% of the bovine genome consists of recently duplicated sequences (judged by sequences ≥ 1 kb in length and ≥ 90% identity), and more than three-quarters (75-90%) of segmental duplications are organized into local tandem duplication clusters (Liu et al., 2009). It is noteworthy that cattle-specific evolutionary breakpoint regions in the chromosomes have a higher density of tandem duplications and enrichment of repetitive elements. Furthermore, it has been pointed out that bovine tandem gene duplication is significantly related to species-specific biological functions such as immunity, digestion, lactation, and reproduction (Liu et al., 2009).
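The criterion quoted from Liu et al. (2009) for recently duplicated sequences in the bovine genome (≥ 1 kb in length and ≥ 90% identity) can be written as a simple filter over pairwise self-alignments. The Python sketch below is illustrative only. The `Hit` record layout, the example coordinates, and the `max_gap_bp` cutoff used to call a duplication "local tandem" are assumptions introduced here and are not taken from that study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise self-alignment between two genomic intervals.

    Hypothetical record layout; coordinates are half-open, in base pairs.
    """
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int
    end_b: int
    identity: float  # fraction of identical aligned bases, 0.0 to 1.0

    @property
    def length(self) -> int:
        # Length of the first copy in base pairs.
        return self.end_a - self.start_a


MIN_LENGTH_BP = 1_000  # ">= 1 kb in length", as quoted in the text
MIN_IDENTITY = 0.90    # ">= 90% identity", as quoted in the text


def is_recent_duplication(hit: Hit) -> bool:
    """Apply the length and identity thresholds quoted in the text."""
    return hit.length >= MIN_LENGTH_BP and hit.identity >= MIN_IDENTITY


def is_local_tandem(hit: Hit, max_gap_bp: int = 1_000_000) -> bool:
    """ASSUMPTION: 'local tandem' means both copies lie on the same
    chromosome and are separated by at most max_gap_bp (arbitrary cutoff)."""
    if hit.chrom_a != hit.chrom_b:
        return False
    gap = max(hit.start_a, hit.start_b) - min(hit.end_a, hit.end_b)
    return gap <= max_gap_bp


def tandem_fraction(hits: list[Hit]) -> float:
    """Fraction of recent duplications that are local tandem duplications."""
    dups = [h for h in hits if is_recent_duplication(h)]
    if not dups:
        return 0.0
    return sum(is_local_tandem(h) for h in dups) / len(dups)


# Invented example alignments (not real bovine coordinates).
example_hits = [
    Hit("chr1", 10_000, 14_000, "chr1", 30_000, 34_000, 0.95),  # tandem duplication
    Hit("chr1", 10_000, 12_500, "chr2", 80_000, 82_500, 0.93),  # interchromosomal
    Hit("chr2", 500, 1_200, "chr2", 5_000, 5_700, 0.99),        # too short (700 bp)
    Hit("chr3", 0, 3_000, "chr3", 9_000, 12_000, 0.85),         # too divergent
]

print(tandem_fraction(example_hits))
```

In this toy input, two of the four alignments pass the length and identity thresholds, and one of those two has both copies on the same chromosome, so half of the recent duplications are counted as local tandem duplications. A real analysis would read alignments from whole-genome self-comparison output and would need a justified definition of "local".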
