### **1. Introduction**

In "The Origin of Species", Darwin (1859) qualitatively proposed that selectively advantageous variants accumulate gradually, taking hints from the artificial selection of domestic animals and plants and from the observation of unique species in geographically isolated regions. After the rediscovery of Mendelian heredity, the core of this proposal was substantiated by the detection of hereditary variants, i.e., mutants, and the behavior of mutants was investigated extensively, especially in *Drosophila* populations (Dobzhansky, 1941; Mayer, 1942; Huxley, 1943; Simpson, 1944). In parallel, Darwinian evolution was formulated mathematically in population genetics to estimate the probability that a spontaneously generated mutant is fixed in, or eliminated from, a population according to the positive or negative value of a selective parameter (Fisher, 1930; Wright, 1949). Although the accumulation of mutants such as those found in *Drosophila* was supposed to explain the whole process of evolution, the mutants detected at that time were mainly due to point mutations in established genes, and most of them were defective. Thus, doubts remain about whether the gradual accumulation of such mutants can give rise to radically new organs such as wings and eyes. Another criticism of the survival of the fittest in Darwinian evolution arises from the ecological fact of diversity, namely that organisms with different living styles coexist in the same area (Nowak et al., 1994).

Gene and genome sequencing, which started in the latter half of the last century, has brought new information about the evolution of organisms. First, the amino acid sequence similarities of paralogous proteins strongly suggest that the repertoire of protein functions has been expanded by gene duplication followed by nucleotide base substitutions, partial insertions and deletions, and, in some cases, domain shuffling (Ingram, 1963; Gilbert, 1978; Ferris & White, 1979). Such examples are increasing, and many protein families and superfamilies have been proposed on this basis. Second, the clustering analysis of proteomes reveals that the proteins functioning in the core part are essentially common to prokaryotes and eukaryotes, and that the decisive differences in gene repertoire between organisms are observed in the peripheral parts, which reflect different living styles (Kojima & Otsuka, 2000 a, b, c; Kojima & Otsuka, 2002). These sequence data are now compiled into databases (e.g., Wheeler et al., 2004; Birney et al., 2006).

A Theoretical Scheme of the Large-Scale Evolution by Generating New Genes from Gene Duplication


Although the importance of gene duplication in evolution was already indicated in the last century (Ohno, 1970), this indication remained a qualitative description of the circumstantial evidence for gene duplication and of the fossil record of vertebrate organs. Theoretically, a new concept is needed to formulate evolution by gene duplication, going beyond the narrow view of population genetics, which focuses only on a single mutated gene. For this purpose, the author has proposed the concept of 'biological activity', which is determined by the whole genome, and has used it to explain the divergence between the original style of organisms and a new style of organisms carrying a new gene generated from the counterpart of the duplicated genes (Otsuka, 2005; 2008). This evolution by gene duplication will be called large-scale evolution, to distinguish it from Darwinian evolution.

In this chapter, the concept of 'biological activity' is first explained, and large-scale evolution is then investigated in detail for three types of organisms, which differ in genome constitution and transmission. The genome is a single DNA molecule in most prokaryotes and a set of chromosomes in lower eukaryotes. These organisms are tentatively called monoploid organisms and constitute the first type. Some lower eukaryotes exchange homologous chromosomes through conjugation; they are treated as the second type. In higher animals and plants, each cell of the adult form carries a genome consisting of multiple pairs of homologous chromosomes, and the monoploid state appears only in the gametes (egg and sperm). These higher eukaryotes are treated as the third type and are called diploid organisms, in the sense that the present study focuses on the evolution of the characters expressed in their diploid state. The main purpose of the present study is to elucidate the differences among the three types of organisms, especially in the probabilities that two or more kinds of new genes are generated from different origins of gene duplication. The study reveals that the second type is the most suitable for generating many kinds of new genes and that the third type is next in line. Cell differentiation is a representative character that requires many kinds of genes for its expression, and the present result provides an explanation for the fact that cell differentiation started in the second type of organisms and then evolved to a higher hierarchy in the third type.

### **2. The concept of biological activity**

Although the 'biological activity' is a macroscopic quantity that generally characterizes various biological systems, such as an ecological system, an organism, or an individual cell of a multicellular organism (Otsuka, 2004, 2005, 2008), it is explained here for an organism, for the present purpose of considering the large-scale evolution of organisms by gene duplication. In general, an organism may be characterized by a set of two macrovariables: the genome size *N* and the systematization *S<sub>N</sub>* of its genes and their products. The systematization corresponds to negentropy, which should be measured for the specific arrangement of nucleotides in individual genes, the accuracy of transmitting genetic information to the amino acid sequences of proteins, the formation of metabolic pathways by enzyme functions, the regulation and control of biological processes at various levels, the cell structure constructed by the interaction of metabolic products, and, in multicellular organisms, the communication between differentiated cells. The energy acquired by an organism depends not only on the genome size *N* and the systematization *S<sub>N</sub>* but also on the material and energy source *M* available from the environment. Thus, the energy acquired by the organism during its lifetime is expressed as *E<sub>a</sub>(M; N, S<sub>N</sub>)*, which may be an increasing function of *N* and *S<sub>N</sub>* as well as of *M*. On the other hand, the organism uses the acquired energy and materials to construct biomolecules for its growth and self-reproduction. The energy *E<sub>s</sub>(N, S<sub>N</sub>)* stored in the form of biomolecules is also an increasing function of *N* and *S<sub>N</sub>*. The difference between the acquired energy and the stored energy, *E<sub>a</sub>(M; N, S<sub>N</sub>) – E<sub>s</sub>(N, S<sub>N</sub>)*, is lost as heat. According to the second law of thermodynamics, the entropy produced by this heat must compensate for the entropy reduction, i.e., –*S<sub>N</sub>*, due to the systematization. Thus, the following inequality must hold:

$$E_a(M; N, S_N) - E_s(N, S_N) - T S_N > 0 \tag{1}$$

where *T* is the temperature. In other words, the entropy production sets an upper bound on the systematization (negentropy) (Otsuka & Nozawa, 1998). Organisms must nevertheless have developed their systematization so as to increase the acquired energy through the evolutionary process of gene and genome duplication, nucleotide base substitution and selection, and this is the main problem addressed in the present study. A larger value of *E<sub>a</sub>(M; N, S<sub>N</sub>) – E<sub>s</sub>(N, S<sub>N</sub>) – TS<sub>N</sub>* means that the biological processes proceed more smoothly. In this sense, the quantity *E<sub>a</sub>(M; N, S<sub>N</sub>) – E<sub>s</sub>(N, S<sub>N</sub>) – TS<sub>N</sub>*, which an organism produces during one generation, will be called the 'biological activity' of the organism. The 'biological activity' has a thermodynamic connotation as a departure from equilibrium, but it stands in a reverse relation to the free energy of thermodynamics, which decreases upon any change in a given system through a decrease in internal energy and/or an increase in entropy. In an organism, the acquired energy is stored as chemical energy in ATP and NADH molecules and is gradually consumed in the synthesis of biomolecules under the guidance of enzymes, without drastically raising the temperature. In such moderate reactions the temperature is almost constant, and the 'biological activity' divided by the product of the Boltzmann constant *k* and the temperature *T* is considered to be approximately proportional to the self-reproducing rate of the organism, which will be denoted by *R(M; N, S<sub>N</sub>)* hereafter.
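As a rough numerical illustration of Eq. (1) and of the proportionality between 'biological activity' and self-reproducing rate, the following Python sketch evaluates both for two energy budgets. All numbers are made up for illustration, and the proportionality constant `c` is a hypothetical parameter: the text states only that the rate is approximately proportional to the activity divided by *kT*.

```python
# Boltzmann constant in J/K (exact SI value).
K_B = 1.380649e-23


def biological_activity(e_acquired, e_stored, temperature, s_n):
    """Left-hand side of Eq. (1): E_a - E_s - T * S_N, in joules."""
    return e_acquired - e_stored - temperature * s_n


def self_reproducing_rate(e_acquired, e_stored, temperature, s_n, c=1.0):
    """R taken as c * activity / (k * T).

    The text only states proportionality; c is a hypothetical constant.
    """
    activity = biological_activity(e_acquired, e_stored, temperature, s_n)
    if activity <= 0:
        raise ValueError("Eq. (1) is violated: activity must be positive")
    return c * activity / (K_B * temperature)


# Illustrative, made-up values (not taken from the source).
original = dict(e_acquired=5.0e-9, e_stored=3.0e-9, temperature=300.0, s_n=2.0e-12)
# A second organism that stores more energy while acquiring the same amount.
heavier = dict(original, e_stored=3.2e-9)

a_original = biological_activity(**original)
a_heavier = biological_activity(**heavier)
r_original = self_reproducing_rate(**original)
r_heavier = self_reproducing_rate(**heavier)
```

With these inputs the organism that stores more energy without acquiring more has the lower activity and therefore the lower self-reproducing rate.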

This concept of 'biological activity', or self-reproducing rate, is useful for formulating the large-scale evolution that arises from gene duplication and the succeeding generation of new genes. The present theory considers the following process of evolution in terms of 'biological activity'. First, the enlarged genome size *N + ΔN* due to gene duplication makes the stored energy *E<sub>s</sub>(N+ΔN, S<sub>N</sub>)* larger than *E<sub>s</sub>(N, S<sub>N</sub>)*, while the acquired energy *E<sub>a</sub>(M; N+ΔN, S<sub>N</sub>)* remains almost equal to *E<sub>a</sub>(M; N, S<sub>N</sub>)*. Thus, the 'biological activity' of a variant bearing duplicated genes becomes lower than that of the original style of organism. Moreover, the biological activity of the variant decreases further when the systematization increases from *S<sub>N</sub>* to *S<sub>N+ΔN</sub>*, as a new gene generated from the counterpart of the duplicated genes is incorporated into an extended system of regulation and control. However, such a variant with lower activity does not necessarily become extinct; it has a chance to recover as a new style of organism if the new gene begins to express a new biological function that raises the acquired energy from *E<sub>a</sub>(M; N+ΔN, S<sub>N</sub>)* to *E<sub>a</sub>(M'; N+ΔN, S<sub>N+ΔN</sub>)*, by utilizing a new material and energy source *M'* other than *M*, by moving to a new living area, or by utilizing *M* more efficiently in the case of *M' = M*. This process of large-scale evolution is formulated mathematically to estimate the probabilities of generating new genes, for the first type of organisms in section 3, for the second type in section 4 and for the third type in section 5.
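The energy bookkeeping of this scenario can be written out step by step. The following is a sketch and not part of the source: the shorthand *A* for the 'biological activity' and the difference symbols are introduced here for illustration only, and the approximate equality of the acquired energies before and after duplication is taken from the paragraph above.

```latex
\begin{align*}
A(M; N, S_N) &\equiv E_a(M; N, S_N) - E_s(N, S_N) - T S_N \\
% Step 1: duplication enlarges N; E_a is assumed to stay unchanged.
A(M; N+\Delta N, S_N) - A(M; N, S_N)
  &\approx -\{E_s(N+\Delta N, S_N) - E_s(N, S_N)\} < 0 \\
% Step 2: the new gene changes E_a, E_s and the systematization together.
A(M'; N+\Delta N, S_{N+\Delta N}) - A(M; N+\Delta N, S_N)
  &= \Delta E_a - \Delta E_s - T\,(S_{N+\Delta N} - S_N)
\end{align*}
```

Here ΔE<sub>a</sub> = *E<sub>a</sub>(M'; N+ΔN, S<sub>N+ΔN</sub>) – E<sub>a</sub>(M; N+ΔN, S<sub>N</sub>)* and ΔE<sub>s</sub> = *E<sub>s</sub>(N+ΔN, S<sub>N+ΔN</sub>) – E<sub>s</sub>(N+ΔN, S<sub>N</sub>)*. Read this way, the variant can regain activity only if the gain in acquired energy outweighs both the extra stored energy and the extra systematization term.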

### **3. Prokaryotes and lower eukaryotes in the monoploid state**

For the mathematical description, the set of variables *(N<sub>i</sub>, S<sub>Ni</sub>)* characterizing a variant *i* is denoted simply by a single variable *x<sub>i</sub>*, unless changes in its content need to be described. In a population of monoploid organisms taking a common material and energy source *M*, the number *n(x<sub>i</sub>;t)* of variants characterized by the monoploid genome *x<sub>i</sub>* obeys the following time-change equation.

$$\frac{d}{dt}n(x_i;t) = \{Q_{x_i}(t)R(M;x_i) - D(x_i)\}\,n(x_i;t) + \sum_{j(\neq i)} q_{x_i,x_j}(t)R(M;x_j)\,n(x_j;t) \tag{2}$$

where the self-reproducing rate and the death rate of the variant *x<sub>i</sub>* are denoted by *R(M;x<sub>i</sub>)* and *D(x<sub>i</sub>)*, respectively. The apparent decrease factor *Q<sub>xi</sub>(t)* in the self-reproducing rate of the variant *x<sub>i</sub>* is related to the mutation terms *q<sub>xj,xi</sub>(t)* from the variant *x<sub>i</sub>* to the other variants *x<sub>j</sub>* in the following way.

$$Q_{x_i}(t) = 1 - \sum_{j(\neq i)} q_{x_j,x_i}(t) \tag{3}$$

If the quantity *q<sub>xi,xi</sub>(t)*, defined as *Q<sub>xi</sub>(t) – 1*, is introduced, the restriction *j ≠ i* can be removed from the summation in the second term on the right side of Eq. (2). To investigate the population behavior, Eq. (2) is transformed into two equations: one for the total number of all kinds of variants, *B(t) = Σ<sub>i</sub> n(x<sub>i</sub>;t)*, and another for the fraction *f(x<sub>i</sub>;t) = n(x<sub>i</sub>;t)/B(t)* of variants *x<sub>i</sub>*.

$$\frac{d}{dt}B(t) = W_{av}(M;t)\,B(t) \tag{4}$$

$$\frac{d}{dt}f(x_i;t) = \{W(M;x_i) - W_{av}(M;t)\}\,f(x_i;t) + \sum_j q_{x_i,x_j}(t)R(M;x_j)\,f(x_j;t) \tag{5}$$

where the increase rate *W(M;x<sub>i</sub>)* of the variant *x<sub>i</sub>* and the average increase rate *W<sub>av</sub>(M;t)* of organisms in the population are defined by the following forms, respectively.

$$W(M;x_i) \equiv R(M;x_i) - D(x_i) \tag{6}$$

$$W_{av}(M;t) \equiv \sum_i W(M;x_i)\,f(x_i;t) \tag{7}$$

Strictly speaking, nucleotide base changes arise from errors in repairing damaged bases, whereas gene duplication arises from illegitimate crossing over of DNA strands upon replication. Although both are represented simply by the mutation term *q<sub>xi,xj</sub>(t)* in the above formulation, point mutation due to nucleotide base change and gene duplication are distinguished from each other in the following mathematical treatment.
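To make the behavior of Eqs. (3) and (5) to (7) concrete, the following Python sketch integrates Eq. (5) with an explicit Euler scheme for three variants. It is a minimal illustration under stated assumptions, not the author's procedure: the rates, the constant mutation terms, the step size and the function names are all chosen here for demonstration.

```python
def add_diagonal(q_off):
    """Fill q[i][i] = Q_i - 1, i.e. minus the mutation terms leaving variant i (Eq. 3)."""
    n = len(q_off)
    q = [row[:] for row in q_off]
    for i in range(n):
        q[i][i] = -sum(q_off[j][i] for j in range(n) if j != i)
    return q


def average_increase_rate(f, R, D):
    """W_av of Eq. (7), with W(M; x_i) = R_i - D_i from Eq. (6)."""
    return sum((r - d) * fi for r, d, fi in zip(R, D, f))


def euler_step(f, R, D, q, dt):
    """One explicit Euler step of Eq. (5); q[i][j] is the term from x_j to x_i."""
    n = len(f)
    w_av = average_increase_rate(f, R, D)
    new_f = []
    for i in range(n):
        selection = (R[i] - D[i] - w_av) * f[i]
        mutation = sum(q[i][j] * R[j] * f[j] for j in range(n))
        new_f.append(f[i] + dt * (selection + mutation))
    return new_f


def simulate(f0, R, D, q, dt, steps):
    """Repeat the Euler step and return the final fractions."""
    f = list(f0)
    for _ in range(steps):
        f = euler_step(f, R, D, q, dt)
    return f


# Made-up illustrative parameters: variant 0 has the highest increase rate.
R = [1.0, 0.9, 0.8]
D = [0.1, 0.1, 0.1]
q = add_diagonal([[0.0, 1e-3, 1e-3],
                  [1e-3, 0.0, 1e-3],
                  [1e-3, 1e-3, 0.0]])
f_initial = [0.01, 0.49, 0.50]
f_final = simulate(f_initial, R, D, q, dt=0.01, steps=20000)
```

Because every column of `q` sums to zero once the diagonal of Eq. (3) is included, the fractions keep summing to one. In this run the variant with the largest increase rate becomes dominant, while the other two persist at small fractions maintained by mutation.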


Darwinian evolution corresponds to evaluating the time-change of the variant fractions mainly through the first term on the right side of Eq. (5), as discussed by Eigen (1971). If the increase rate *W(M;x<sub>i</sub>)* of an occasionally generated mutant *x<sub>i</sub>* is greater than the average increase rate, that is, *W(M;x<sub>i</sub>) – W<sub>av</sub>(M;t) > 0*, the fraction *f(x<sub>i</sub>;t)* increases with time according to this first term. The increase in the fraction of such variants *x<sub>i</sub>* gradually raises the average increase rate *W<sub>av</sub>(M;t)*, which increases the total number *B(t)* of organisms according to Eq. (4), although this increase is ultimately stopped by the decrease in the available material *M*. Conversely, the fraction *f(x<sub>i</sub>;t)* decreases when *W(M;x<sub>i</sub>) – W<sub>av</sub>(M;t) < 0*. Thus, the organisms taking a common material and energy source *M* are refined by mutation and selection, and most of them finally reach the optimum increase rate, each being characterized by *x<sub>opt</sub>*. However, such Darwinian evolution may hold only for point mutations in existing genes.

The large-scale evolutionary process of generating new gene(s) from gene duplication is obtained by evaluating the fraction of variants up to the first and higher orders of the mutation term. For this purpose, Eq. (5) is formally integrated with respect to time *t*:

$$\begin{split} f(x_i;t) = {} & \exp\left[\int_0^t \{W(M;x_i) - W_{av}(M;\tau)\}\,d\tau\right] \Biggl[\int_0^t \sum_j q_{x_i,x_j}(\tau)R(M;x_j)\,f(x_j;\tau) \\ & \times \exp\left[-\int_0^{\tau} \{W(M;x_i) - W_{av}(M;\tau')\}\,d\tau'\right] d\tau + f(x_i;0)\Biggr] \end{split} \tag{8}$$

After the organisms *x<sub>opt</sub>* have become dominant in the population, *W<sub>av</sub>(M;t)* is approximately equal to *W(M;x<sub>opt</sub>)*, the fractions of variants other than *x<sub>opt</sub>* can be neglected on the right side of Eq. (8), and the mutation term *q<sub>xi,xopt</sub>(t)* is replaced by the mutation rate *q<sub>xi,xopt</sub>*, defined as the average of the mutation term over a sufficiently long time *t*, i.e.,

$$q_{x_i,x_{opt}} \equiv \frac{1}{t}\int_0^t q_{x_i,x_{opt}}(\tau)\,d\tau \tag{9}$$

Then, the fraction *f(x<sub>i</sub>)* of variants *x<sub>i</sub>* is finally related to the fraction *f(x<sub>opt</sub>)* of the dominant organisms *x<sub>opt</sub>* in the following form.

$$f(x_i) = \frac{q_{x_i,x_{opt}}\,R(M;x_{opt})}{W(M;x_{opt}) - W(M;x_i)}\,f(x_{opt}) \tag{10}$$
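One way to see how Eq. (10) arises is to take the stationary limit of Eq. (5) directly rather than working through Eq. (8). The sketch below is not the author's own derivation. It relies on the approximations stated above: the average increase rate is replaced by that of the dominant organisms, only the dominant organisms are kept in the mutation sum, and the mutation term is replaced by its time average.

```latex
\begin{align*}
0 = \frac{d}{dt} f(x_i;t)
  &\approx \{W(M;x_i) - W(M;x_{opt})\}\, f(x_i)
     + q_{x_i,x_{opt}}\, R(M;x_{opt})\, f(x_{opt}) \\
\Longrightarrow\quad
f(x_i) &= \frac{q_{x_i,x_{opt}}\, R(M;x_{opt})}{W(M;x_{opt}) - W(M;x_i)}\, f(x_{opt})
\end{align*}
```

The result is meaningful only when the increase rate of the dominant organisms exceeds that of the variant, so that the denominator is positive.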

Among such satellite variants, the variant arising from gene duplication is especially notable in that it has the potential to generate a new gene from the counterpart of the duplicated genes. If the probability of generating a new gene *I* from the duplicated part in *x<sub>i</sub>* is denoted by *q<sub>xI,xi</sub>*, a new style of organism carrying the new gene *I* is generated from the original style of organism with the following probability *P<sub>m1</sub>(x<sub>I</sub> ← x<sub>i</sub> ← x<sub>o</sub>)*.

$$P_{m1}(x_I \leftarrow x_i \leftarrow x_o) = \frac{q_{x_I,x_i}\,q_{x_i,x_o}\,R(M;x_o)}{W(M;x_o) - W(M;x_i)} \tag{11}$$

where *x<sub>opt</sub>* is rewritten as *x<sub>o</sub>* to denote the original style of organism. Here, *x<sub>i</sub>* and *x<sub>I</sub>* correspond to *(N+ΔN, S<sub>N</sub>)* and *(N+ΔN, S<sub>N+ΔN</sub>)*, respectively, in terms of the set of variables characterizing an organism in section 2.
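Eqs. (10) and (11) are simple enough to evaluate directly. The Python sketch below does so for made-up rates. The function and parameter names are hypothetical, and the mutation rates are illustrative values rather than measured ones.

```python
def satellite_fraction(q_io, r_o, w_o, w_i, f_o=1.0):
    """Eq. (10): fraction of variant x_i maintained by mutation from x_o."""
    gap = w_o - w_i
    if gap <= 0:
        raise ValueError("the original style must have the larger increase rate")
    return q_io * r_o / gap * f_o


def p_m1(q_Ii, q_io, r_o, w_o, w_i):
    """Eq. (11): probability of x_I arising from x_o through x_i."""
    return q_Ii * satellite_fraction(q_io, r_o, w_o, w_i)


# Illustrative values only (not from the source).
death_rate = 0.1
r_o, r_i = 1.0, 0.95          # duplication lowers the self-reproducing rate
w_o, w_i = r_o - death_rate, r_i - death_rate
q_io = 1e-6                   # assumed rate of gene duplication x_o -> x_i
q_Ii = 1e-8                   # assumed rate of new-gene generation x_i -> x_I

f_i = satellite_fraction(q_io, r_o, w_o, w_i)
probability = p_m1(q_Ii, q_io, r_o, w_o, w_i)
```

The smaller the loss of increase rate caused by the duplication, the larger the standing fraction of duplicated variants, and hence the larger the probability of a new gene arising from them.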

Fig. 1. The probabilities of generating new genes from gene duplication in the monoploid organism. On the basis of Eq. (20), the values of *P<sub>mn</sub>/Q<sub>n</sub>* are plotted against the twelve-fold reduction factor *12s* for *n = 1, 2, 3* and *4*. Although the value of *Q<sub>n</sub>* becomes smaller for a larger value of *n*, plotting the probability *P<sub>mn</sub>* in units of *Q<sub>n</sub>* keeps the figure compact. The probability *P<sub>m1</sub>* is present over the whole range of the reduction factor, *0 < s < 1*. As *n* increases, however, the range of the reduction factor *s* in which the probability *P<sub>mn</sub>* is present narrows to *0 < s < 1/n*.

where *q<sub>xI,xi</sub>q<sub>xi,xo</sub>* is denoted by *Q<sub>1</sub>*. In the same way, the self-reproducing rate of the variant *x<sub>ij</sub>* is denoted by *R(1 – s<sub>1</sub> – s<sub>2</sub>)* with the additional reduction factor *s<sub>2</sub>* under the condition *0 < s<sub>1</sub> + s<sub>2</sub> < 1*, and *q<sub>xIJ,xIj</sub>q<sub>xI,xi</sub>q<sub>xij,xi</sub>q<sub>xi,xo</sub>* is denoted by *Q<sub>2</sub>*. The expression of the probability (16) then becomes

$$P_{m2}(x_{IJ} \leftarrow x_{ij} \leftarrow x_o) = \frac{(1 - s_1)}{s_1(s_1 + s_2)}\,Q_2 \tag{18}$$

The expressions of the probabilities (17) and (18) are easily extended to the probability of successively generating *n* kinds of new genes in the following way.

$$P_{mn} = \frac{(1 - s_1)(1 - s_1 - s_2)\cdots(1 - s_1 - s_2 - \cdots - s_{n-1})}{s_1(s_1 + s_2)\cdots(s_1 + s_2 + \cdots + s_n)}\,Q_n \tag{19}$$

The reduction factors *s<sub>i</sub>* in Eq. (19) satisfy *0 < s<sub>1</sub> + s<sub>2</sub> + … + s<sub>n</sub> < 1* and *0 < s<sub>1</sub>, s<sub>2</sub>, …, s<sub>n</sub> < 1*. Strictly, the values of *s<sub>i</sub>* differ depending on the length of the duplicated sequences and on the order of the gene duplication events. For a simple investigation of the *n* dependence of *P<sub>mn</sub>*, however, these reduction factors are assumed to be equal to a common variable *s*. Then, the first relation becomes *0 < s < 1/n*, and Eq. (19) is reduced to

$$P_{mn} = \frac{(1 - s)(1 - 2s)(1 - 3s)\cdots\{1 - (n-1)s\}}{n!\,s^n}\,Q_n \tag{20}$$

When a biologically meaningful character is newly exhibited by two new genes generated from different origins of gene duplication, the variant, which experienced gene duplication *i*, must successively experience further gene duplication *j* in the other part of the genome to exhibit such a new character. The fraction *f(xij; t)* of such variants *xij* obeys the following equation as a special case of Eq. (5).

$$\frac{d}{dt}f(\mathbf{x}\_{ij};t) = \{W(M;\mathbf{x}\_{ij}) - W\_{av}(M;t)\} f(\mathbf{x}\_{ij};t) + q\_{xij,xi}(t)R(M;\mathbf{x}\_i)f(\mathbf{x}\_i;t) \tag{12}$$

where *qxij,xi(t)* represents the mutation term from the variant *xi* to the variant *xij* and the smaller terms including the mutation from the variant *xij* to other variants are neglected. By formally integrating Eq. (12), the fraction *f(xij)* of variants *xij* is finally expressed as

$$f(\mathbf{x}\_{ij}) = \frac{q\_{xij,xi} R(M; \mathbf{x}\_i)}{W(M; \mathbf{x}\_{opt}) - W(M; \mathbf{x}\_{ij})} f(\mathbf{x}\_i) \tag{13}$$

where *Wav (M;t)* is approximated to be *W(M;xopt)* and the mutation term *qxij,xi(t)* is replaced by the mutation rate *qxij,xi* , i. e.,

$$q\_{xij,xi} \equiv \frac{1}{t} \int\_0^t q\_{xij,xi}(\tau) d\tau \tag{14}$$
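The passage from Eq. (12) to Eq. (13) can be checked numerically. The following is a minimal sketch and not part of the original derivation: it assumes that *Wav(M;t)* is held fixed at *W(M;xopt)*, that *f(xi)* and the mutation term are constants, and that a simple forward Euler step is adequate. All function names and numerical values are hypothetical illustrations.

```python
def steady_state_fraction(w_opt, w_ij, q, r_i, f_i):
    """Closed form of Eq. (13): f(x_ij) = q R(x_i) f(x_i) / (W(x_opt) - W(x_ij))."""
    return q * r_i * f_i / (w_opt - w_ij)


def integrate_fraction(w_opt, w_ij, q, r_i, f_i, t_end=200.0, dt=0.01):
    """Forward Euler integration of Eq. (12) under the stated assumptions.

    W_av(M;t) is replaced by the constant w_opt, and f(x_i) and the
    mutation rate q are treated as constants (illustrative assumptions).
    """
    f = 0.0  # the doubly duplicated variant x_ij is absent at t = 0
    for _ in range(int(t_end / dt)):
        f += dt * ((w_ij - w_opt) * f + q * r_i * f_i)
    return f


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = dict(w_opt=1.0, w_ij=0.8, q=1e-3, r_i=0.9, f_i=0.01)
    print("numerical  :", integrate_fraction(**params))
    print("closed form:", steady_state_fraction(**params))
```

Because *W(M;xij)* is smaller than *W(M;xopt)*, the first term of Eq. (12) acts as a decay term, so the numerical solution relaxes to the value given by Eq. (13).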

By inserting the expression (10) for the fraction *f(xi)* into the right side of Eq. (13), the fraction *f(xij)* of variants *xij* is related to the fraction *f(xopt)* of dominant organisms *xopt* at the second order of mutation rates in the following form.

$$f(\mathbf{x}\_{ij}) = \frac{q\_{xij,xi} R(M; \mathbf{x}\_i)}{W(M; \mathbf{x}\_{opt}) - W(M; \mathbf{x}\_{ij})} \cdot \frac{q\_{xi,xopt} R(M; \mathbf{x}\_{opt})}{W(M; \mathbf{x}\_{opt}) - W(M; \mathbf{x}\_i)} f(\mathbf{x}\_{opt}) \tag{15}$$

Thus, a new style of the organism *xIJ* carrying new genes *I* and *J* is generated from the original style of an organism *xo* with the following probability *Pm2(xIJ* ← *xij* ← *xo)*.

$$P\_{m2}(\mathbf{x}\_{IJ} \leftarrow \mathbf{x}\_{ij} \leftarrow \mathbf{x}\_o) = \frac{q\_{xIJ,xIj}\, q\_{xI,xi}\, q\_{xij,xi} R(M;\mathbf{x}\_i)}{W(M;\mathbf{x}\_o) - W(M;\mathbf{x}\_{ij})} \cdot \frac{q\_{xi,xo} R(M;\mathbf{x}\_o)}{W(M;\mathbf{x}\_o) - W(M;\mathbf{x}\_i)} \tag{16}$$

where *qxIJ,xIj* is the probability of generating the new gene *J* from the duplicated part in *j*. This procedure can be easily extended to the general case of successively generating three or more new genes.

Before describing the result of the general case, the expression of probabilities (11) and (16) will be simplified by assuming that the gene duplication only reduces the self-reproducing rate of the variant without any influence on the death rate. When the self-reproducing rate of the original style organism is simply denoted by *R* and that of the variant *xi* is expressed as *R(1-s1)* with the reduction factor satisfying *0 < s1 < 1*, the probability (11) is simply expressed as

$$P\_{m1}(\mathbf{x}\_I \leftarrow \mathbf{x}\_i \leftarrow \mathbf{x}\_o) = \frac{Q\_1}{s\_1} \tag{17}$$


where *qxI,xiqxi,xo* is denoted by *Q1*. In the same way, the self-reproducing rate of the variant *xij* is denoted as *R(1- s1- s2)* with the additional reduction factor *s2* under the condition of *0 < s1 + s2 < 1* and *qxIJ,xIjqxI,xiqxij,xiqxi,xo* is denoted by *Q2*. The expression of the probability (16) then becomes

Fig. 1. The probabilities of generating new genes from gene duplication in the monoploid organism. On the basis of Eq. (20), the values of *Pmn/Qn* are plotted against the twelve-fold reduction factor *12s* for *n = 1, 2, 3* and *4*. Although the value of *Qn* becomes smaller for a larger value of *n,* the plotting of the probability *Pmn* in the unit of *Qn* makes the figure compact. The probability *Pm1* is present in a whole range of reduction factor *0 < s < 1*. As the number of *n* increases, however, the range of reduction factor *s*, where the probability *Pmn* is present, is narrowed to *0 < s < 1/n.*

$$P\_{m2}(\mathbf{x}\_{IJ} \leftarrow \mathbf{x}\_{ij} \leftarrow \mathbf{x}\_o) = \frac{(1 - s\_1)}{s\_1(s\_1 + s\_2)} Q\_2 \tag{18}$$
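The reduction from Eqs. (11) and (16) to Eqs. (17) and (18) can be spelled out as follows. This is a sketch under one assumption that is not restated in this section: the increase rate is taken to have the form *W = R − D*, with a death rate *D* common to *xo*, *xi* and *xij*, so that *D* cancels in every difference of increase rates.

$$W(M;\mathbf{x}\_o) - W(M;\mathbf{x}\_i) = R - R(1 - s\_1) = R s\_1, \qquad W(M;\mathbf{x}\_o) - W(M;\mathbf{x}\_{ij}) = R - R(1 - s\_1 - s\_2) = R(s\_1 + s\_2)$$

$$P\_{m1} = \frac{Q\_1 R}{R s\_1} = \frac{Q\_1}{s\_1}, \qquad P\_{m2} = Q\_2 \cdot \frac{R(1 - s\_1)}{R(s\_1 + s\_2)} \cdot \frac{R}{R s\_1} = \frac{(1 - s\_1)}{s\_1(s\_1 + s\_2)} Q\_2$$

Each further duplication step contributes one more factor of the same form, a reduced self-reproducing rate in the numerator and a cumulative sum of reduction factors in the denominator, which is the pattern generalized in Eq. (19).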

This expression of probabilities (17) and (18) is easily extended to express the probability of successively generating *n* kinds of new genes in the following way.

$$P\_{mn} = \frac{(1 - s\_1)(1 - s\_1 - s\_2) \cdots \cdots (1 - s\_1 - s\_2 - s\_3 - \cdots - s\_{n-1})}{s\_1(s\_1 + s\_2) \cdots \cdots (s\_1 + s\_2 + s\_3 + \cdots + s\_n)} Q\_n \tag{19}$$

The reduction factors *si* in Eq. (19) satisfy the relations *0 < s1 + s2 + … + sn < 1* and *0 < s1, s2, …, sn < 1*. Strictly, the values of *si* differ depending on the length of the duplicated sequences and on the order of the gene duplication events. For a simple investigation of the *n* dependence of *Pmn*, however, these reduction factors are all assumed to be equal to a single variable *s*. Then, the first relation becomes *0 < s < 1/n*, and Eq. (19) is reduced to

$$P\_{mn} = \frac{(1-s)(1-2s)(1-3s)\cdots \cdots \cdots \{1-(n-1)s\}}{n!s^n}Q\_n\tag{20}$$
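Eqs. (19) and (20) are straightforward to evaluate numerically. The sketch below is an illustration and not the authors' code: the function names are hypothetical, and the choice to raise an error outside the permitted range of the reduction factors is an assumption made here for clarity.

```python
from math import factorial, isclose


def pmn_over_qn(n, s):
    """P_mn / Q_n from Eq. (20) for a common reduction factor s, 0 < s < 1/n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if not 0.0 < s < 1.0 / n:
        raise ValueError("Eq. (20) requires 0 < s < 1/n")
    numerator = 1.0
    for k in range(1, n):  # factors (1 - s)(1 - 2s)...{1 - (n-1)s}
        numerator *= 1.0 - k * s
    return numerator / (factorial(n) * s ** n)


def pmn_over_qn_general(s_list):
    """P_mn / Q_n from Eq. (19) for individual reduction factors s_1, ..., s_n."""
    if not s_list or any(not 0.0 < s < 1.0 for s in s_list) or sum(s_list) >= 1.0:
        raise ValueError("Eq. (19) requires 0 < s_i < 1 and s_1 + ... + s_n < 1")
    cumulative = 0.0
    numerator = 1.0
    denominator = 1.0
    for k, s in enumerate(s_list):
        if k > 0:
            numerator *= 1.0 - cumulative  # (1 - s_1 - ... - s_k)
        cumulative += s
        denominator *= cumulative  # (s_1 + ... + s_{k+1})
    return numerator / denominator


if __name__ == "__main__":
    # Values on the grid 12s = 1, 2, ..., 11 used for the horizontal axis of Fig. 1.
    for n in (1, 2, 3, 4):
        row = [pmn_over_qn(n, k / 12) for k in range(1, 12) if k / 12 < 1 / n]
        print(n, [round(v, 3) for v in row])
    # With equal reduction factors, Eq. (19) reduces to Eq. (20).
    assert isclose(pmn_over_qn_general([0.2, 0.2, 0.2]), pmn_over_qn(3, 0.2))
```

The second function accepts unequal reduction factors, which allows the effect of duplications of different lengths to be explored without the equal-*s* simplification.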

On the basis of this expression (20), the probabilities *Pmn* for several values of *n* are plotted against the reduction factor *s* in Fig. 1. In the case of *n = 1*, the reduction factor *s* is permitted in the whole range of *0 < s < 1* and the probability *Pm1* of generating a new gene is present in this range. This means that the monoploid organism is suitable for creating a new gene step by step, testing the biological function of the new gene product, even if the gene size is large. As the value of *n* increases, however, the reduction factor *s* is restricted to the narrower range of *0 < s < 1/n*. When the monoploid organism simultaneously creates multiple kinds of new genes from different origins of gene duplication, therefore, these genes are obliged to be of a smaller size. Moreover, the probability *Pmn* also decreases as the value of *n* increases. This is because *Qn* becomes smaller for a larger value of *n*. Thus, it is difficult for the monoploid organism to evolve a new character which requires the expression of many kinds of new and large genes. This result is common to the prokaryote with a single DNA molecule and the lower eukaryote with multiple chromosomes, if the latter does not conjugate to exchange homologous chromosomes.

One is the case that the new gene *I* is encoded on the chromosome *C1* and both new genes *J* and *K* are encoded on another kind of chromosome *C2*. Then, the genome of the variant carrying the new gene *I* is denoted by *(C1I, C20)* and the genome of another variant carrying the new genes *J* and *K* is denoted by *(C10, C2JK)*. The conjugation of these two variants forms the zygote *(C1I, C10; C2JK, C20)*, which can produce four types of monoploid descendants, *(C1I, C2JK)*, *(C1I, C20)*, *(C10, C2JK)* and *(C10, C20)*. If the homologous chromosomes are equivalently partitioned into two daughter cells, regardless of carrying new genes or not, the new monoploid descendant *(C1I, C2JK)* is produced with the probability of *Pm1Pm2/2*.

In the second case, the new genes *J* and *K* are encoded on separate chromosomes. If the chromosome carrying the new gene *K* is denoted by *C3*, the genome of the variant carrying new genes *J* and *K* is represented by *(C10, C2J, C3K)*. The conjugation of this variant and the variant *(C1I, C20, C30)* forms the zygote *(C1I, C10; C2J, C20; C3K, C30)*. Under the random partition of homologous chromosomes, this zygote yields a new monoploid descendant *(C1I, C2J, C3K)* with the probability of *Pm1Pm2/4*.

As a whole, *3Pm1Pm2/4* is obtained for the probability *Pc3* of producing a new monoploid organism that has received three new genes by conjugation.

**4.3 The probability of producing the descendant received four new genes**

The highest probability of producing a descendant that has received four new genes is obtained by the conjugation of two variants, one carrying two new genes *I* and *J*, and another carrying two other new genes *K* and *L*. The following three cases (i) ~ (iii) can be considered. (i) The new genes *I* and *J* are encoded on the chromosome *C1* in one variant, while the new genes *K* and *L* are encoded on the chromosome *C2* in another variant. The conjugation of these two variants forms the zygote *(C1IJ, C10; C2KL, C20)*, which yields four types of monoploid descendants, *(C1IJ, C2KL)*, *(C1IJ, C20)*, *(C10, C2KL)* and *(C10, C20)*. If the homologous chromosomes are randomly partitioned into two descendants, the probability of producing the monoploid descendant *(C1IJ, C2KL)* is calculated to be *(Pm2)²/2*. (ii) The new genes *I* and *J* are encoded on the chromosome *C1* in one variant but the new genes *K* and *L* are encoded on the chromosomes *C2* and *C3*, respectively, in another variant. The conjugation of these two variants forms the zygote *(C1IJ, C10; C2K, C20; C3L, C30)*. If the homologous chromosomes of each kind *1*, *2* and *3* are randomly partitioned into two daughter cells, the probability of producing the monoploid descendant *(C1IJ, C2K, C3L)* is calculated to be *(Pm2)²/4*. (iii) The new genes *I* and *J* are encoded on the chromosomes *C1* and *C2*, respectively, in one variant, while the new genes *K* and *L* are encoded on the chromosomes *C3* and *C4*, respectively, in another variant. The conjugation of these two variants forms the zygote *(C1I, C10; C2J, C20; C3K, C30; C4L, C40)*, and yields the monoploid descendant *(C1I, C2J, C3K, C4L)* with the probability *(Pm2)²/8*.

The monoploid organism receiving four new genes can also be produced by the conjugation of a variant with one new gene *I* on the chromosome *C1* and another variant with three new genes *J*, *K* and *L*. The following three cases (iv) ~ (vi) can be considered for the location of the three new genes *J*, *K* and *L*. (iv) The three new genes are encoded on the same chromosome *C2*. In this case, the conjugation of the two variants forms the zygote *(C1I, C10; C2JKL, C20)* and yields the monoploid descendant *(C1I, C2JKL)* with the probability of *Pm1Pm3/2*. (v) The new gene *J* is encoded on the chromosome *C2* and the other two new genes *K* and *L* are encoded on the chromosome *C3*. The conjugation of these variants forms the zygote *(C1I, C10; C2J, C20; C3KL, C30)* and yields the descendant monoploid *(C1I, C2J, C3KL)* with the probability of
