
There are other GP-EDAs not belonging to either of the groups presented above. *N*-gram GP ([21]) is based on linear GP ([18]), which represents programs as instruction sequences for a register-based CPU, and learns the sub-sequences using an *N*-gram model. The *N*-gram model is very popular in NLP and considers *N* consecutive sub-sequences when calculating the probabilities of symbols. AntTAG ([1]) also shares concepts with GP-EDAs, although AntTAG does not employ a statistical inference method for probability learning; instead, it employs the ant colony optimization (ACO) method, where the pheromone matrix in ACO can be interpreted as a probability distribution.

## **3. Basics of PCFG**

In this section, we explain basic concepts of PCFG.

The context-free grammar (CFG) *G* is defined by four variables *G* = {N, T, R, B}, where the meanings of these variables are listed below.

• N: Finite set of non-terminal symbols
• T: Finite set of terminal symbols
• R: Finite set of production rules
• B: Start symbol

It is important to note that the terms "non-terminal" and "terminal" in CFG are different from those in GP (for example, in symbolic regression problems, not only the variables *x*, *y* but also sin and + are treated as terminals in CFG). In CFG, sentences are generated by applying production rules to non-terminal symbols, which are generally given by

$$A \to \alpha \quad (A \in \mathcal{N},\ \alpha \in (\mathcal{N} \cup \mathcal{T})^*). \tag{1}$$

In Equation 1, (N ∪ T)<sup>∗</sup> represents the set of possible sequences composed of elements of (N ∪ T). By applying production rules to the start symbol B, grammar *G* generates sentences. The language generated by grammar *G* is represented by *L*(*G*). If *W* ∈ *L*(*G*), then *W* ∈ T<sup>∗</sup>.

By applying a production rule, a non-terminal *A* is replaced by other symbols. For instance, applying the production rule of Equation 1 to *α*1*Aα*2 (*α*1, *α*2 ∈ (N ∪ T)<sup>∗</sup>, *A* ∈ N) yields *α*1*αα*2. In this case, it is said that "*α*1*Aα*2 derives *α*1*αα*2", and this process is represented as follows:

$$\alpha_1 A \alpha_2 \underset{G}{\Rightarrow} \alpha_1 \alpha \alpha_2.$$

Furthermore, if we have the consecutive applications

$$\alpha_1 \underset{G}{\Rightarrow} \alpha_2 \underset{G}{\Rightarrow} \cdots \underset{G}{\Rightarrow} \alpha_n \quad (\alpha_i \in (\mathcal{N} \cup \mathcal{T})^*),$$

then *α<sub>n</sub>* is derived from *α*1 (written *α*1 ⇒<sup>∗</sup><sub>*G*</sub> *α<sub>n</sub>*). This derivation process can be represented by a tree structure, which is known as a *derivation tree*. Derivation trees of grammar *G* are defined as follows.

1. Every node is an element of (N ∪ T).
2. The root is B.
3. Every branch node is an element of N.
4. If the children of *A* ∈ N are *α*1*α*2 ··· *α<sub>k</sub>* (*α<sub>i</sub>* ∈ (N ∪ T)) from the left, the production rule *A* → *α*1*α*2 ··· *α<sub>k</sub>* is an element of R.

We next explain CFG with an example. We consider a univariate function *f*(*x*) composed of sin, cos, exp, log and the arithmetic operators (+, −, × and ÷). A grammar *G*reg for this class of functions can be defined by

$$\begin{aligned} \mathcal{B} &= \{\langle expr \rangle\}, \\ \mathcal{N} &= \{\langle expr \rangle, \langle op2 \rangle, \langle op1 \rangle, \langle var \rangle, \langle const \rangle\}, \\ \mathcal{T} &= \{+, -, \times, \div, \sin, \cos, \exp, \log, x, C\}. \end{aligned}$$

We define the following production rules:

$$\begin{aligned} \langle expr \rangle &\to \langle op2 \rangle\, \langle expr \rangle\, \langle expr \rangle \ \mid\ \langle op1 \rangle\, \langle expr \rangle \ \mid\ \langle var \rangle \ \mid\ \langle const \rangle, \\ \langle op2 \rangle &\to +\ \mid\ -\ \mid\ \times\ \mid\ \div, \\ \langle op1 \rangle &\to \sin\ \mid\ \cos\ \mid\ \exp\ \mid\ \log, \\ \langle var \rangle &\to x, \qquad \langle const \rangle \to C. \end{aligned}$$
*G*reg derives univariate functions by applying the production rules. Suppose we have the following derivation:

$$
\begin{aligned}
\langle expr \rangle &\Rightarrow \langle op2 \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ \langle op2 \rangle\, \langle expr \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \langle expr \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \langle op1 \rangle\, \langle expr \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \log\, \langle expr \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \log\, \langle var \rangle\, \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \log x\ \langle expr \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \log x\ \langle var \rangle\, \langle expr \rangle \\
&\Rightarrow +\ +\ \log x\ x\ \langle expr \rangle \\
&\Rightarrow +\ +\ \log x\ x\ \langle const \rangle \\
&\Rightarrow +\ +\ \log x\ x\ C.
\end{aligned}
$$
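To make the mechanics concrete, the following minimal sketch (our illustration, not part of the original implementation; the rule-choice indices are hard-coded to reproduce the example, and the tokens `*` and `/` stand in for × and ÷) encodes *G*reg as plain Python data and replays the leftmost derivation above.

```python
# A minimal sketch of G_reg: keys are non-terminals; everything else is a
# terminal symbol of the CFG (GP functions and terminals alike).
RULES = {
    "<expr>": [["<op2>", "<expr>", "<expr>"], ["<op1>", "<expr>"],
               ["<var>"], ["<const>"]],
    "<op2>": [["+"], ["-"], ["*"], ["/"]],
    "<op1>": [["sin"], ["cos"], ["exp"], ["log"]],
    "<var>": [["x"]],
    "<const>": [["C"]],
}

def leftmost_derive(start, choices):
    """Apply one production per step to the leftmost non-terminal."""
    form = [start]
    for c in choices:
        i = next(k for k, s in enumerate(form) if s in RULES)  # leftmost NT
        form[i:i + 1] = RULES[form[i]][c]                      # replace it
        print(" ".join(form))
    return form

# Choice indices that reproduce the derivation of  + + log x x C:
final = leftmost_derive("<expr>", [0, 0, 0, 0, 1, 3, 2, 0, 2, 0, 3, 0])
assert final == ["+", "+", "log", "x", "x", "C"]
```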


**Figure 1.** (a) Derivation tree for log *x* + *x* + *C* and (b) its corresponding S-expression in GP.

In this case, the derived function is

$$f(x) = \log x + x + C,$$

and its derivation process is represented by the derivation tree in Figure 1(a).

Although functions and programs are represented with standard tree representations (S-expression) in the conventional GP (Figure 1(b)), derivation trees can express the same functions and programs. Consequently, derivation trees can be used in program evolution, and GGGP ([33, 34]) adopted derivation trees for its chromosome.

We next proceed to PCFG, which extends CFG by adding probabilities to each production rule. For example, the likelihood (probability) of the derivation tree in Fig. 1(a) is

$$\begin{split} P(W,T) &= \pi(\langle expr \rangle)\, \beta(\langle expr \rangle \to \langle op2 \rangle\, \langle expr \rangle\, \langle expr \rangle)^2\, \beta(\langle op2 \rangle \to +)^2 \\ &\quad \times \beta(\langle expr \rangle \to \langle op1 \rangle\, \langle expr \rangle)\, \beta(\langle op1 \rangle \to \log) \\ &\quad \times \beta(\langle expr \rangle \to \langle const \rangle)\, \beta(\langle expr \rangle \to \langle var \rangle)^2\, \beta(\langle const \rangle \to C)\, \beta(\langle var \rangle \to x)^2, \end{split}$$

where *W* ∈ T<sup>∗</sup> is a sentence (i.e. *W* corresponds to log *x* + *x* + *C* in *G*reg), *T* is a derivation tree, *π*(⟨expr⟩) is the probability of ⟨expr⟩ at the root, and *β*(*A* → *α*) is the probability of a production rule *A* → *α*. Furthermore, the probability *P*(*W*) of sentence *W* is given by calculating the marginal probability in terms of *T* ∈ Φ(*W*):

$$P(\mathcal{W}) = \sum\_{T \in \Phi(\mathcal{W})} P(\mathcal{W}, T), \tag{2}$$

where Φ(*W*) is the set of all possible derivation trees that derive *W*. In NLP, inference of the production rule parameters *β*(*A* → *α*) is carried out with learning data **W** = {*W*1, *W*2, ···}, which is a set of sentences. The learning data contain no information about derivation processes. Because there are many possible derivations Φ(*W*) for large sentences, directly calculating *P*(*W*) by marginalizing over Φ(*W*) (Equation 2) is computationally intractable. Consequently, a computationally efficient method called the *inside–outside algorithm*, which takes advantage of dynamic programming to reduce the computational cost, is used to estimate the parameters. In contrast to the case of NLP, the derivation trees are observed in GP-EDAs, so the parameter estimation of production rules in GP-EDAs with a plain PCFG is very easy. However, when using more complicated grammars such as PCFG-LA, more advanced estimation methods (i.e. the expectation maximization (EM) algorithm ([5])) have to be used even when the derivation trees are given.
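As a worked check of the factorization above, the sketch below (ours; the rule probabilities are assumed toy values, not taken from the chapter) multiplies π(⟨expr⟩) by the probability of every rule application in the derivation tree of Fig. 1(a).

```python
from collections import Counter

# Assumed (made-up) probabilities for the rules used in Fig. 1(a);
# each group sharing a left-hand side would sum to 1 in a full table.
beta = {
    ("<expr>", ("<op2>", "<expr>", "<expr>")): 0.3,
    ("<expr>", ("<op1>", "<expr>")): 0.2,
    ("<expr>", ("<var>",)): 0.3,
    ("<expr>", ("<const>",)): 0.2,
    ("<op2>", ("+",)): 0.4,
    ("<op1>", ("log",)): 0.1,
    ("<var>", ("x",)): 1.0,
    ("<const>", ("C",)): 1.0,
}
pi_expr = 1.0  # probability of <expr> at the root

# Rule applications in the derivation tree for log x + x + C,
# with multiplicities matching the exponents in P(W, T) above.
applications = Counter({
    ("<expr>", ("<op2>", "<expr>", "<expr>")): 2,
    ("<op2>", ("+",)): 2,
    ("<expr>", ("<op1>", "<expr>")): 1,
    ("<op1>", ("log",)): 1,
    ("<expr>", ("<const>",)): 1,
    ("<expr>", ("<var>",)): 2,
    ("<const>", ("C",)): 1,
    ("<var>", ("x",)): 2,
})

p_wt = pi_expr
for rule, count in applications.items():
    p_wt *= beta[rule] ** count

print(f"P(W, T) = {p_wt:.3e}")
```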

**Figure 2.** (a) Complete tree with annotations and (b) its observed tree.


## **4. PAGE**


Our proposed algorithm PAGE is based on PCFG-LA. In PCFG-LA, latent annotations are estimated from promising solutions using the EM algorithm, and PCFG-LA takes advantage of forward–backward probabilities for computationally efficient estimation. In this section, we describe the details of PCFG-LA, forward-backward probabilities and a parameter update formula derived from the EM algorithm.

#### **4.1. PCFG-LA**

Although the PCFG-LA used in PAGE has been developed specifically for the present application, it is essentially identical to the conventional PCFG-LA. In this section, we describe the specialized version of PCFG-LA. For further details on PCFG-LA, the reader may refer to Ref. ([17]).

PCFG-LA assumes that every non-terminal is labeled with annotations. In the complete form, non-terminals are represented by *A*[*x*], where *A* is the non-terminal symbol, *x*(∈ *H*) is an annotation (which is latent), and *H* is a set of annotations (in this paper, we take *H* = {0, 1, 2, 3, ··· , *h* − 1}, where *h* is the annotation size). Fig. 2 shows an example of a tree with annotations (a), and the corresponding observed tree (b). The likelihood of an annotated tree (complete data) is given by

$$P(T_i, X_i; \beta, \pi) = \prod_{x \in H} \pi(\mathcal{S}[x])^{\delta(x; T_i, X_i)} \prod_{r \in \mathcal{R}[H]} \beta(r)^{c(r; T_i, X_i)}, \tag{3}$$

where *Ti* denotes the *i*th derivation tree; *Xi* is the set of latent annotations of *Ti*, represented by *Xi* = {*x*<sup>1</sup><sub>*i*</sub>, *x*<sup>2</sup><sub>*i*</sub>, ···} (*x*<sup>*j*</sup><sub>*i*</sub> is the *j*th annotation of *Ti*); *π*(S[*x*]) is the probability of S[*x*] at the root position; *β*(*r*) is the probability of the annotated production rule *r* ∈ R[*H*]; *δ*(*x*; *Ti*, *Xi*) is 1 if the annotation at the root node of the complete tree *Ti*, *Xi* is *x* and 0 otherwise; *c*(S[*x*] → *α*; *Ti*, *Xi*) is the number of occurrences of rule S[*x*] → *α* in the complete tree *Ti*, *Xi*; *h* is the annotation size, which is specified in advance as a parameter; β = {*β*(S[*x*] → *α*)|S[*x*] → *α* ∈ R[*H*]}; and π = {*π*(S[*x*])|*x* ∈ *H*}. The set of annotated rules R[*H*] is given in Equation 8. We summarize the variables in Appendix B.


**Figure 3.** (a) Forward and (b) backward probabilities. The superscripts denote the indices of non-terminals (*i* in S<sup>*i*</sup>[*y*], for example).

The likelihood of an observed tree can be calculated by summing over annotations:

$$P(T_i; \beta, \pi) = \sum_{X_i} P(T_i, X_i; \beta, \pi). \tag{4}$$
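The relation between Equations 3 and 4 can be checked by brute force on a tiny complete tree. In the sketch below (ours; *h*, π and β are assumed toy values), the complete-data likelihood of Equation 3 is evaluated for every annotation assignment of a two-non-terminal tree (root rule S → a S, leaf rule S → b), and summing over all *h*² assignments gives the observed-tree likelihood of Equation 4.

```python
import itertools

h = 2  # annotation size; H = {0, 1}
pi = [0.6, 0.4]  # assumed probability of root annotation x
# beta[("a", x, y)] = prob of annotated rule S[x] -> a S[y]
# beta[("b", x)]    = prob of annotated rule S[x] -> b   (b has arity 0)
beta = {
    ("a", 0, 0): 0.5, ("a", 0, 1): 0.2, ("b", 0): 0.3,
    ("a", 1, 0): 0.1, ("a", 1, 1): 0.3, ("b", 1): 0.6,
}
# Each left-hand side normalizes: S[0]: 0.5+0.2+0.3, S[1]: 0.1+0.3+0.6.

def complete_likelihood(x_root, x_child):
    """Equation 3 for the two-non-terminal tree  S -> a S,  S -> b."""
    return pi[x_root] * beta[("a", x_root, x_child)] * beta[("b", x_child)]

# Equation 4: marginalize over all h**2 annotation assignments X_i.
p_observed = sum(complete_likelihood(x0, x1)
                 for x0, x1 in itertools.product(range(h), repeat=2))
print(f"P(T_i) = {p_observed:.4f}")
```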

PCFG-LA estimates β and π using the EM algorithm. Before explaining the estimation procedure, we should note the form of the production rules. In PAGE, production rules are not in Chomsky normal form (CNF), as is assumed in the original PCFG-LA, so that GP programs remain understandable. Any function that can be handled by traditional GP can be represented by

$$\mathcal{S} \to g\, \mathcal{S} \dots \mathcal{S}, \tag{5}$$


which is a subset of Greibach normal form (GNF). Here, S ∈ N and *g* ∈ T (N and T are the sets of non-terminal and terminal symbols in CFG; see Section 3). A terminal symbol *g* in CFG is either a function node (+, −, sin, cos ∈ F) or a terminal (*v*, *w* ∈ T) in GP, where F and T denote the sets of GP functions and terminals, respectively. Annotated production rules are

$$\mathcal{S}[x] \to g\, \mathcal{S}[z_1] \dots \mathcal{S}[z_{a_{\max}}], \tag{6}$$

where *x*, *z<sub>m</sub>* ∈ *H* and *a*max is the arity of *g* in GP. If *g* has arity *a*max, the number of parameters for the production rule S → *g* S...S with annotations is *h*<sup>*a*max+1</sup>, which increases exponentially as the arity increases. In order to reduce the number of parameters, we assume that all the right-hand side non-terminal symbols have the same annotation, that is

$$\mathcal{S}[x] \to g\, \mathcal{S}[y]\, \mathcal{S}[y] \dots \mathcal{S}[y]. \tag{7}$$

With this assumption, the number of parameters is reduced to *h*<sup>2</sup>, which is tractable. The set R[*H*] of annotated rules of this form is defined by

$$\mathcal{R}[H] = \{\mathcal{S}[x] \to g\, \mathcal{S}[y]\, \mathcal{S}[y] \dots \mathcal{S}[y] \mid x, y \in H,\ g \in \mathcal{T}\}. \tag{8}$$
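As a quick illustration of Equation 8 and of the parameter count, the snippet below (ours, with an assumed toy symbol set) enumerates R[*H*]: each CFG terminal *g* of positive arity contributes *h*² annotated rules, while each arity-0 terminal contributes only *h*.

```python
h = 3
H = range(h)
terminals = {"+": 2, "sin": 1, "x": 0}  # assumed GP symbols with arities

R_H = []
for g, arity in terminals.items():
    for x in H:
        if arity == 0:
            R_H.append((x, g))          # S[x] -> g  (no child annotation)
        else:
            for y in H:                  # shared child annotation (Equation 7)
                R_H.append((x, g, y))    # S[x] -> g S[y]...S[y]

print(len(R_H))  # 9 + 9 + 3 = 21: h**2 per function symbol, h per arity-0 symbol
```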

#### **4.2. Forward–backward probability**

**Figure 4.** Example of a derivation tree and values of the specific functions. The superscripts denote the indices of non-terminals.

We explain forward and backward probabilities for PCFG-LA in this section. PCFG-LA ([17]) adopted forward and backward probabilities to apply the EM algorithm ([5]). The backward probability *b*<sup>*i*</sup><sub>*T*</sub>(*x*; β, π) represents the probability that the tree beneath the *i*th non-terminal S[*x*] is generated (β and π are parameters; Fig. 3 (b)), and the forward probability *f*<sup>*i*</sup><sub>*T*</sub>(*y*; β, π) represents the probability that the tree above the *i*th non-terminal S[*y*] is generated (Fig. 3 (a)). Forward and backward probabilities can be recursively calculated as follows:

$$b_T^i(x; \beta, \pi) = \sum_{y \in H} \beta(\mathcal{S}[x] \to g_T^i\, \mathcal{S}[y] \dots \mathcal{S}[y]) \prod_{j \in \operatorname{ch}(i, T)} b_T^j(y; \beta, \pi), \tag{9}$$

$$\begin{split} f_T^i(y; \beta, \pi) &= \sum_{x \in H} f_T^{\operatorname{pa}(i,T)}(x; \beta, \pi)\, \beta(\mathcal{S}[x] \to g_T^{\operatorname{pa}(i,T)}\, \mathcal{S}[y] \dots \mathcal{S}[y]) \\ &\quad \times \prod_{j \in \operatorname{ch}(\operatorname{pa}(i,T),\, T),\ j \neq i} b_T^j(y; \beta, \pi) \quad (i \neq 1), \end{split} \tag{10}$$

$$f\_T^i(y; \beta, \pi) = \pi(\mathcal{S}[y]) \quad (i = 1), \tag{11}$$

where ch(*i*, *T*) is a function that returns the set of non-terminal children indices of the *i*th non-terminal in *T*, pa(*i*, *T*) returns the parent index of the *i*th non-terminal in *T*, and *g*<sup>*i*</sup><sub>*T*</sub> is the terminal symbol in CFG connected to the *i*th non-terminal symbol in *T*. For example, for the tree shown in Fig. 4, ch(3, *T*) = {5, 6}, pa(5, *T*) = 3, and *g*<sup>2</sup><sub>*T*</sub> = sin.

Using the forward–backward probabilities, *P*(*T*; β, π) can be expressed by the following two equations:

$$P(T; \beta, \pi) = \sum_{x \in H} \pi(\mathcal{S}[x])\, b_T^1(x; \beta, \pi), \tag{12}$$

$$\begin{split} P(T; \beta, \pi) &= \sum_{x, y \in H} \left\{ \beta(\mathcal{S}[x] \to g\, \mathcal{S}[y] \dots \mathcal{S}[y])\, f_T^i(x; \beta, \pi) \right. \\ &\qquad \left. \times \prod_{j \in \operatorname{ch}(i, T)} b_T^j(y; \beta, \pi) \right\} \quad (i \in \operatorname{cover}(g, T)). \end{split} \tag{13}$$

Here, cover(*g*, *Ti*) represents a function that returns a set of non-terminal indices at which the production rule generating *g* without annotations is rooted in *Ti*. For example, if *g* = + and *T* is the tree represented in Fig. 4, then cover(+, *T*) = {1, 3}.
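The backward recursion of Equation 9 and the root identity of Equation 12 can be exercised on a small tree. The following toy implementation (ours; all parameters are assumed values chosen to normalize per left-hand side) stores the tree for + *x* *x* as index tables in the spirit of Fig. 4 and recovers P(*T*; β, π).

```python
h = 2  # annotation size
pi = [0.7, 0.3]                       # assumed root-annotation probabilities
beta_plus = [[0.2, 0.3], [0.1, 0.4]]  # beta_plus[x][y]: S[x] -> + S[y] S[y]
beta_x = [0.5, 0.5]                   # beta_x[x]:       S[x] -> x
# For each x: sum_y beta_plus[x][y] + beta_x[x] == 1.

# Tree for "+ x x": non-terminals indexed 1..3, as in Fig. 4's convention.
g = {1: "+", 2: "x", 3: "x"}    # g_T^i: CFG terminal attached to node i
ch = {1: [2, 3], 2: [], 3: []}  # ch(i, T): non-terminal children of node i

def _prod(vals):
    out = 1.0
    for v in vals:
        out *= v
    return out

def backward(i, x):
    """b_T^i(x; beta, pi) via Equation 9 (arity-0 rules have no y to sum)."""
    if g[i] == "x":
        return beta_x[x]
    return sum(beta_plus[x][y] * _prod(backward(j, y) for j in ch[i])
               for y in range(h))

# Equation 12: P(T; beta, pi) = sum_x pi(S[x]) * b_T^1(x)
p_tree = sum(pi[x] * backward(1, x) for x in range(h))
print(f"P(T; beta, pi) = {p_tree:.4f}")
```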

#### **4.3. Parameter update formula**


We describe the parameter estimation in PCFG-LA. Because PCFG-LA contains latent variables *X*, the parameter estimation is carried out with the EM algorithm. Let β and π be the current parameters, and β̄ and π̄ the next-step parameters. The Q function to be optimized in the EM algorithm can be expressed as follows:

$$\mathcal{Q}(\overline{\beta}, \overline{\pi} \mid \beta, \pi) = \sum_{i=1}^{N} \sum_{X_i} P(X_i \mid T_i; \beta, \pi) \log P(T_i, X_i; \overline{\beta}, \overline{\pi}), \tag{14}$$

where *N* is the number of learning data (promising solutions in EDA). The set of learning data is represented by D ≡ {*T*1, *T*2, ··· , *TN*}. Using the forward–backward probabilities and maximizing Q(β̄, π̄ | β, π) under the constraints ∑*α* β̄(S[*x*] → *α*) = 1 and ∑*x* π̄(S[*x*]) = 1, we obtain the following update formulae:

$$\overline{\pi}(\mathcal{S}[x]) \propto \pi(\mathcal{S}[x]) \sum_{i=1}^{N} \frac{b_{T_i}^1(x; \beta, \pi)}{P(T_i; \beta, \pi)}, \tag{15}$$


$$\begin{split} \overline{\beta}(\mathcal{S}[x] \to g\, \mathcal{S}[y] \dots \mathcal{S}[y]) &\propto \beta(\mathcal{S}[x] \to g\, \mathcal{S}[y] \dots \mathcal{S}[y]) \\ &\quad \times \sum_{i=1}^{N} \left[ \frac{1}{P(T_i; \beta, \pi)} \sum_{j \in \operatorname{cover}(g, T_i)} \left\{ f_{T_i}^{j}(x; \beta, \pi) \prod_{k \in \operatorname{ch}(j, T_i)} b_{T_i}^{k}(y; \beta, \pi) \right\} \right]. \end{split} \tag{16}$$

The EM algorithm maximizes the log-likelihood given by

$$\mathcal{L}(\beta, \pi; \mathcal{D}) = \sum_{i=1}^{N} \log P(T_i; \beta, \pi). \tag{17}$$

By iteratively performing Equations 15–16, the log-likelihood monotonically increases and we obtain locally maximum-likelihood parameters. For the EM algorithm, the annotation size *h* has to be given in advance; because the EM algorithm is a point estimation method, it cannot estimate the optimum annotation size. For models that do not include latent variables, a model selection method such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) is often used. However, these methods take advantage of the asymptotic normality of estimators, which is not satisfied in models that include latent variables. In Ref. ([12]), we derived a variational Bayesian (VB) ([2]) based inference for PCFG-LA, which can estimate the optimal annotation size. Because the derivation of the VB-based algorithm is much more complicated than that of the EM algorithm and such an explanation is outside the scope of this chapter, we do not explain the details of the VB-based algorithm here. For details of VB-based PAGE, please see Ref. ([12]).

The procedures of PAGE are listed below.

1. Generate initial population

Initial population P<sup>0</sup> is generated by randomly creating *M* individuals.

2. Select promising solutions

*N* individuals D*g* are selected from the population of the *g*th generation P*g*. In our implementation, we use truncation selection.

3. Parameter estimation

Using the parameter update formula (Equations 15–16), converged parameters (β∗, π∗) are estimated with the learning data D*g*.

4. Generation of new individuals


EDA generates new individuals by sampling from the predictive posterior distributions, namely

$$P(T, X \mid \mathcal{D}_g) = P(T, X; \beta_*, \pi_*).$$

Since the EM algorithm is a point estimation method, new individuals can be generated with probabilistic logic sampling, which is computationally efficient. The details of the sampling procedure are summarized below (note: at the maximum depth limit, select terminal nodes unconditionally).


(a) A root node is selected following the probability distribution π∗ = {*π*∗(S[*x*])|*x* ∈ *H*}.

(b) If there are non-terminal symbols S[*x*] (*x* ∈ *H*) in the derivation tree, select a production rule according to the probability distribution

$$\beta_*(\mathcal{S}[x]) = \{\beta_*(\mathcal{S}[x] \to \alpha) \mid \mathcal{S}[x] \to \alpha \in \mathcal{R}[H]\}.$$

Repeat (b) until there are no non-terminal symbols left in the derivation tree.
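The sampling procedure can be sketched as follows (our illustration; the annotated-rule table is an assumed toy one). Step (a) draws the root annotation from π∗; step (b) repeatedly expands a frontier non-terminal S[*x*] with a rule drawn from β∗(S[*x*]), forcing arity-0 rules at the depth limit.

```python
import random

h = 2
MAX_DEPTH = 8
pi_star = [0.7, 0.3]  # assumed pi*(S[x])
# Assumed beta*(S[x]): list of (rule, prob); a rule is (g, arity, y).
beta_star = {
    0: [(("+", 2, 0), 0.2), (("+", 2, 1), 0.3), (("x", 0, None), 0.5)],
    1: [(("+", 2, 0), 0.1), (("+", 2, 1), 0.4), (("x", 0, None), 0.5)],
}

def sample_rule(x, depth):
    rules = beta_star[x]
    if depth >= MAX_DEPTH:  # at the depth limit, select terminals unconditionally
        rules = [(r, p) for r, p in rules if r[1] == 0]
    choices, weights = zip(*rules)
    return random.choices(choices, weights=weights)[0]

def expand(x, depth=0):
    """Probabilistic logic sampling of a derivation tree, as an S-expression."""
    g, arity, y = sample_rule(x, depth)
    if arity == 0:
        return g
    return [g] + [expand(y, depth + 1) for _ in range(arity)]

root_x = random.choices(range(h), weights=pi_star)[0]  # step (a)
print(expand(root_x))                                   # repeated step (b)
```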

## **5. Unsupervised PAGE**

In this section, we introduce UPAGE ([11]), which is a mixture-model extension of PAGE. UPAGE uses PCFG-LAMM as its baseline grammar; we explain the details of PCFG-LAMM and its parameter update formula below.

#### **5.1. PCFG-LAMM**

Although PCFG-LA is suitable for estimating local dependencies among nodes, it cannot consider global contexts behind individuals. Suppose there are two optimal solutions represented by *F*1(*x*) and *F*2(*x*). In this case, a population includes solution candidates for *F*1(*x*) and *F*2(*x*) at the same time. Since building blocks for two optimal solutions are different, model and parameter learning with one model results in slow convergence due to the mixed learning data. Furthermore in GP, there are multiple optimal structures even if the problems to be solved are not multimodal. For instance, if an optimum includes a substructure represented by sin(2*x*), sin(2*x*) as well as 2 sin(*x*) cos(*x*) which are mathematically equivalent can be building blocks, where their tree representations are different. When modeling such a mixed population, it is very difficult for PCFG-LA to estimate these multiple structures separately as in the multimodal case. We have proposed a PCFG-LAMM which is a mixture model extension of PCFG-LA and have also proposed UPAGE based on PCFG-LAMM.
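The claim that structurally different trees can encode the same function is easy to verify numerically; the snippet below (ours) checks the sin(2*x*) = 2 sin(*x*) cos(*x*) example on a grid of points.

```python
import math

# sin(2x) and 2 sin(x) cos(x) have different tree representations but are
# the same function, so a population can contain both as building blocks.
for i in range(100):
    x = -5.0 + 0.1 * i
    assert abs(math.sin(2 * x) - 2 * math.sin(x) * math.cos(x)) < 1e-12
print("identical on all test points")
```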


**Figure 5.** Illustrative description of PCFG-LAMM used in UPAGE.

PCFG-LAMM assumes that the probability distribution is a mixture of two or more PCFG-LA models. In PCFG-LAMM, each solution is considered to be sampled from one of the PCFG-LA models (Figure 5). We introduce a latent variable *z*<sup>*k*</sup><sub>*i*</sub>, where *z*<sup>*k*</sup><sub>*i*</sub> is 1 when the *i*th derivation tree is generated from the *k*th model and 0 otherwise (*Zi* = {*z*<sup>1</sup><sub>*i*</sub>, *z*<sup>2</sup><sub>*i*</sub>, ··· , *z*<sup>*μ*</sup><sub>*i*</sub>}). We summarize the variables in Appendix B. As a consequence, PCFG-LAMM handles *Xi* and *Zi* as latent variables. The likelihood of complete data is given by

$$\begin{split} P(T_i, X_i, Z_i; \beta, \pi, \zeta) &= \prod_{k=1}^{\mu} \left\{ \zeta^k P(T_i, X_i; \beta^k, \pi^k) \right\}^{z_i^k} \\ &= \prod_{k=1}^{\mu} \left\{ \zeta^k \prod_{x \in H} \pi^k(\mathcal{S}[x])^{\delta(x; T_i, X_i)} \prod_{r \in \mathcal{R}[H]} \beta^k(r)^{c(r; T_i, X_i)} \right\}^{z_i^k}, \end{split} \tag{18}$$

where *ζ*<sup>*k*</sup> is the mixture ratio of the *k*th model (ζ = {*ζ*<sup>1</sup>, *ζ*<sup>2</sup>, ··· , *ζ*<sup>*μ*</sup>}, where ∑*k* *ζ*<sup>*k*</sup> = 1); *β*<sup>*k*</sup>(*r*) and *π*<sup>*k*</sup>(S[*x*]) denote the probabilities of production rule *r* and root S[*x*] of the *k*th model, respectively. By calculating the marginal of Equation 18 with respect to *Xi* and *Zi*, the likelihood of an observed tree *Ti* is calculated as

$$\begin{split} P(T_i; \beta, \pi, \zeta) &= \sum_{k=1}^{\mu} \left\{ \zeta^k P(T_i; \beta^k, \pi^k) \right\} \\ &= \sum_{k=1}^{\mu} \left\{ \zeta^k \sum_{x \in H} \pi^k(\mathcal{S}[x])\, b_{T_i}^1(x; \beta^k, \pi^k) \right\}. \end{split} \tag{19}$$

#### **5.2. Parameter update formula**

As in PCFG-LA, the parameter inference of PCFG-LAMM is carried out via the EM algorithm because PCFG-LAMM contains latent variables *Xi* and *Zi*. Let β, π and ζ be the current parameters, and β̄, π̄ and ζ̄ the next-step parameters. The Q function of the EM algorithm is given by

$$\mathcal{Q}(\overline{\beta}, \overline{\pi}, \overline{\zeta} \mid \beta, \pi, \zeta) = \sum_{i=1}^{N} \sum_{X_i} \sum_{Z_i} P(X_i, Z_i \mid T_i; \beta, \pi, \zeta) \log P(T_i, X_i, Z_i; \overline{\beta}, \overline{\pi}, \overline{\zeta}). \tag{20}$$

By maximizing Q(β̄, π̄, ζ̄ | β, π, ζ) under the constraints ∑*k* ζ̄<sup>*k*</sup> = 1, ∑*α* β̄<sup>*k*</sup>(S[*x*] → *α*) = 1 and ∑*x* π̄<sup>*k*</sup>(S[*x*]) = 1, a parameter update formula can be obtained as follows (see Appendix B):

$$\begin{split} \overline{\beta}^k(\mathcal{S}[x] \to g\, \mathcal{S}[y] \cdots \mathcal{S}[y]) \propto \sum_{i=1}^{N} &\left\{ \frac{\beta^k(\mathcal{S}[x] \to g\, \mathcal{S}[y] \cdots \mathcal{S}[y])}{P(T_i; \beta, \pi, \zeta)}\, \zeta^k \right. \\ &\left. \times \sum_{\ell \in \operatorname{cover}(g, T_i)} f_{T_i}^{\ell}(x; \beta^k, \pi^k) \prod_{j \in \operatorname{ch}(\ell, T_i)} b_{T_i}^{j}(y; \beta^k, \pi^k) \right\}, \end{split} \tag{21}$$

$$\overline{\pi}^k(\mathcal{S}[x]) \propto \sum_{i=1}^{N} \left\{ \frac{\pi^k(\mathcal{S}[x])}{P(T_i; \beta, \pi, \zeta)}\, \zeta^k\, b_{T_i}^1(x; \beta^k, \pi^k) \right\}, \tag{22}$$


$$\overline{\zeta}^k \propto \sum_{i=1}^{N} \left\{ \frac{\zeta^k P(T_i; \beta^k, \pi^k)}{P(T_i; \beta, \pi, \zeta)} \right\}. \tag{23}$$

The parameter inference starts from some initial values and converges to a local optimum by iterating Equations 21–23. The log-likelihood is given by

$$\mathcal{L}(\beta, \pi, \zeta; \mathcal{D}) = \sum\_{i=1}^{N} \log P(T\_i; \beta, \pi, \zeta). \tag{24}$$
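To illustrate the shape of the mixture-ratio update, the sketch below (ours; the per-model tree likelihoods are assumed toy numbers) evaluates the right-hand side of Equation 23 and renormalizes. Each term is the posterior responsibility of model *k* for tree *Ti*, so ζ̄ becomes the average responsibility over the learning data.

```python
# Assumed toy values: P(T_i; beta^k, pi^k) for N = 3 trees, mu = 2 models.
p_model = [
    [0.010, 0.002],  # tree 1 under model 1, model 2
    [0.001, 0.008],  # tree 2
    [0.009, 0.003],  # tree 3
]
zeta = [0.5, 0.5]    # current mixture ratios

# Equation 23: zeta_bar^k  prop.to  sum_i zeta^k P(T_i; beta^k, pi^k) / P(T_i; beta, pi, zeta)
unnorm = [0.0, 0.0]
for row in p_model:
    p_mix = sum(z * p for z, p in zip(zeta, row))  # Equation 19 mixture likelihood
    for k in range(2):
        unnorm[k] += zeta[k] * row[k] / p_mix

zeta_bar = [u / sum(unnorm) for u in unnorm]
print(zeta_bar)  # responsibility-weighted share of each model
```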

The procedures of UPAGE are listed below.

1. Generate initial population


Initial population P<sup>0</sup> is generated by randomly creating *M* individuals. In our implementation, the ratio between production rules of function nodes (e.g. S[*x*] → + S[*y*] S[*y*]) and those of terminal nodes (e.g. S[*x*] → *x*) is set to 4 : 1.

2. Select promising solutions

*N* individuals D*g* are selected from the population of the *g*th generation P*g*. In our implementation, we used truncation selection.

3. Parameter estimation

Using the parameter update formula (Equations 21–23), converged parameters (β∗, π∗, ζ∗) are estimated with the learning data D*g*.

4. Generation of new individuals


EDA generates new individuals by sampling from the predictive posterior distributions, namely

*P*(*T*, *X*, *Z*|D*g*) = *P*(*T*, *X*, *Z*;β∗, π∗, ζ∗).

Since the EM algorithm is a point estimation method, new individuals can be generated with probabilistic logic sampling, which is computationally cheap. The details of the sampling procedure are summarized below (note: at the maximum depth limit, select a terminal node unconditionally).


(a) Select a model following the probability distribution ζ∗ = {*ζ*<sup>1</sup>∗, *ζ*<sup>2</sup>∗, ··· , *ζ*<sup>*μ*</sup>∗}.

(b) Let the selected model index be ℓ. A root node is selected following the probability distribution π<sup>ℓ</sup>∗ = {*π*<sup>ℓ</sup>∗(S[*x*])|*x* ∈ *H*}.

(c) If there are non-terminal symbols S[*x*] (*x* ∈ *H*) in the derivation tree, select a production rule following the probability distribution

$$\beta_*^{\ell}(\mathcal{S}[x]) = \{\beta_*^{\ell}(\mathcal{S}[x] \to \alpha) \mid \mathcal{S}[x] \to \alpha \in \mathcal{R}[H]\}.$$

Repeat (c) until there are no non-terminal symbols left in the derivation tree.

#### **5.3. Computer experiments**

In order to show the effectiveness of UPAGE, we analyze UPAGE from the viewpoint of the number of fitness evaluations. We applied UPAGE to three benchmark problems: the royal tree problem (Section 5.3.1), the bipolar royal tree problem (Section 5.3.2) and the deceptive MAX (DMAX) problem (Section 5.3.3). Because we want to study the effectiveness of the mixture model versus PCFG-LA, we specifically compared UPAGE with PAGE. In each benchmark test, we employed the parameter settings shown in Table 1, where UPAGE and PAGE used the same population size, elite rate and selection rate.


For the method-specific parameters of PAGE and UPAGE, we determined *h* and *μ* so that the number of parameters to be estimated is almost the same in UPAGE and PAGE. In the three benchmark problems, we ran UPAGE and PAGE 30 times each to compare the number of fitness evaluations, and we performed the Welch *t*-test (two-tailed) to determine statistical significance.

| | Meaning | Royal Tree | Bipolar Royal Tree | DMAX |
|---|---|---|---|---|
| **PAGE and UPAGE** | | | | |
| *M* | Population size | 1000 | 3000 | 3000 |
| *Ps* | Selection rate | 0.1 | 0.1 | 0.1 |
| *Pe* | Elite rate | 0.01 | 0.01 | 0.01 |
| **UPAGE** | | | | |
| *h* | Annotation size | 11 | 22 | 22 |
| *μ* | Number of mixtures | 2 | 2 | 2 |
| **PAGE** | | | | |
| *h* | Annotation size | 16 | 32 | 32 |

**Table 1.** Main parameter settings of UPAGE and PAGE.

#### *5.3.1. Royal tree problem*

We apply UPAGE to the royal tree problem ([22]), which has only one optimal solution. The royal tree problem is a popular benchmark problem in GP. The royal tree problem is suitable for analyzing GP because the optimal structure of the royal tree is composed of smaller substructures (building blocks), and hence it well reflects the behavior of GP.

The royal tree problem defines the state *perfect tree* at each level. The perfect tree at a given level is composed of perfect trees one level smaller; thus, the perfect tree of level *c* is composed of perfect trees of level *b*. In a perfect tree, the alphabetic function labels descend by one from the root toward the leaves, and a function *a* has a terminal *x* as its child. The fitness function of the royal tree problem is given by

$$Score(\mathcal{X}_i) = wb_i \sum_{j} \left( wa_{ij} \times Score(\mathcal{X}_{ij}) \right), \tag{25}$$

where X<sub>*i*</sub> is the *i*th node in a tree structure, and X<sub>*ij*</sub> denotes the *j*th child of X<sub>*i*</sub>. The fitness value of the royal tree problem is calculated recursively from the root node. In Equation 25, *wb<sub>i</sub>* and *wa<sub>ij</sub>* are weights, which are defined as follows:

• *wa<sub>ij</sub>*
	- *Full Bonus* = 2: if a subtree rooted at X<sub>*ij*</sub> has a correct root and is a perfect tree.
	- *Partial Bonus* = 1: if a subtree rooted at X<sub>*ij*</sub> has a correct root but is not a perfect tree.
	- *Penalty* = 1/3: if X<sub>*ij*</sub> is not a correct root.
• *wb<sub>i</sub>*
	- *Complete Bonus* = 2: if a subtree rooted at X<sub>*i*</sub> is a perfect tree.
	- *Otherwise* = 1.
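To make Equation 25 concrete, here is a small recursive sketch (ours, reflecting our reading of the royal tree definition in [22] with the weights listed above); it scores the perfect level-*b* tree b(a(x), a(x)).

```python
from dataclasses import dataclass, field

ARITY = {"a": 1, "b": 2, "c": 3, "d": 4}
BELOW = {"a": "x", "b": "a", "c": "b", "d": "c"}  # correct child label per function

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def perfect(n: Node) -> bool:
    """A perfect tree has the correct child label everywhere beneath it."""
    if n.label == "x":
        return True
    return all(c.label == BELOW[n.label] and perfect(c) for c in n.children)

def score(n: Node) -> float:
    """Equation 25 with the wa/wb weights defined above."""
    if n.label == "x":
        return 1.0
    wb = 2.0 if perfect(n) else 1.0
    total = 0.0
    for c in n.children:
        if c.label == BELOW[n.label]:
            wa = 2.0 if perfect(c) else 1.0  # full vs partial bonus
        else:
            wa = 1.0 / 3.0                   # penalty for an incorrect root
        total += wa * score(c)
    return wb * total

t = Node("b", [Node("a", [Node("x")]), Node("a", [Node("x")])])
print(score(t))  # perfect level-b tree: 2 * (2*4 + 2*4) = 32.0
```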




In the present chapter, we employ the following GP functions and terminals:

$$\mathcal{F} = \{a, b, c, d\}, \qquad \mathcal{T} = \{x\}.$$

Here, F and T denote function and terminal sets, respectively, of GP. For details of the royal tree problem, please see Ref. ([22]).

| | Average number of fitness evaluations | Standard deviation |
|---|---|---|
| UPAGE | 6171 | 28 |
| PAGE | 6237 | 18 |

P-value of *t*-test (Welch, two-tailed): 0.74

**Table 2.** The number of fitness evaluations, standard deviation and P-value of the *t*-test in the royal tree problem.

Table 2 shows the average number of fitness evaluations (along with their standard deviation) and the P-value of a *t*-test (Welch, two-tailed). As can be seen in Table 2, there is no noticeable difference between UPAGE and PAGE in the average number of fitness evaluations, which is confirmed by the P-value of the *t*-test. The royal tree problem is not multimodal, and hence the optimal solution has only one tree expression. Consequently, there is no need to consider global contexts behind the optimal solutions, which is precisely the advantage UPAGE has over PAGE; this explains why the two methods perform comparably on this problem.

#### *5.3.2. Bipolar royal tree problem*

We next apply UPAGE to the bipolar royal tree problem. In the field of GA-EDAs, a mixture-model-based method, UEBNA, was proposed and reported to be especially effective in multimodal problems such as the two-max problem. Consequently, we apply UPAGE to a bipolar problem having two optimal solutions, which is a multimodal extension of the royal tree problem. In order to make the royal tree problem multimodal, we set $\mathcal{T} = \{x, y\}$ and *Score*(*x*) = *Score*(*y*) = 1. With this setting, the royal tree problem has two optimal solutions, the *x*-optimum (Fig. 7(a)) and the *y*-optimum (Fig. 7(b)). PAGE and UPAGE stop when either of the two optimal solutions is obtained.

**Figure 6.** Example of fitness calculation in the bipolar royal tree problem. (a) Derivation tree and (b) S-expression.

**Figure 7.** (a) Optimum structure of *x* and (b) that of *y* in the bipolar royal tree problem. These two structures have the same fitness value.

Table 3 shows the average number of fitness evaluations along with the standard deviations. We see that UPAGE obtains an optimal solution with a smaller number of fitness evaluations than PAGE. Table 3 also gives the P-value of the *t*-test (Welch, two-tailed), which allows us to say that the difference between UPAGE and PAGE is statistically significant.

| | Average number of fitness evaluations | Standard deviation |
|---|---|---|
| UPAGE | 25839 | 4737 |
| PAGE | 31878 | 4333 |
| P-value of *t*-test (Welch, two-tailed) | 4.49 × 10<sup>−6</sup> | |

**Table 3.** The number of fitness evaluations, standard deviation and P-value of the *t*-test in the bipolar royal tree problem.

Because the bipolar royal tree problem has two optimal solutions (*x* and *y*), PAGE learns the production rule probabilities from learning data containing solution candidates of both the *x* and the *y* optima. Let us consider the annotation size required to express the optimal solutions of the bipolar royal tree problem of depth 5. In the case of PAGE, the minimum annotation size required to learn the two optimal solutions separately is 10. In contrast, UPAGE can express the two optimal solutions with mixture size 2 and annotation size 5, which results in a smaller number of parameters. This consideration shows that a mixture model is more suitable for this class of problems.

Figure 8 shows the increase in the log-likelihood for the bipolar royal tree problem, in particular the transitions at generation 0 and generation 5. As can be seen from the figure, the log-likelihood converges after about 10 iterations. The log-likelihood improvement at generation 5 is larger than that at generation 0 because the tree structures have converged toward the end of the search.

**Figure 8.** Transitions of the log-likelihood of UPAGE in the bipolar royal tree problem.

#### *5.3.3. DMAX Problem*

We apply UPAGE to the DMAX problem ([8, 10]), which is deceptive when solved with GP. The main objective of the DMAX problem is identical to that of the original MAX problem: to find the function that returns the largest *real* value under a limitation on the maximum tree depth. However, the symbols used in the DMAX problem are different from those used in the MAX problem. The DMAX problem has three parameters, and its difficulty can be tuned using them. For the problem of interest in the present chapter, we selected *m* = 3 and *r* = 2, whose deceptiveness is of medium degree. In this setting, the GP functions and terminals are

$$
\mathcal{F} = \{+_3, \times_3\}, \qquad \mathcal{T} = \{0.95, -1\},
$$

where $+_3$ and $\times_3$ are the 3-arity addition and multiplication operators, respectively. The optimal solution in the present setting is given by

$$
(-1 \times 3)^{26} \, (0.95 \times 3) \simeq 7.24 \times 10^{12}. \tag{26}
$$
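Equation 26 is easy to verify numerically: reading the factors as twenty-six sums $(-1) + (-1) + (-1) = -3$ and one sum $0.95 + 0.95 + 0.95 = 2.85$ (26 is even, so the product is positive), one line of arithmetic reproduces the stated value. A quick check of our own:

```python
# Numerical check of Equation 26: twenty-six factors of -3 and one factor of 2.85.
value = (-3.0) ** 26 * 2.85
print(f"{value:.3e}")  # prints 7.244e+12, i.e. about 7.24 x 10^12
```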

Table 4 shows the average number of fitness evaluations along with the standard deviations for the DMAX problem. We can see that UPAGE obtained the optimal solution with a smaller number of fitness evaluations than PAGE. Table 4 also gives the P-value of the *t*-test (Welch, two-tailed), which allows us to say that the difference between the averages of UPAGE and PAGE is statistically significant.

| | Average number of fitness evaluations | Standard deviation |
|---|---|---|
| UPAGE | 36729 | 3794 |
| PAGE | 38709 | 2233 |
| P-value of *t*-test (Welch, two-tailed) | 1.94 × 10<sup>−2</sup> | |

**Table 4.** The number of fitness evaluations, standard deviation and P-value of the *t*-test in the DMAX problem.

In the bipolar royal tree problem, the expressions of the two optimal solutions (*x* or *y*) are different, and thus the building blocks of the optima are also different. In contrast, the DMAX problem has mathematically only one optimal solution, which is represented by Equation 26. Although the DMAX problem is unimodal, it has different tree expressions for the optimal solution due to the commutative operators $+_3$ and $\times_3$. From this experiment, we see that UPAGE is superior to PAGE for this class of benchmark problems.

**Figure 9.** The average number of fitness evaluations (smaller is better) in the royal tree, bipolar royal tree and DMAX problems relative to those of PAGE (i.e., the PAGE results are normalized to 1).
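As a sanity check, the significance figures in Tables 3 and 4 can be recomputed from the reported summary statistics alone, assuming the 30 runs per method stated at the beginning of this section; because the tabulated means and standard deviations are rounded, the recomputed P-values agree with the reported ones only up to rounding.

```python
# Recomputing Welch's t-test from the summary statistics of Tables 3 and 4
# (30 runs per method, as stated in the text). Rounding of the reported
# means/SDs shifts the P-values slightly, but the orders of magnitude
# (1e-6 and 1e-2) match the tables.
from scipy.stats import ttest_ind_from_stats

for label, (m1, s1, m2, s2) in {
    "bipolar royal tree (Table 3)": (25839, 4737, 31878, 4333),
    "DMAX (Table 4)": (36729, 3794, 38709, 2233),
}.items():
    t, p = ttest_ind_from_stats(m1, s1, 30, m2, s2, 30, equal_var=False)
    print(f"{label}: t = {t:.2f}, two-tailed p = {p:.2e}")
```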


#### **5.4. Multimodal problem**

In the preceding section, we evaluated the performance of UPAGE from the viewpoint of the average number of fitness evaluations. In this section, we show the effectiveness of UPAGE in terms of its capability for obtaining multiple solutions of a multimodal problem. Because there are two optimal solutions in the bipolar royal tree problem (see Fig. 7(a) and (b)), we show that UPAGE can obtain both optimal solutions in a single run. The parameter settings are shown in Table 5.

| Parameter | Meaning | Bipolar Royal Tree |
|---|---|---|
| *M* | Population size | 6000 |
| *P<sub>s</sub>* | Selection rate | 0.3 |
| *P<sub>e</sub>* | Elite rate | 0.1 |
| *h* (UPAGE) | Annotation size | 16 |
| *μ* (UPAGE) | Number of mixtures | 4 |
| *h* (PAGE) | Annotation size | 32 |

**Table 5.** Parameter settings for a multimodal problem.

Table 6 shows the number of successful runs in which both optimal solutions are obtained in a single run. As can be seen in Table 6, UPAGE succeeded in obtaining both optimal solutions in 10 out of 15 runs, whereas PAGE could not obtain them at all.

| | Successful runs / Total runs |
|---|---|
| UPAGE | 10/15 |
| PAGE | 0/15 |

**Table 6.** The number of runs which could obtain both optimal solutions. We carried out 15 runs in total.

Table 7 shows the production rule probabilities of UPAGE in a successful run. Although the mixture size is *μ* = 4, we present only the probabilities of Model = 0 and Model = 3, which are related to the optimal solutions of *y* (Fig. 7(b)) and *x* (Fig. 7(a)), respectively (i.e., Model = 1 and Model = 2 are not shown). Because the probabilities generating *y* in Model = 0 are very high, we consider that the optimal solution of *y* was generated by Model = 0. On the other hand, it is estimated that the optimal solution of *x* was generated by Model = 3. From this probability table, we can confirm that UPAGE successfully estimated the mixed population separately, because Models 3 and 0 can generate the optimal solutions of *x* and *y*, respectively, with relatively high probability. It is very difficult for PAGE to estimate multiple solutions because PCFG-LA is not a mixture model, which makes it almost impossible to learn the distributions separately. As shown in Section 5.3, UPAGE is superior to PAGE in terms of the number of fitness evaluations; from Table 7, we consider that this superiority stems from UPAGE's capability of learning the distributions separately.

| Model = 3 | Pr |
|---|---|
| *ζ*<sub>3</sub> | 0.52 |
| S[11] | 1.00 |
| S[0] → *a* S[13] | 0.16 |
| S[0] → *a* S[2] | 0.29 |
| S[0] → *a* S[5] | 0.32 |
| S[1] → *b* S[0] S[0] | 0.13 |
| S[1] → *b* S[14] S[14] | 0.19 |
| S[1] → *b* S[3] S[3] | 0.15 |
| S[1] → *b* S[7] S[7] | 0.17 |
| S[1] → *b* S[8] S[8] | 0.32 |
| S[10] → *c* S[1] S[1] S[1] | 1.00 |
| S[11] → *d* S[10] S[10] S[10] S[10] | 1.00 |
| S[12] → *a* S[4] | 0.13 |
| S[12] → *c* S[13] S[13] S[13] | 0.34 |
| S[12] → *x* | 0.13 |
| S[13] → *x* | 0.72 |
| S[13] → *y* | 0.28 |
| S[14] → *a* S[15] | 0.16 |
| S[14] → *a* S[4] | 0.10 |
| S[14] → *a* S[5] | 0.45 |
| S[14] → *a* S[6] | 0.13 |
| S[15] → *x* | 0.89 |
| S[15] → *y* | 0.11 |
| S[2] → *x* | 0.99 |
| S[3] → *a* S[13] | 0.11 |
| S[3] → *a* S[15] | 0.14 |
| S[3] → *a* S[2] | 0.20 |
| S[3] → *a* S[5] | 0.44 |
| S[4] → *x* | 0.68 |
| S[4] → *y* | 0.32 |
| S[5] → *x* | 0.92 |
| S[6] → *x* | 0.93 |
| S[7] → *a* S[13] | 0.23 |
| S[7] → *a* S[2] | 0.31 |
| S[7] → *a* S[4] | 0.10 |
| S[7] → *a* S[5] | 0.29 |
| S[8] → *a* S[2] | 0.17 |
| S[8] → *a* S[4] | 0.18 |
| S[8] → *a* S[5] | 0.41 |
| S[8] → *a* S[6] | 0.16 |
| S[9] → *a* S[13] | 0.19 |
| S[9] → *a* S[4] | 0.19 |
| S[9] → *a* S[5] | 0.38 |

| Model = 0 | Pr |
|---|---|
| *ζ*<sub>0</sub> | 0.11 |
| S[1] | 1.00 |
| S[0] → *a* S[10] | 0.20 |
| S[0] → *a* S[2] | 0.18 |
| S[0] → *a* S[5] | 0.28 |
| S[1] → *d* S[4] S[4] S[4] S[4] | 1.00 |
| S[10] → *x* | 0.14 |
| S[10] → *y* | 0.86 |
| S[11] → *x* | 0.14 |
| S[11] → *y* | 0.86 |
| S[12] → *a* S[10] | 0.17 |
| S[12] → *a* S[2] | 0.18 |
| S[12] → *a* S[5] | 0.32 |
| S[13] → *x* | 0.21 |
| S[13] → *y* | 0.79 |
| S[14] → *b* S[7] S[7] | 0.10 |
| S[14] → *c* S[10] S[10] S[10] | 0.15 |
| S[15] → *x* | 0.12 |
| S[15] → *y* | 0.88 |
| S[2] → *x* | 0.25 |
| S[2] → *y* | 0.75 |
| S[3] → *a* S[10] | 0.21 |
| S[3] → *a* S[15] | 0.18 |
| S[3] → *a* S[2] | 0.17 |
| S[3] → *a* S[5] | 0.22 |
| S[4] → *c* S[8] S[8] S[8] | 1.00 |
| S[5] → *y* | 0.97 |
| S[6] → *y* | 1.00 |
| S[7] → *x* | 0.52 |
| S[7] → *y* | 0.48 |
| S[8] → *b* S[0] S[0] | 0.50 |
| S[8] → *b* S[12] S[12] | 0.17 |
| S[8] → *b* S[3] S[3] | 0.31 |
| S[9] → *x* | 0.14 |
| S[9] → *y* | 0.86 |

**Table 7.** Estimated parameters by UPAGE in a successful run. Although the number of mixtures is *μ* = 4, we show only Model = 0 and Model = 3, which are related to the optimal solutions of *y* and *x*, respectively. Due to limited space, we do not show parameters of production rules whose probabilities are smaller than 0.1.
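To make the reading of Table 7 concrete — that Model = 0 is the component responsible for the *y*-optimum — one can sample derivations directly from its rule table, starting from annotation S[1], to which Table 7 assigns root probability 1.00 in Model = 0. The sketch below uses a subset of the Model = 0 probabilities above (renormalized, since rules below 0.1 are omitted from the table); the encoding is ours. Nearly every sampled tree is the depth-5 *d*–*c*–*b*–*a* skeleton of the optimum, with predominantly *y* leaves.

```python
import random

# A subset of the Model = 0 rules of Table 7, written as
# annotated nonterminal -> list of (weight, symbol, child annotations).
# Weights are renormalized because Table 7 omits probabilities below 0.1.
RULES = {
    "S[1]":  [(1.00, "d", ["S[4]"] * 4)],
    "S[4]":  [(1.00, "c", ["S[8]"] * 3)],
    "S[8]":  [(0.50, "b", ["S[0]"] * 2), (0.17, "b", ["S[12]"] * 2),
              (0.31, "b", ["S[3]"] * 2)],
    "S[0]":  [(0.20, "a", ["S[10]"]), (0.18, "a", ["S[2]"]), (0.28, "a", ["S[5]"])],
    "S[12]": [(0.17, "a", ["S[10]"]), (0.18, "a", ["S[2]"]), (0.32, "a", ["S[5]"])],
    "S[3]":  [(0.21, "a", ["S[10]"]), (0.18, "a", ["S[15]"]),
              (0.17, "a", ["S[2]"]), (0.22, "a", ["S[5]"])],
    "S[10]": [(0.14, "x", []), (0.86, "y", [])],
    "S[2]":  [(0.25, "x", []), (0.75, "y", [])],
    "S[5]":  [(0.97, "y", [])],
    "S[15]": [(0.12, "x", []), (0.88, "y", [])],
}

def sample(annotation="S[1]"):
    """Expand one annotated nonterminal into an S-expression string."""
    weights, options = zip(*[(w, (g, kids)) for w, g, kids in RULES[annotation]])
    g, kids = random.choices(options, weights=weights)[0]  # renormalizes weights
    if not kids:
        return g
    return "(" + g + " " + " ".join(sample(k) for k in kids) + ")"

random.seed(0)
print(sample())  # e.g. (d (c (b (a y) (a y)) ...) ...) -- leaves are mostly y
```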

### **6. Discussion**


In the present chapter, we have introduced PAGE and UPAGE. PAGE is based on PCFG-LA, which takes latent annotations into account to weaken the context-freedom assumption. By considering latent annotations, dependencies among nodes can be captured. We reported in Ref. ([12]) that PAGE is more powerful than other GP-EDAs, including GMPE and POLE, on several benchmark tests.

Although PCFG-LA is suitable for estimating dependencies among local nodes, it cannot consider global contexts (contexts of entire tree structures) behind individuals. In many real-world problems, not only local dependencies but also global contexts have to be taken into account. In order to consider global contexts, we have proposed UPAGE by extending PCFG-LA into a mixture model (PCFG-LAMM). In the bipolar royal tree problem, there are two optimal structures, of *x* and of *y*, and the global context represents which optimum (*x* or *y*) each tree structure comes from. As Table 7 shows, the mixture model of UPAGE worked successfully, and UPAGE could estimate the mixed population separately. We have also shown that a mixture model is effective not only in multimodal problems but also in some unimodal problems, namely the DMAX problem. Although the optimal solution of the DMAX problem is represented by mathematically one expression, its tree expressions are not unique, due to the commutative operators ($\times_3$ and $+_3$). Consequently, the mixture model is also effective in the DMAX problem (see Section 5.3.3), and this kind of expression diversity often arises in real-world problems. When obtaining multiple optimal solutions in a single run, UPAGE succeeded in cases for which PAGE obtained only one of the optima.



This result shows that UPAGE is more effective than PAGE not only quantitatively but also qualitatively. We also note that UPAGE is more powerful than PAGE in terms of computational time. In our computer experiments, we set the numbers of parameters in UPAGE and PAGE to be approximately the same. Figure 10 shows the relative computational time per generation of UPAGE and PAGE (the computational time of PAGE is normalized to 1); we see that UPAGE required only sixty percent of the time required by PAGE. Although we showed in Section 5.3.1 that UPAGE and PAGE required approximately the same number of fitness evaluations to obtain the optimal solution in the royal tree problem, UPAGE is more effective even for the royal tree problem when the actual computational time is considered.

**Figure 10.** The computational time per generation of UPAGE and PAGE (smaller is better). The time of PAGE is normalized to 1.

Table 8 summarizes the functionalities of several GP-EDAs. SG-GP employs the conventional PCFG and hence cannot estimate dependencies among nodes. Although GT-EDA, GMPE and PAGE adopt different types of grammar models, they belong to the same class in the sense that all three can take dependencies among nodes into account, which is enabled by the use of production rules specialized for their contexts. However, these methods cannot consider global contexts, and consequently they are not suitable for problems with complex distributions. In contrast, in addition to local dependencies among nodes, UPAGE can consider the global contexts of tree structures. The model of UPAGE is the most flexible among these GP-EDAs, and this flexibility is reflected in the search performance.

| Method | Estimation of interaction among nodes | Position-independent model | Consideration of global contexts |
|---|---|---|---|
| Scalar SG-GP | No | Yes | No |
| Vectorial SG-GP | Partially | No | No |
| GT-EDA | Yes | No | No |
| GMPE | Yes | Yes | No |
| PAGE | Yes | Yes | No |
| UPAGE | Yes | Yes | Yes |

**Table 8.** Classification of GP-EDAs and their capabilities.

We have shown the effectiveness of PAGE and UPAGE with benchmark problems that do not have intron structures. However, in real-world applications, problems generally include intron structures, which make the model and parameter inference much more difficult. For such problems, we consider that intron removal algorithms ([13, 30]) are effective, and the application of such algorithms to GP-EDAs is left as a topic of future study.

In the present implementation of UPAGE, we had to set the mixture size *μ* and the annotation size *h* in advance because UPAGE employs the EM algorithm. However, it is desirable to estimate *μ* and *h*, as well as *β*, *π* and *ζ*, during the search. In the case of PAGE, we proposed PAGE-VB in Ref. ([12]), which adopted variational Bayes (VB) to estimate the annotation size *h*. In a similar fashion, it is possible to apply VB to UPAGE to enable the inference of *μ* and *h*.
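Pending such a VB extension, a simpler stand-in for choosing these sizes — not the chapter's proposal — is a grid search with an information criterion. The sketch below assumes a hypothetical `fit_upage` routine that EM-trains a model for a given (*μ*, *h*) pair and returns its log-likelihood and free-parameter count; both the routine and the criterion are our illustration only.

```python
import math

# Hypothetical model selection for (mu, h) by BIC grid search. fit_upage is an
# assumed routine, not part of the chapter: it should EM-train UPAGE on the
# selected population and return (log_likelihood, number_of_free_parameters).
def select_by_bic(population, candidates, fit_upage):
    best = None
    n = len(population)
    for mu, h in candidates:
        log_l, k = fit_upage(population, mu, h)
        bic = -2.0 * log_l + k * math.log(n)  # smaller BIC is better
        if best is None or bic < best[0]:
            best = (bic, mu, h)
    return best  # (bic, mu, h) of the best candidate
```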

**Appendix B: Derivation of a parameter update formula for UPAGE**

We here explain the details of the parameter update formulas for UPAGE (see Section 4.1). By separating $Q(\beta, \pi, \zeta \mid \bar{\beta}, \bar{\pi}, \bar{\zeta})$ into the terms containing $\beta$, $\pi$ and $\zeta$ (the barred symbols denote the parameter values of the previous iteration), the update formulas for $\beta$, $\pi$ and $\zeta$ can be calculated separately.

We here derive the formula for $\beta$. Maximization of $Q(\beta, \pi, \zeta \mid \bar{\beta}, \bar{\pi}, \bar{\zeta})$ under the constraints $\sum_{\alpha} \beta_k(S[x] \to \alpha) = 1$ can be performed by the method of Lagrange multipliers:

$$
\mathcal{L} = Q(\beta, \pi, \zeta \mid \bar{\beta}, \bar{\pi}, \bar{\zeta}) + \sum_{k,x} \xi_{k,x} \left( 1 - \sum_{\alpha} \beta_k(S[x] \to \alpha) \right),
$$

with

$$
\frac{\partial \mathcal{L}}{\partial \beta_k(S[x] \to \alpha)} = 0, \tag{27}
$$

where $\xi_{k,x}$ denote the Lagrange multipliers. By calculating Equation 27, we obtain the following update formula:

$$
\beta_k(S[x] \to g\, S[y] \cdots S[y]) \propto \sum_{i=1}^{N} c_k(S[x] \to g\, S[y] \cdots S[y]; T_i). \tag{28}
$$

By differentiating the likelihood of the complete data (Equation 18) with respect to $\beta_k(S[x] \to g\, S[y] \cdots S[y])$, using

$$
\frac{\partial P(T_i, X_i, Z_i; \beta, \pi, \zeta)}{\partial \beta_k(S[x] \to g\, S[y] \cdots S[y])} = \zeta_k\, z_i^k\, \frac{\partial P(T_i, X_i; \beta_k, \pi_k)}{\partial \beta_k(S[x] \to g\, S[y] \cdots S[y])},
$$

the quantity $c_k(S[x] \to g\, S[y] \cdots S[y]; T_i)$ in Equation 28 is given by

$$
c_k(S[x] \to g\, S[y] \cdots S[y]; T_i) = \sum_{X_i} \sum_{Z_i} \left\{ P(X_i, Z_i \mid T_i; \bar{\beta}, \bar{\pi}, \bar{\zeta})\, z_i^k \times c(S[x] \to g\, S[y] \cdots S[y]; T_i, X_i) \right\}. \tag{29}
$$

Because Equation 29 includes a summation over $X_i$, its direct calculation is intractable due to the exponential increase of the computational cost. Consequently, we use forward–backward probabilities, which yield

$$
c_k(S[x] \to g\, S[y] \cdots S[y]; T_i) = \frac{\bar{\zeta}_k\, \bar{\beta}_k(S[x] \to g\, S[y] \cdots S[y])}{P(T_i; \bar{\beta}, \bar{\pi}, \bar{\zeta})} \sum_{\ell \in \operatorname{cover}(g, T_i)} f_{T_i}^{\ell}(x; \bar{\beta}_k, \bar{\pi}_k) \prod_{j \in \operatorname{ch}(\ell, T_i)} b_{T_i}^{j}(y; \bar{\beta}_k, \bar{\pi}_k),
$$

where $f_{T_i}^{\ell}$ and $b_{T_i}^{j}$ denote the forward and backward probabilities, $\operatorname{cover}(g, T_i)$ is the set of positions at which the symbol $g$ occurs in $T_i$, and $\operatorname{ch}(\ell, T_i)$ is the set of children of position $\ell$ in $T_i$.

By this procedure, the update formula for $\beta$ is expressed by Equation 21. The update formula for $\pi$ is calculated in a similar way (and is much easier). The update formula for $\zeta$ is derived in the same manner, with a Lagrange multiplier for the constraint $\sum_k \zeta_k = 1$.
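The mixture-level part of this machinery follows the standard EM pattern for finite mixtures, which may be easier to see outside the grammar notation. In the sketch below (our notation; dummy values stand in for the component likelihoods $P(T_i; \beta_k, \pi_k)$), the E-step computes the responsibilities $E[z_i^k]$ and the M-step updates $\zeta$ by averaging them.

```python
import numpy as np

# Generic EM update for the mixture weights zeta (the pattern behind the zeta
# update, not a full UPAGE implementation). lik[i, k] stands in for the
# component likelihood P(T_i; beta_k, pi_k) of individual i under mixture k.
rng = np.random.default_rng(0)
N, mu = 6, 2
lik = rng.random((N, mu))        # placeholder component likelihoods
zeta = np.full(mu, 1.0 / mu)     # uniform initial mixture weights

for _ in range(20):
    # E-step: responsibilities E[z_i^k] = zeta_k lik[i,k] / sum_m zeta_m lik[i,m]
    resp = zeta * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: zeta_k becomes the average responsibility, which automatically
    # satisfies the simplex constraint enforced above by Lagrange multipliers.
    zeta = resp.mean(axis=0)

print(zeta)  # converged mixture weights for the dummy likelihoods
```

In UPAGE itself the component likelihoods are computed with the forward–backward recursions above rather than supplied directly, but the $\zeta$ update is exactly this averaging.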

