**2. Related work**


PCFG to model tree structures. PCFG-based methods are more advantageous than PPT-based methods in the sense that PCFG-based methods can estimate position-independent building blocks.

The conventional PCFG adopts the context freedom assumption that the probabilities of production rules do not depend on their contexts, namely parent or sibling nodes. Although the context freedom assumption makes parameter estimation easier, it cannot in principle consider interactions among nodes. In general, programs and functions have dependencies among nodes, and as a consequence, the conventional PCFG is not suitable as a baseline model of GP-EDAs. In the field of natural language processing (NLP), many approaches have been proposed in order to weaken the context freedom assumption of PCFG. For instance, vertical Markovization annotates symbols with their ancestor symbols and has been adopted as a baseline grammar of vectorial stochastic grammar based GP (vectorial SG-GP) and of grammar transformation in an EDA (GT-EDA) ([4]) (see Section 2). Matsuzaki *et al.* ([17]) proposed the PCFG with latent annotations (PCFG-LA), which assumes that all annotations are latent and estimates them from learning data. Because latent annotation models are much richer than fixed annotation models, GP-EDAs using PCFG-LA are expected to grasp the interactions among nodes more precisely than GP-EDAs based on fixed annotations. In GA-EDAs, EDAs with Bayesian networks or Markov networks exhibited better search performance than simpler models such as univariate models; in a similar way, GP-EDAs using PCFG-LA are generally expected to be more powerful than GP-EDAs using PCFG with heuristic annotations because PCFG-LA is a much more flexible model. We have proposed a GP-EDA named programming with annotated grammar estimation (PAGE), which adopts PCFG-LA as a baseline grammar ([9, 12]). In Section 4 of the present chapter, we explain the details of PAGE, including the parameter update formula.

As explained above, EDAs model promising solutions with parametric distributions. In multimodal problems, it is not sufficient to express promising solutions with only one model, because the dependencies differ among optimal solutions in general. For tree structures, this problem arises even in unimodal optimization problems due to the diversity of tree expression. These problems can be tackled by considering a global context for each individual, which represents which optimum (e.g. one of multiple solutions in a multimodal problem) it derives from. Consequently, we have proposed the PCFG-LA mixture model (PCFG-LAMM), which extends PCFG-LA into a mixture model, and a new GP-EDA named unsupervised PAGE (UPAGE), which employs PCFG-LAMM as a baseline grammar ([11]). By using PCFG-LAMM, not only local dependencies but also the global contexts behind individuals can be taken into account.

The main objectives of the proposed algorithms may be summarized as follows:

1. PAGE employs PCFG-LA to consider local dependencies among nodes.

2. UPAGE employs PCFG-LAMM to take into account global contexts behind individuals in addition to the local dependencies.

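The contrast between fixed and latent annotations can be made concrete with a small sketch. The toy grammar below has a single nonterminal S whose latent annotation h ∈ {0, 1} conditions rule and emission probabilities; the probability of an observed (unannotated) tree is obtained by marginalizing h with inside probabilities. All symbols and probability values here are invented for illustration (they are not taken from this chapter), and normalization across the branch/terminal choice is omitted for brevity:

```python
import itertools
import random

H = [0, 1]  # latent annotation values (illustrative)

# P(S[h] -> S[h1] S[h2]): annotated rule probabilities, normalized per parent h.
random.seed(0)
rule_prob = {}
for h in H:
    ws = [random.random() for _ in range(len(H) ** 2)]
    z = sum(ws)
    for w, (h1, h2) in zip(ws, itertools.product(H, H)):
        rule_prob[("S", h, h1, h2)] = w / z

# P(terminal | S[h]): annotated emission probabilities (illustrative values).
term_prob = {("S", 0): {"x": 0.9, "y": 0.1}, ("S", 1): {"x": 0.2, "y": 0.8}}

def inside(tree, h):
    """Inside probability of a subtree given its root is S annotated with h."""
    if isinstance(tree, str):                      # terminal leaf
        return term_prob[("S", h)][tree]
    left, right = tree                             # binary branch S -> S S
    return sum(rule_prob[("S", h, h1, h2)]
               * inside(left, h1) * inside(right, h2)
               for h1 in H for h2 in H)

root_prior = {0: 0.5, 1: 0.5}
tree = (("x", "y"), "x")                           # S -> (S -> x y) x
p = sum(root_prior[h] * inside(tree, h) for h in H)  # marginalize annotations
print(f"marginal tree probability: {p:.4f}")
```

Vertical Markovization would instead fix each annotation to the parent symbol; PCFG-LA keeps h hidden and estimates the annotated probabilities from data, which is what the inside (and backward) quantities above support.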
This chapter is structured as follows. Following a section on related work, we briefly introduce the basics of PCFG. We explain PAGE in Section 4, where details of PCFG-LA, forward–backward probabilities and a parameter update formula are provided. In Section 5, we propose UPAGE, which is a mixture model extension of PAGE; we describe PCFG-LAMM and also derive a parameter update formula for UPAGE. We then compare the performance of the proposed methods with that of existing techniques.

Many GP-EDAs have been proposed, and these methods can be broadly classified into two groups: (i) PPT-based methods and (ii) grammar-model-based methods.

Methods of type (i) employ techniques developed in GA-EDAs. This type of algorithm converts tree structures into the fixed-length chromosomes used in GA and applies the probabilistic models of GA-EDAs. Probabilistic incremental program evolution (PIPE) ([25]) is a univariate model, which can be considered a combination of population-based incremental learning (PBIL) ([3]) and GP. Because tree structures have explicit edges between parent and child nodes, estimation of distribution programming (EDP) ([37, 38]) considers the parent–child relationships in tree structures. Extended compact GP (ECGP) ([26]) is an extension of the extended compact GA (ECGA) ([7]) to GP and can take into account the interactions among nodes; ECGP infers the grouping of marginal distributions using the minimum description length (MDL) principle. BOA programming (BOAP) ([15]) uses Bayesian networks to grasp dependencies among nodes and is a GP extension of the Bayesian optimization algorithm (BOA) ([20]). Program optimization with linkage estimation (POLE) ([8, 10]) estimates the interactions among nodes by estimating a Bayesian network. POLE uses a special chromosome called an *expanded parse tree* ([36]) to convert GP programs into linear arrays, and several extensions of POLE have been proposed ([27, 39]). Meta-optimizing semantic evolutionary search (MOSES) ([16]) extends the hierarchical Bayesian optimization algorithm (hBOA) ([19]) to program evolution.
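As an illustration of the PPT-based family, the following minimal sketch mimics PIPE's core idea under simplifying assumptions: each tree position holds an independent distribution over symbols, and a PBIL-style update shifts that distribution toward the symbols used by the best sampled individual. The function set, learning rate, and update rule are assumptions made for the example, not the published algorithm:

```python
import random

FUNCS = {"+": 2, "*": 2}          # function symbols with arities (assumed set)
TERMS = ["x", "1"]
SYMBOLS = list(FUNCS) + TERMS
MAX_DEPTH = 3
LR = 0.1                          # PBIL-style learning rate (assumed value)

def new_node():
    """Uniform symbol distribution for a fresh prototype-tree position."""
    return {s: 1.0 / len(SYMBOLS) for s in SYMBOLS}

def sample(ppt, pos=(), depth=0):
    """Sample an expression tree top-down from the probabilistic prototype tree."""
    node = ppt.setdefault(pos, new_node())
    if depth >= MAX_DEPTH:                         # force a terminal at max depth
        sym = random.choice(TERMS)
    else:
        sym = random.choices(list(node), weights=node.values())[0]
    if sym in FUNCS:
        kids = [sample(ppt, pos + (i,), depth + 1) for i in range(FUNCS[sym])]
        return (sym, kids)
    return sym

def used_positions(tree, pos=()):
    """Yield (position, symbol) pairs actually used by an individual."""
    if isinstance(tree, str):
        yield pos, tree
    else:
        sym, kids = tree
        yield pos, sym
        for i, k in enumerate(kids):
            yield from used_positions(k, pos + (i,))

def update(ppt, best):
    """Shift each used position's distribution toward the best individual."""
    for pos, sym in used_positions(best):
        node = ppt[pos]
        for s in node:
            node[s] = (1 - LR) * node[s] + (LR if s == sym else 0.0)

random.seed(1)
ppt = {}
best = sample(ppt)   # in a real EDA: sample a population, select the best
update(ppt, best)
```

EDP, in contrast, would condition each position's distribution on the symbol chosen at its parent position rather than keeping the positions independent.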

Methods of type (ii) are based on Whigham's grammar-guided genetic programming (GGGP) ([33]). GGGP expresses individuals using derivation trees (see Section 3), in contrast to conventional GP. Whigham indicated the connection between PCFG and GP ([35]), and indeed, the probability table learning in GGGP can be viewed as an EDA with local search. Stochastic grammar based GP (SG-GP) ([23]) applied the concept of PBIL to GGGP. The authors of SG-GP also proposed vectorial SG-GP, which considers depth in its grammar (the simple variant is then called scalar SG-GP). Program evolution with explicit learning (PEEL) ([28]) takes into account the positions (arguments) and depths of symbols. Unlike SG-GP and PEEL, which employ predefined grammars, grammar model based program evolution (GMPE) ([29]) learns not only the parameters but also the grammar itself from promising solutions. GMPE starts from specialized production rules that exclusively generate the learning data and merges non-terminals to yield more general production rules using the MDL principle. Grammar transformation in an EDA (GT-EDA) ([4]) extracts good subroutines using the MDL principle. GT-EDA starts from general rules and expands non-terminals to yield more specialized production rules; although the concept of GT-EDA is similar to that of GMPE, the learning procedure is the opposite [specialized to general (GMPE) versus general to specialized (GT-EDA)]. Tanev proposed GP based on a probabilistic context-sensitive grammar ([31, 32]), which uses sibling nodes and the parent node as context information and expresses production rule probabilities as conditional probabilities given this context. Bayesian automatic programming (BAP) ([24]) uses a Bayesian network to consider relations among production rules in PCFG.
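The grammar-based family can be sketched in the same spirit. The code below mimics the core idea of scalar SG-GP under simplifying assumptions: a fixed CFG carries one probability per production rule, each derivation records which rules it used, and a PBIL-style update reinforces the rules appearing in promising derivations. The grammar, depth bound, and learning rate are invented for illustration:

```python
import random

GRAMMAR = {  # nonterminal -> list of right-hand sides (assumed toy grammar)
    "EXP": [("EXP", "+", "EXP"), ("EXP", "*", "EXP"), ("x",), ("1",)],
}
LR = 0.05    # PBIL-style learning rate (assumed value)

def init_probs():
    """One uniform probability vector per nonterminal's rule set."""
    return {nt: [1.0 / len(rhss)] * len(rhss) for nt, rhss in GRAMMAR.items()}

def derive(probs, sym="EXP", depth=0, used=None, max_depth=4):
    """Sample a sentence, recording which rule index each expansion used."""
    if used is None:
        used = []
    if sym not in GRAMMAR:            # terminal symbol: emit as-is
        return sym, used
    rhss = GRAMMAR[sym]
    if depth >= max_depth:            # force a terminal rule near the depth bound
        idx = random.choice([i for i, r in enumerate(rhss) if len(r) == 1])
    else:
        idx = random.choices(range(len(rhss)), weights=probs[sym])[0]
    used.append((sym, idx))
    out = "".join(derive(probs, s, depth + 1, used)[0] for s in rhss[idx])
    return out, used

def reinforce(probs, used):
    """Shift rule probabilities toward those used by a promising derivation."""
    for sym, idx in used:
        p = probs[sym]
        for i in range(len(p)):
            p[i] = (1 - LR) * p[i] + (LR if i == idx else 0.0)

random.seed(2)
probs = init_probs()
sentence, used = derive(probs)   # in a real EDA: derive many, select the best
reinforce(probs, used)
```

Vectorial SG-GP would keep a separate probability vector per depth level; PEEL additionally distinguishes argument positions.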


There are other GP-EDAs that do not belong to either of the groups presented above. *N*-gram GP ([21]) is based on linear GP ([18]), which evolves programs resembling the assembly language of a register-based CPU, and learns sub-sequences of instructions using an *N*-gram model. The *N*-gram model, which is very popular in NLP, considers *N* consecutive symbols when calculating the probabilities of symbols. AntTAG ([1]) also shares concepts with GP-EDAs, although it does not employ a statistical inference method for probability learning; instead, it employs ant colony optimization (ACO), where the pheromone matrix can be interpreted as a probability distribution.
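The *N*-gram idea is straightforward to sketch: estimate conditional symbol probabilities from the instruction sequences of promising linear programs, then sample new sequences from them. The instruction set and training programs below are invented for the example (a bigram model, i.e. N = 2):

```python
import random
from collections import Counter, defaultdict

# Promising linear programs (illustrative instruction sequences).
programs = [
    ["LOAD", "ADD", "STORE", "END"],
    ["LOAD", "MUL", "ADD", "STORE", "END"],
    ["LOAD", "ADD", "ADD", "STORE", "END"],
]

# Estimate P(next | prev) by relative frequency, with a START pseudo-symbol.
counts = defaultdict(Counter)
for prog in programs:
    prev = "START"
    for ins in prog:
        counts[prev][ins] += 1
        prev = ins

def sample_program(max_len=10):
    """Sample a new instruction sequence from the bigram model."""
    prev, out = "START", []
    for _ in range(max_len):
        nxt = random.choices(list(counts[prev]), weights=counts[prev].values())[0]
        if nxt == "END":
            break
        out.append(nxt)
        prev = nxt
    return out

random.seed(3)
prog = sample_program()
print(prog)
```

A trigram or higher-order model would condition on the previous N − 1 instructions instead of one, at the cost of sparser counts.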
