### **5. Treed generalized linear models**

Tree-based methods such as CART, QUEST, C4.5, and CHAID fit a constant model in each node of the tree; as a result, a large tree is generated, and such a tree is hard to interpret. Treed models, unlike conventional tree models, partition the data into subsets and then fit a parametric model such as linear regression, Poisson regression, or logistic regression in each subset, instead of using a constant model (a mean or a proportion) for prediction. Treed models therefore generate smaller trees than tree models. Treed models can also be a good alternative to traditional parametric models such as GLMs when a single parametric model cannot capture the relationship between the outcome variable and the predictor variables across the whole dataset. Several tree algorithms have been developed that fit parametric models in the terminal nodes; for these algorithms, see Refs. [71–77].
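As a concrete illustration of the difference, the sketch below fits a treed model by hand on piecewise-linear data: a single split, then an ordinary linear regression in each resulting subset instead of a constant leaf value. The split point and the data-generating process are invented for illustration (a real treed algorithm would also search for the split).

```python
import random

random.seed(0)

# Piecewise-linear data: the slope changes at x = 5, so one global
# linear model fits poorly, but a treed model (one split, then a
# linear regression in each subset) fits well.
xs = [random.uniform(0, 10) for _ in range(400)]
ys = [2.0 * x if x < 5 else 10.0 - 1.5 * (x - 5) for x in xs]
ys = [y + random.gauss(0, 0.1) for y in ys]

def ols(points):
    """Simple least-squares fit of y = a + b*x (the leaf model)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx
    return my - b * mx, b

# Treed model: partition at the (assumed known) split point, then fit
# a separate linear model in each terminal node instead of a constant.
split = 5.0
left = [(x, y) for x, y in zip(xs, ys) if x < split]
right = [(x, y) for x, y in zip(xs, ys) if x >= split]
(a_l, b_l), (a_r, b_r) = ols(left), ols(right)

def predict(x):
    a, b = (a_l, b_l) if x < split else (a_r, b_r)
    return a + b * x

rmse = (sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5
print(f"treed-model RMSE: {rmse:.3f}")  # close to the noise sd of 0.1
```

A constant-leaf tree would need many splits to approximate the two slopes; the treed model recovers them with one split, which is why treed models tend to be smaller.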

### **6. Bayesian classification and regression trees**

*Enhanced Expert Systems*

The classic CART algorithm was developed by Breiman et al. in 1984, and it is one of the best-known classic classification and regression tree models for data mining. However, this algorithm suffers from problems such as greediness, instability, and bias in split-rule selection. CART generates a tree by using a greedy search algorithm, which has several disadvantages: it limits the exploration of the tree space, makes future splits dependent on previous splits, generates optimistic error rates, and cannot find a global optimum [78]. CART has an instability problem because resampling or drawing bootstrap samples from the dataset may generate trees with different splits [79]. The splitting method of the CART model is biased toward predictor variables with many distinct values and with more missing values [80, 81].

Several tree models have been suggested to solve these problems. Ensembles of trees such as Random Forests [82], Bagging [83], Boosting [84], MultiBoost [85], and LogitBoost [86] address the instability problem; tree algorithms such as CRUISE [33, 35], QUEST [32], GUIDE [34], CTREE [49], and LOTUS [71] address the bias in split-rule selection; and Bayesian tree approaches and the evtree algorithm [78] address the greediness of CART. Bayesian tree approaches can also quantify uncertainty, and they explore the tree space more thoroughly than classic approaches.

Several Bayesian approaches have been proposed for tree-based methods [87–98]. As with classic tree approaches, a model is called a Bayesian classification tree if the outcome variable is qualitative and a Bayesian regression tree if the outcome variable is quantitative. Prediction in these Bayesian approaches likewise mirrors the classic approaches: Bayesian classification trees fit a constant model, the proportion of the outcome variable, in the terminal nodes, and Bayesian regression trees fit a constant model, the mean of the outcome variable, in the terminal nodes.
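The constant leaf model can be made concrete. The snippet below, using invented observations that fell into a single terminal node, computes the leaf prediction for both cases: the mean for a regression tree and the class proportions (with the majority class as the predicted label) for a classification tree.

```python
from collections import Counter

# Observations that landed in one terminal node (illustrative data).
leaf_y_regression = [3.0, 2.0, 4.0, 3.0]
leaf_y_classes = ["yes", "no", "yes", "yes"]

# Regression tree: the leaf's constant model is the mean of y.
mean_prediction = sum(leaf_y_regression) / len(leaf_y_regression)

# Classification tree: the leaf's constant model is the class
# proportions; the predicted label is the majority class.
counts = Counter(leaf_y_classes)
proportions = {c: n / len(leaf_y_classes) for c, n in counts.items()}
predicted_label = counts.most_common(1)[0][0]

print(mean_prediction)   # 3.0
print(proportions)       # {'yes': 0.75, 'no': 0.25}
print(predicted_label)   # yes
```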

Classic tree approaches use only the observations for data analysis, whereas Bayesian approaches combine prior information with the observations. Bayesian tree approaches define prior distributions on the components of classic tree approaches and then use either stochastic search algorithms based on Markov chain Monte Carlo (MCMC) or deterministic search algorithms to explore the tree space [87–98].

Bayesian tree approaches share a set of ingredients: a prior distribution, a posterior distribution, a data likelihood function, a marginal likelihood function, a stochastic or deterministic search algorithm for exploring the tree space, a stopping rule for the simulation algorithm (if a stochastic search algorithm is used to simulate from the posterior distribution and explore the tree space), and criteria for identifying good trees (if the model produces several trees). In this section, we review Bayesian tree approaches and summarize the results of published papers that use these Bayesian algorithms for data analysis.

#### **6.1 Buntine's Bayesian classification tree approach**

The first Bayesian approach for the classification tree model was proposed by Buntine in 1992. It offers a full Bayesian analysis of the classification tree model using a deterministic search algorithm rather than a stochastic one [87]. Like other classic tree models, this model uses a splitting function for tree growth, here derived from Bayesian statistics, with performance similar to splitting criteria such as information gain and the Gini index. Unlike traditional tree models, however, Buntine used Bayesian smoothing and averaging techniques instead of pruning the tree in order to prevent overfitting.

*Classic and Bayesian Tree-Based Methods DOI: http://dx.doi.org/10.5772/intechopen.83380*

In this Bayesian approach, prior distributions are defined on the tree space and on the data distribution in the terminal nodes of the tree (similar prior distributions are used for the data distribution in the terminal nodes, unlike the prior distributions considered on the tree space). Buntine showed the superior performance of the Bayesian approach in comparison to classic tree algorithms such as the CART model of Breiman et al. and the C4 model of Quinlan et al. [99] on several datasets [87]. This Bayesian approach may be obtained from: http://ksvanhorn.com/bayes/free-bayes-software.html.

#### **6.2 CGM's Bayesian CART approach**

CGM (Chipman, George, and McCulloch) proposed a Bayesian approach for the CART model in 1998 by defining prior distributions on the two components of the CART model (Θ, T): a binary tree T with b terminal nodes and a parameter set Θ = (θ1, θ2, …, θb) [89, 91–93]. Indeed, they define prior distributions on the tree structure and on the parameters in the terminal nodes. In this approach, the joint prior distribution of the components is specified according to Bayes' theorem:

p(Θ, T) = p(Θ | T) p(T) (1)

where p(T) and p(Θ | T) denote the prior distribution for the tree and for the parameters in the terminal nodes given the tree, respectively. In this approach, the same tree-generating stochastic process is used for p(T) in both the classification and the regression tree models [89], and this recursive stochastic process for tree growth consists of the following steps:

1. Start from the tree T that contains only a root node (a terminal node η).
2. Split the terminal node η with probability PSPLIT = α(1 + dη)<sup>−β</sup>, where dη is the depth of node η, the parameter α is the base probability of growing the tree by splitting a current node, and the parameter β determines the rate at which the propensity to split decreases as the tree gets larger. The parameters α and β control the shape and size of the tree and provide a penalty to avoid overfitting the tree.
3. If the terminal node η splits, then a splitting rule ρ is assigned to this node according to the distribution PRULE (a discrete uniform distribution is used both for selecting the predictor variable that splits the terminal node η and for the splitting threshold of the selected variable).
4. Let T be the tree newly created in step 3, and run steps 2 and 3 on this tree with η equal to the newly created child nodes.

In this approach, the posterior distribution function p(T | X, y) is computed by combining the marginal likelihood function p(y | X, T) and the tree prior p(T) as follows:

p(T | X, y) ∝ p(y | X, T) p(T) (2)

p(y | X, T) = ∫ p(y | X, Θ, T) p(Θ | T) dΘ (3)

where p(y | X, Θ, T) in Eq. (3) denotes the data likelihood function.

A stochastic search algorithm is used for finding good models and simulating from relation (2) by means of an MCMC algorithm such as the Metropolis-Hastings algorithm. This Metropolis-Hastings algorithm simulates a Markov chain sequence of trees T<sup>0</sup>, T<sup>1</sup>, T<sup>2</sup>, …: it starts with an initial tree T<sup>0</sup> and then iteratively simulates the transition from T<sup>i</sup> to T<sup>i+1</sup> in two steps.
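The tree-generating prior used for p(T) can be simulated directly. The sketch below draws trees by splitting each terminal node with probability PSPLIT = α(1 + d)<sup>−β</sup> and picking rules uniformly; the values α = 0.95, the β settings, and the predictor/threshold lists are invented for illustration (in CGM's actual PRULE, thresholds are drawn from the values observed at the node).

```python
import random

random.seed(1)

PREDICTORS = ["x1", "x2", "x3"]   # illustrative predictor names
THRESHOLDS = [0.25, 0.5, 0.75]    # illustrative split points

def p_split(depth, alpha=0.95, beta=1.0):
    """PSPLIT = alpha * (1 + depth)^(-beta): the prior probability
    that a terminal node at the given depth is split."""
    return alpha * (1 + depth) ** (-beta)

def grow(depth=0, alpha=0.95, beta=1.0):
    """Recursively grow one tree from the tree-generating prior.

    Step 1: start from a root (terminal) node.
    Step 2: split it with probability PSPLIT.
    Step 3: if it splits, draw a rule from the discrete-uniform PRULE
            and repeat steps 2 and 3 on the two child nodes.
    """
    if random.random() < p_split(depth, alpha, beta):
        rule = (random.choice(PREDICTORS), random.choice(THRESHOLDS))
        return {"rule": rule,
                "left": grow(depth + 1, alpha, beta),
                "right": grow(depth + 1, alpha, beta)}
    return "leaf"

def n_leaves(tree):
    if tree == "leaf":
        return 1
    return n_leaves(tree["left"]) + n_leaves(tree["right"])

# Larger beta makes the propensity to split decay faster with depth,
# penalizing large trees: compare average tree sizes under the prior.
small = sum(n_leaves(grow(beta=2.0)) for _ in range(1000)) / 1000
large = sum(n_leaves(grow(beta=0.5)) for _ in range(1000)) / 1000
print(f"avg leaves: beta=2.0 -> {small:.1f}, beta=0.5 -> {large:.1f}")
```

Running such a simulation is a practical way to check whether chosen α and β values encode the tree sizes one actually considers plausible before any data are seen.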
