#### **6.2 CGM's Bayesian CART approach**

CGM (Chipman, George, and McCulloch) proposed a Bayesian approach to the CART model in 1998 by defining prior distributions on the two components $(\Theta, T)$ of the CART model: a binary tree $T$ with $b$ terminal nodes and a parameter set $\Theta = (\theta_1, \theta_2, \ldots, \theta_b)$ [89, 91–93]. Indeed, they define prior distributions on the tree structure and on the parameters in the terminal nodes. In this approach, the joint prior distribution of the components is specified as:

$$p(\Theta, T) = p(\Theta|T)\, p(T) \tag{1}$$

where $p(T)$ and $p(\Theta|T)$ denote the prior distribution of the tree and of the parameters in the terminal nodes given the tree, respectively. In this approach, the same tree-generating stochastic process is used for $p(T)$ in both classification and regression tree models [89]. This recursive stochastic process for tree growth includes the following steps (a minimal simulation of the process is sketched after the list):

1. Begin with the trivial tree consisting of a single root (terminal) node $\eta$.

2. Split the terminal node $\eta$ with probability $p_{SPLIT}(\eta, T) = \alpha(1 + d_\eta)^{-\beta}$, where $d_\eta$ is the depth of the node and $\alpha$ and $\beta$ are prior parameters controlling the size and shape of the tree.

3. If the node splits, assign it a splitting rule drawn from the rule distribution $p_{RULE}(\rho|\eta, T)$ and create the left and right children nodes.

4. Apply steps 2 and 3 recursively to each newly created terminal node.
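The following is a minimal sketch of this generative prior, assuming the split probability $p_{SPLIT}(\eta, T) = \alpha(1 + d_\eta)^{-\beta}$ with illustrative values $\alpha = 0.95$ and $\beta = 1$; the predictor names and the uniform threshold are hypothetical stand-ins for a concrete rule prior.

```python
import random

def p_split(depth, alpha=0.95, beta=1.0):
    """CGM split probability for a node at the given depth: alpha * (1 + depth)^(-beta)."""
    return alpha * (1.0 + depth) ** (-beta)

def grow_tree(depth=0, alpha=0.95, beta=1.0, predictors=("x1", "x2", "x3")):
    """Draw one tree from the prior: split with probability p_split(depth);
    on a split, pick a splitting variable uniformly and recurse on both children."""
    if random.random() < p_split(depth, alpha, beta):
        return {"split_var": random.choice(predictors),  # uniform rule prior (assumed)
                "threshold": random.random(),            # hypothetical uniform split value
                "left": grow_tree(depth + 1, alpha, beta, predictors),
                "right": grow_tree(depth + 1, alpha, beta, predictors)}
    return {"leaf": True}                                # terminal node

random.seed(1)
print(grow_tree())   # a small nested dict describing one sampled tree
```

Smaller $\alpha$ or larger $\beta$ place more prior mass on small trees, which is how this prior discourages overly deep trees.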
In this approach, the posterior distribution $p(T|X, y)$ is computed by combining the marginal likelihood $p(y|X, T)$ and the tree prior $p(T)$ as follows:

$$p(T|X, y) \propto p(y|X, T)\, p(T) \tag{2}$$

$$p(y|X, T) = \int p(y|X, \Theta, T)\, p(\Theta|T)\, d\Theta \tag{3}$$

$p(y|X, \Theta, T)$ in Eq. (3) denotes the data likelihood function.

A stochastic search algorithm is used to find good models, simulating from relation (2) with an MCMC algorithm such as the Metropolis-Hastings algorithm. The Metropolis-Hastings algorithm simulates a Markov chain of trees $T^0, T^1, T^2, \ldots$; it starts with an initial tree $T^0$ and then iteratively simulates the transition from $T^i$ to $T^{i+1}$ in the two steps below:

1. Generate a candidate value $T^*$ with probability distribution $q(T^i, T^*)$.

2. Set $T^{i+1} = T^*$ with the probability below:

$$\alpha(T^i, T^*) = \min\left\{\frac{q(T^*, T^i)\, p(y|X, T^*)\, p(T^*)}{q(T^i, T^*)\, p(y|X, T^i)\, p(T^i)},\ 1\right\} \tag{4}$$

Otherwise, set $T^{i+1} = T^i$; a log-space sketch of this acceptance step follows.
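A minimal sketch of the accept/reject decision of Eq. (4), computed in log space for numerical stability; the caller is assumed to supply the log proposal ratio, log marginal likelihoods, and log tree priors, and the numbers in the example call are hypothetical.

```python
import math
import random

def mh_accept(log_q_ratio, log_ml_star, log_prior_star, log_ml_cur, log_prior_cur):
    """Eq. (4): accept T* with probability min{1, exp(log_alpha)}, where
    log_q_ratio = log q(T*, T^i) - log q(T^i, T*),
    log_ml_*    = log marginal likelihoods log p(y|X, .),
    log_prior_* = log tree priors log p(.)."""
    log_alpha = (log_q_ratio
                 + (log_ml_star + log_prior_star)
                 - (log_ml_cur + log_prior_cur))
    return math.log(random.random()) < min(log_alpha, 0.0)

random.seed(0)
# Hypothetical log values, only to show the call shape:
print(mh_accept(log_q_ratio=0.0, log_ml_star=-10.2, log_prior_star=-3.1,
                log_ml_cur=-11.0, log_prior_cur=-3.0))
```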

In this simulation algorithm, $q(T, T^*)$ generates $T^*$ from $T$ by randomly selecting among four moves: the GROW, PRUNE, CHANGE, and SWAP steps. The simulation algorithm is run with multiple restarts instead of as a single long chain, both to improve convergence toward the posterior distribution, to avoid wasting long stretches of time once the chain settles in a single region of trees with high posterior probability, and to generate a wide variety of different trees. Accordingly, the stopping criterion of the simulation algorithm is based on the chain appearing to be trapped near a local posterior mode.

Unlike the classic CART model, this Bayesian approach does not generate a single tree; good classification trees are therefore selected based on criteria such as having the lowest misclassification rate and the largest marginal likelihood, and good regression trees are determined based on having the largest marginal likelihood and the lowest residual sum of squares. Using simulation, CGM showed that the stochastic search algorithm can find better trees than a greedy tree algorithm. They indicated that the Bayesian classification approach has a lower misclassification rate than the CART model, and they also used Bayesian model averaging to improve the prediction accuracy of Bayesian classification trees [89].

#### **6.3 DMS's Bayesian CART approach**

DMS (Denison, Mallick, and Smith) proposed a Bayesian approach for the CART model in 1998; this approach is quite similar to the Bayesian approach of CGM, with only minor differences [88]. In this approach, prior distributions are defined over the splitting node ($S$), splitting variable ($V$), splitting rule ($R$), tree size ($\mathcal{K}$), and the parameters of the data distribution in the terminal nodes ($\psi$).

In this Bayesian approach, the joint distribution of the model parameters is defined as follows, where $p(\mathcal{K})$ is the prior distribution for the tree size, $p(\theta_k|\mathcal{K})$ is the prior distribution for the parameter set $\theta_k = \{R_k, S_k, V_k, \psi_k\}$ given the tree size $\mathcal{K}$, and $p(y|\mathcal{K}, \theta_k)$ is the data likelihood function:

$$p(\mathcal{K}, \theta_k, y) = p(\mathcal{K})\, p(\theta_k|\mathcal{K})\, p(y|\mathcal{K}, \theta_k) \tag{5}$$

This Bayesian approach puts a prior distribution on the tree size to avoid overfitting the data and uses a truncated Poisson distribution with parameter $\lambda$ for $p(\mathcal{K})$, where $\lambda$ is the expected number of nodes in the tree; a weakly informative prior for the tree size is obtained by setting $\lambda$ equal to 10:

$$p(\mathcal{K}) \propto \frac{\lambda^{\mathcal{K}}}{(e^{\lambda} - 1)\,\mathcal{K}!} \tag{6}$$
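As a quick illustration of Eq. (6), the sketch below evaluates this truncated Poisson prior (the truncation removes $\mathcal{K} = 0$, and the $e^{\lambda} - 1$ factor renormalizes the remaining mass); the printed sizes are chosen only to show how $\lambda = 10$ spreads prior mass over moderate tree sizes.

```python
import math

def tree_size_prior(k, lam=10.0):
    """Eq. (6): truncated Poisson prior on tree size,
    p(K) = lam^K / ((e^lam - 1) * K!) for K = 1, 2, ..."""
    if k < 1:
        return 0.0
    return lam ** k / ((math.exp(lam) - 1.0) * math.factorial(k))

# The weakly informative choice lam = 10 used by DMS:
print([round(tree_size_prior(k), 4) for k in (1, 5, 10, 15, 20)])
# The probabilities sum to 1 over K >= 1:
print(round(sum(tree_size_prior(k) for k in range(1, 200)), 6))
```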

Also, $p(\theta_k|\mathcal{K})$ in Eq. (5) is defined as follows:

$$p(\theta_k|\mathcal{K}) = p(R_k|V_k, S_k, \mathcal{K})\, p(V_k|S_k, \mathcal{K})\, p(S_k|\mathcal{K})\, p(\psi_k|V, S, \mathcal{K}) \tag{7}$$

So, the prior for this Bayesian approach is defined as follows:

$$p(\theta_k|\mathcal{K})\, p(\mathcal{K}) = p(R_k|V_k, S_k, \mathcal{K})\, p(V_k|S_k, \mathcal{K})\, p(S_k|\mathcal{K})\, p(\psi_k|V, S, \mathcal{K})\, p(\mathcal{K}) \tag{8}$$

In this approach, Bayesian analysis of the tree size and the parameter set $\theta_k$ is as follows:

$$p(\theta_k, \mathcal{K}|y) = p(\mathcal{K}|y)\, p(\theta_k|\mathcal{K}, y) \tag{9}$$

Simulation from the above equation is done using MCMC algorithms to find good trees; specifically, a reversible jump MCMC algorithm is used to simulate from this equation [100]. This simulation algorithm is run as a single long chain with a burn-in period to explore the tree space. In this simulation algorithm, trees cannot have fewer than 5 observations in a terminal node and cannot have a size larger than 6 during the burn-in period of the simulation chain of the posterior distribution. The reversible jump MCMC algorithm used by DMS to simulate from Eq. (9) includes four steps: BIRTH (GROW), DEATH (PRUNE), VARIABLE, and SPLITTING RULE. In this simulation algorithm, the BIRTH, DEATH, VARIABLE, and RULE steps are randomly chosen with probabilities $b$, $d$, $v$, and $r$, respectively, and the algorithm is as follows:


1. Start with an initial tree.

2. Set $\mathcal{K}$ to the tree size of the present tree.

3. Generate $u \sim U[0, 1]$.

4. Go to the step type determined by $u$: if $u \le b$, go to the BIRTH step; else if $b < u \le b + d$, go to the DEATH step; else if $b + d < u \le b + d + v$, go to the VARIABLE step; else, go to the RULE step.

Then, the acceptance probability ($\alpha$) of each step that changes the tree $(\mathcal{K}, \theta)$ to the tree $(\mathcal{K}^*, \theta^*)$ is as follows, where $\mathcal{K}_{die}$ denotes the number of possible locations for a death in the current tree:

$$\text{BIRTH step: } \alpha = \min\left\{1,\ (\text{likelihood ratio}) \times \frac{\mathcal{K}_{die} + 1}{\mathcal{K}}\right\} \tag{10}$$

$$\text{DEATH step: } \alpha = \min\left\{1,\ (\text{likelihood ratio}) \times \frac{\mathcal{K}}{\mathcal{K}_{die} + 1}\right\} \tag{11}$$

$$\text{VARIABLE and RULE steps: } \alpha = \min\left\{1,\ (\text{likelihood ratio})\right\} \tag{12}$$

If $u \le \alpha$, then the proposed tree is accepted; otherwise, it is rejected (a minimal sketch of the move selection and acceptance rules follows).
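The sketch below shows one move-type selection and the acceptance rules of Eqs. (10)–(12); the equal move probabilities $b = d = v = 0.25$ are illustrative assumptions, and the likelihood ratio is passed in as a precomputed number rather than computed from data.

```python
import random

def step_type(u, b=0.25, d=0.25, v=0.25):
    """Select the move from u ~ U[0, 1] with probabilities b, d, v
    (and 1 - b - d - v for the rule move)."""
    if u <= b:
        return "BIRTH"
    elif u <= b + d:
        return "DEATH"
    elif u <= b + d + v:
        return "VARIABLE"
    return "RULE"

def acceptance_prob(step, lik_ratio, k, k_die):
    """Eqs. (10)-(12): k is the current tree size and k_die the number of
    possible locations for a death in the current tree."""
    if step == "BIRTH":
        alpha = lik_ratio * (k_die + 1) / k      # Eq. (10)
    elif step == "DEATH":
        alpha = lik_ratio * k / (k_die + 1)      # Eq. (11)
    else:
        alpha = lik_ratio                        # Eq. (12), VARIABLE and RULE
    return min(1.0, alpha)

random.seed(2)
u = random.random()
print(step_type(u), acceptance_prob("BIRTH", lik_ratio=1.5, k=5, k_die=2))
```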

The stopping criterion of the above simulation algorithm is based on the stability of the posterior distribution, which can be assessed by plotting the iterations of the chain against the sampled parameter values. This Bayesian approach, unlike CART, does not produce a single tree from its stochastic search algorithm. Thus, good classification trees are selected based on criteria such as misclassification rate, deviance ($-2\log p(y|\mathcal{K}, \theta_k)$), and posterior probability: good classification trees have the lowest misclassification rate and deviance and the largest posterior probability. Also, good regression trees have the largest posterior probability and the lowest residual sum of squares. DMS indicated that the Bayesian approach provides richer output and superior performance compared with the classic CART model [88].

#### **6.4 CGM's hierarchical priors for Bayesian regression tree shrinkage approach**

In 2000, CGM proposed a Bayesian approach for regression trees with a mean-shift model, based on the computational strategy of CGM's 1998 Bayesian approach. Unlike the 1998 approach, it can assume dependence among the parameters in the terminal nodes: hierarchical priors are used for these parameters, and shrunk trees are therefore generated [90]. Hierarchical priors have several advantages: shrinkage operates within the stochastic search algorithm, unlike previously proposed tree-shrinkage methods (which apply shrinkage only after searching for the tree), and a larger tree can be fit to the dataset without overfitting, improving predictions. Using simulation, CGM showed the superior performance of this new Bayesian approach for regression trees with a mean-shift model in comparison to the 1998 Bayesian approach of CGM, the CART model, and the tree shrinkage methods of Hastie and Pregibon [90, 101].

#### **6.5 WTW's Bayesian CART approach**

WTW (Wu, Tjelmeland, and West) proposed a Bayesian approach for the CART model in 2007, based on the computational strategy of the Bayesian approach of CGM (1998) [95]. In this approach, prior distributions are defined on the tree, the splitting variables, the splitting thresholds, and the parameters in the terminal nodes. Like the approaches of CGM [89, 90, 92, 93], this Bayesian approach simulates from the posterior distribution using the Metropolis-Hastings algorithm. The steps used in WTW's simulation algorithm include the GROW and PRUNE steps, the CHANGE step, the SWAP step, and the RESTRUCTURE (RADICAL) step (the first three move types are similar to those in the simulation algorithms of CGM's Bayesian approaches). The RESTRUCTURE step makes large changes in the structure of the tree while leaving the tree size unchanged. Adding this step to the simulation algorithm of the posterior distribution has several advantages: it improves the convergence of the MCMC algorithm, it eliminates the need for restarts of the simulation algorithm (unlike the Bayesian approaches of CGM), and it allows large changes in the structure of the tree without changing the tree size.

In this approach, convergence diagnostics of the simulation algorithm are based on plots of the iteration number against the log posterior distribution, the log marginal likelihood, the number of terminal nodes, and the number of times a particular predictor variable appears as a splitting variable in the tree. WTW showed the superior performance of the Bayesian approach in comparison to the CART model, and the Bayesian approach had a lower misclassification rate than the CART model [95].
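A minimal sketch of such trace-plot diagnostics, assuming the chain's sampled quantities have already been collected into plain Python lists; the toy traces here are random stand-ins for real MCMC output.

```python
import random
import matplotlib.pyplot as plt

def trace_plots(log_posterior, log_marginal_lik, n_terminal_nodes):
    """Plot each sampled quantity against the iteration number; flat,
    well-mixing traces suggest the chain has converged."""
    fig, axes = plt.subplots(3, 1, sharex=True, figsize=(7, 6))
    series = (log_posterior, log_marginal_lik, n_terminal_nodes)
    labels = ("log posterior", "log marginal likelihood", "terminal nodes")
    for ax, s, label in zip(axes, series, labels):
        ax.plot(s)
        ax.set_ylabel(label)
    axes[-1].set_xlabel("iteration")
    plt.show()

random.seed(3)
trace_plots([random.gauss(-50, 1) for _ in range(500)],
            [random.gauss(-60, 1) for _ in range(500)],
            [random.randint(4, 8) for _ in range(500)])
```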

#### **6.6 OML's Bayesian CART approach**

OML (O'Leary, Mengersen, and Low Choy) proposed a Bayesian approach for the CART model in 2008 by extending the Bayesian approach of DMS. The two Bayesian approaches differ in aspects such as the stopping rule of the simulation algorithm and the convergence diagnostic plots, the criteria for identifying good trees, and the prior distributions considered for the parameters in the terminal nodes [88, 96, 98].

The stopping criterion of the simulation chain in OML's Bayesian classification tree approach has two steps. The first step includes plots of iterations against accuracy measures (false positive rate, false negative rate, and misclassification rate), log posterior, log likelihood, and tree size. If these plots show stability in the mentioned items, then, in a second step, the structure of the component trees (the variables and splitting rules at each splitting node) is examined in the set of good trees; if this structure has stabilized and/or the same trees remain in this set, then convergence has occurred for this simulation chain. Otherwise, the number of iterations must be increased until convergence.

The set of good trees in this Bayesian classification tree approach is determined based on the accuracy measures computed from the confusion matrix of Fielding and Bell [102]. Good trees have the lowest misclassification rate and the lowest false positive and false negative rates (or, equivalently, the highest sensitivity and specificity) [96, 98, 103]. After convergence of the simulation chain, two or three trees are selected as the best trees from the set of good trees based on criteria such as the modal tree structure (trees of the same size with the same variables and splitting rules), the lowest misclassification rate, false negative and false positive rates, and deviance, the highest posterior probability and likelihood, expert judgment, and biological interpretability [96, 98, 103].

The stopping rule of the simulation algorithm for regression trees, like that for classification trees, includes two steps. In the first step, plots of iterations are drawn against posterior probability, residual sum of squares, and deviance. If these items are stable, then the structure of the component trees is examined in the set of good trees; if this structure has stabilized, convergence has occurred for this simulation chain. Also, the set of good regression trees is selected based on having the highest posterior probability and likelihood and the lowest residual sum of squares and deviance [98].

OML compared Bayesian classification trees with the classic CART model on an ecological dataset and concluded that the Bayesian approach has a smaller false positive rate, misclassification rate, and deviance than the CART model, while the CART model has a lower false negative rate but a higher false positive rate [96]. They also indicated, in 2008, that this Bayesian approach had a lower false negative rate than the Bayesian approach of DMS, whereas the approach of DMS had a lower false positive rate and misclassification rate [96].

OML in 2009 compared the predictive performance of random forests with Bayesian classification trees on three datasets and concluded that the best tree selected by Bayesian classification trees has higher sensitivity and better accuracy than random forests. They suggested that the Bayesian approach may perform better than random forests in determining important predictor variables in datasets with a large number of noise predictor variables. OML also indicated that the Bayesian classification tree approach, unlike random forests, is not biased toward assigning observations to the largest class of the outcome variable when predicting data [103].

OML and Hu in 2011 compared the performance of Bayesian classification trees with the CART of Breiman et al. and concluded that the Bayesian approach has higher sensitivity and specificity than CART. They also investigated overfitting of the Bayesian approach using cross-validation, and the approach did not show any evidence of overfitting [98].

#### **6.7 OMML's expert elicitation for Bayesian classification tree approach**

OMML (O'Leary, Mengersen, Murray, and Low Choy) proposed a Bayesian classification tree approach in 2008, based on the computational strategy of the Bayesian classification tree approach of OML and using informative priors [96, 97]. In this Bayesian approach, informative priors are used to define Dirichlet distributions for the splitting node, splitting variable, and splitting rule as follows:

$$p(S_k|\mathcal{K}) = \text{Dir}(S_k|\alpha_{S_1}, \ldots, \alpha_{S_k}) \tag{13}$$

$$p(V_k|S_k, \mathcal{K}) = \text{Dir}(V_k|\alpha_{V_1}, \ldots, \alpha_{V_k}) \tag{14}$$

$$p(R_k|V_k, S_k, \mathcal{K}) = \text{Dir}(R_k|\alpha_{R_1}, \ldots, \alpha_{R_k}) \tag{15}$$

In the Bayesian approach of OML, there was no prior information about the splitting node, splitting variable, splitting rule, or the hyperparameters of the Dirichlet distributions in the above equations; these hyperparameters were therefore set equal to 1, so that uniform non-informative priors were used for the splitting node, splitting variable, and splitting rule [96, 98, 103]. In the new approach, an expert is asked three kinds of questions (ordering, grading, and weighting) about the splitting node, splitting variable, splitting rule, and tree size in order to define informative priors; the hyperparameters in relations (13), (14), and (15) are then determined from the answers. Three questions about the size of the tree are used to determine $\lambda$ in relation (6). DMS and OML used a weakly informative prior for tree size by setting $\lambda = 10$ [88, 96, 98, 103], but OMML, unlike DMS and OML, used an informative prior for the size of the tree [96, 97].
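A small sketch contrasting the two prior choices, assuming four candidate splitting variables; the elicited hyperparameter values are hypothetical stand-ins for weights derived from an expert's ordering, grading, and weighting answers.

```python
import numpy as np

rng = np.random.default_rng(0)

# OML's non-informative choice: all Dirichlet hyperparameters equal to 1,
# giving a uniform prior over which candidate splitting variable is used.
uniform_alpha = np.ones(4)

# OMML-style informative choice (hypothetical weights): unequal
# hyperparameters that favour the variables the expert judged most relevant.
elicited_alpha = np.array([8.0, 4.0, 2.0, 1.0])

print(rng.dirichlet(uniform_alpha))   # roughly equal selection probabilities
print(rng.dirichlet(elicited_alpha))  # probabilities skewed toward variable 1
```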

O'Leary et al. in 2008 investigated sensitivity to the choice of the hyperparameters of the informative priors for tree size, splitting nodes, splitting variables, and splitting rules in classification trees, and they concluded that the posterior distribution is relatively robust to these priors except for extreme choices [96, 97].

OMML showed by simulation that the best tree of Bayesian classification trees based on informative priors has a lower false negative rate than the best tree of Bayesian classification trees based on non-informative priors [96, 97]. They also showed the superior performance of Bayesian classification trees based on informative priors in comparison to proposed expert elicitation approaches for the Bayesian logistic regression model [97, 104–107].

#### **6.8 Other approaches for Bayesian classification and regression trees**

Pratola, like Wu et al., proposed new Metropolis-Hastings proposals for Bayesian regression trees to improve the convergence of the MCMC algorithm [108]. CGM, in 2003, proposed Bayesian treed GLMs by extending CGM's 1998 Bayesian approach [91]. Gramacy and Lee developed Bayesian treed Gaussian process models for a continuous outcome by combining standard Gaussian processes with treed partitioning [109]. Other Bayesian approaches have also been proposed for tree-based models: refer to Refs. [110–112] for other Bayesian tree approaches of CGM, to Ref. [113] for Chipman et al.'s review of advanced Bayesian treed methods, and to Refs. [114–118] for other tree-based Bayesian approaches. Also, Refs. [119, 120] propose Bayesian approaches for ensemble trees.

### **7. Criteria for determining the predictive performance of classification and regression trees**

Predictive performance of classification tree models can be compared using accuracy measures such as [17, 121]: sensitivity, specificity, false positive rate, false negative rate, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, accuracy, Youden's index, diagnostic odds ratio (DOR), F-measure, and area under the curve (AUC). Sensitivity, specificity, positive and negative predictive values, Youden's index, and accuracy take values between 0 and 1, and values near 1 indicate better predictive performance of the classification tree algorithm. False positive and false negative rates also lie between 0 and 1, and values near 0 indicate better predictive performance. Classification tree models with a positive likelihood ratio greater than 10, a negative likelihood ratio less than 0.1, and a high diagnostic odds ratio have good predictive performance. The AUC is an overall performance measure between 0 and 1; a higher value indicates better overall performance, and perfect diagnostic performance corresponds to an AUC equal to 1.
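A minimal sketch of these measures computed from a single 2×2 confusion matrix (assuming all denominators are nonzero; the AUC is omitted because it requires predicted scores rather than the four cell counts, and the counts in the example call are hypothetical):

```python
def classification_measures(tp, fp, fn, tn):
    """Accuracy measures derived from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "positive likelihood ratio": sensitivity / (1.0 - specificity),
        "negative likelihood ratio": (1.0 - sensitivity) / specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "Youden's index": sensitivity + specificity - 1.0,
        "diagnostic odds ratio": (tp * tn) / (fp * fn),
        "F-measure": 2.0 * tp / (2.0 * tp + fp + fn),
    }

for name, value in classification_measures(tp=40, fp=5, fn=10, tn=45).items():
    print(f"{name}: {value:.3f}")
```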

Predictive performance of regression tree algorithms can be compared using criteria such as [122, 123]: the Pearson correlation coefficient, root mean-squared error (RMSE), relative error (RE), mean error (ME), mean absolute error (MAE), and bias.
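A companion sketch for the regression criteria (relative error is omitted because its definition varies across the cited sources; here the mean error doubles as the bias):

```python
import math

def regression_measures(y_true, y_pred):
    """Error criteria for comparing regression tree predictions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return {
        "Pearson r": cov / (sd_t * sd_p),
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "ME (bias)": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
    }

print(regression_measures([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 6.0, 9.5]))
```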

### **8. Conclusion**


Bayesian tree approaches have some advantages over classic tree-based approaches. The classic CART model cannot explore the tree space fully, and the resulting tree is only locally optimal because of the greedy search algorithm. Bayesian tree approaches, in contrast, investigate different tree structures with different splitting variables, splitting rules, and tree sizes, so these models can explore the tree space more thoroughly than classic tree approaches; indeed, the Bayesian approaches are a remedy for this problem of the CART model. Also, CART is biased toward predictor variables with many distinct values, and Bayesian tree models can remedy this problem as well: because the Bayesian approaches proposed by CGM, DMS, OML, and WTW use uniform distributions for selecting the splitting node, splitting variables, and splitting rules, they generate unbiased splits and have no bias toward predictor variables with more potential splits. These approaches, unlike classic tree approaches, also generate several trees, which lets researchers select the best tree based on the aim of the study, since in some studies sensitivity is important to the researcher and in others specificity is important.

Some authors compared Bayesian approaches with classic tree approaches such as CART and the random forests of Breiman, among other models. Most of these papers tend to present the Bayesian method as superior to all other competitors. This can be for a variety of reasons: publication bias (methods that do not demonstrate superior performance typically do not get published), choice of examples that demonstrate the superiority of the authors' method, or more careful use of their method than of the competing methods. Studies that may give more reliable comparisons would be ones in which there is no new method and the paper is devoted to a comparison of existing approaches. For some of these papers, refer to Refs. [124–127].

According to empirical results, we can conclude that Bayesian approaches perform better than the classic CART model. However, despite the advantages of Bayesian tree approaches over classic tree models, the number of published articles that use Bayesian tree approaches for data analysis is low. One of the major reasons can be the lack of user-friendly software and the need for programming knowledge. On the other hand, papers employing the CART model, random forests, and other classic tree models are numerous, and one of the reasons for this frequency is the availability of several software programs such as CART, SPSS, TANAGRA, STATISTICA, R, and WEKA.

#### *Enhanced Expert Systems*

Bayesian tree approaches need more research, because unlike CART and random forests, these approaches cannot impute missing values. They also cannot create linear combination splits like other tree algorithms (CART, QUEST, and CRUISE); even though the interpretation of such splits is hard, results indicate that tree methods with linear combination splits have superior prediction accuracy compared with trees that use univariate splits [128].
