#### *3.1.3 MCMC diagnostics*

After a sufficiently large number of iterations, the MCMC eventually converges to the posterior distribution. A diagnostic statistic is needed to determine whether the MCMC has already converged to the stationary distribution or whether more iterations are needed. Several diagnostic statistics have been proposed; here we use the Gelman and Rubin diagnostic and graphical diagnostics.

*Gelman and Rubin diagnostic (GR)* [7]. This diagnostic uses several chains, $\{X_{i0}, \dots, X_{i,n-1}\}$, $i = 1, \dots, m$, drawn from an overdispersed density with respect to the target density $\pi(\cdot)$. In 1992, Gelman and Rubin defined two estimators of the variance of $X$ when $X \sim \pi(\theta)$:

1. The *within-chain variance*: $W = \sum_{i=1}^{m} \sum_{j=0}^{n-1} \left( X_{ij} - \bar{X}_{i\cdot} \right)^2 / \big( m(n-1) \big)$, and

2. The *pooled variance*: $\hat{V} = \dfrac{n-1}{n} W + B/n$,

where $B/n = \sum_{i=1}^{m} \left( \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot} \right)^2 / (m-1)$ is the *between-chain variance* estimate, $\bar{X}_{i\cdot}$ is the mean of chain $i$, $i = 1, \dots, m$, and $\bar{X}_{\cdot\cdot}$ is the overall mean. The potential scale reduction factor (PSRF), or Rhat, is defined by:

$$
\hat{R} = \frac{\hat{V}}{W} \tag{28}
$$

The variance in the numerator of $\hat{R}$ overestimates the target variance, while the variance in the denominator underestimates it, so $\hat{R}$ is greater than 1. A common criterion for stopping the MCMC simulation is that $\hat{R} \approx 1$ or $\hat{R} < 1.1$. The GR and effective sample size (ESS) diagnostics are implemented in the *coda* package [8].
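As an illustration, the following R sketch computes $W$, $B/n$, $\hat{V}$, and $\hat{R}$ directly from the definitions above and then obtains the same diagnostics with *coda*. The simulated draws are placeholders standing in for the output of a real MCMC sampler; note that `gelman.diag` uses a slightly refined version of the PSRF, so its value may differ marginally from the hand computation.

```r
# Hand computation of Eq. (28) and the coda diagnostics, on simulated placeholder draws.
library(coda)

set.seed(1)
n <- 1000; m <- 4
X <- matrix(rnorm(n * m), nrow = n, ncol = m)   # n iterations x m chains

W    <- mean(apply(X, 2, var))                  # within-chain variance
B_n  <- var(colMeans(X))                        # between-chain variance estimate, B/n
V    <- (n - 1) / n * W + B_n                   # pooled variance estimate
Rhat <- V / W                                   # PSRF as defined in Eq. (28)
Rhat

# Same diagnostics via coda
chains <- mcmc.list(lapply(1:m, function(i) mcmc(X[, i])))
gelman.diag(chains)     # PSRF with its upper confidence limit; values near 1 indicate convergence
effectiveSize(chains)   # effective sample size (ESS)
```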

*Graphical diagnostics*. MCMC trace plots are the most widely used diagnostic plots for assessing convergence. A trace plot is a time series showing how each Markov chain moves around its state space, that is, the value it takes at each iteration. Visible trends or changes in the dispersion of the chain trace indicate that the MCMC has not reached a stationary state. In contrast, when good mixing is observed, the MCMC sampling is said to converge to the target distribution.
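Continuing the sketch above, trace plots for the same `mcmc.list` object can be drawn with *coda*:

```r
# One trace per chain; overlapping, trend-free traces suggest good mixing.
traceplot(chains)
```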

### **3.2 Model checking and model comparison**

Any Bayesian analysis should include a check of the adequacy of the fit of the postulated model to the data. The adequacy of the fit is measured by how well the distribution of the proposed model approximates the distribution of the data; the better the fit of the postulated model to the data, the better the model. A poor fit does not necessarily mean that the model is bad, but rather that it contains deficiencies that can be improved. This section explains a model assessment method based on the posterior predictive distribution.

Let us define the replicated data $y^{rep}$ as data that could be observed tomorrow if the experiment that produced the current data $y$ were replicated with the same model and the same values of $\theta$ that produced $y$. The distribution of $y^{rep}$ given the current data $y$ is called the *posterior predictive distribution* and is defined as [9]

$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \theta)\, p(\theta \mid y)\, d\theta \tag{29}$$

If the model is accurate, that is, it has a reasonably good fit, the replicated data should be similar to the observed data.
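As a practical illustration, a graphical posterior predictive check can be obtained with the `pp_check` function of the *brms* library [6]. The data and formula below are hypothetical placeholders, not the models analyzed in this chapter; a minimal sketch:

```r
# Minimal sketch of a graphical posterior predictive check with brms.
library(brms)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)

fit <- brm(y ~ x, data = d, family = gaussian(), chains = 4, refresh = 0)

# Overlays densities of replicated data y_rep on the density of the observed y;
# close agreement suggests an adequate fit.
pp_check(fit)
```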

#### *3.2.1 Log pointwise predictive density*

The performance of the fitted model can be measured by the quality of its predictions for new data $y^f$. Pointwise predictions are predictions of each element $y_i^f$ in $y^f$, summarized using an appropriate statistic.


Access to $y^f$ is not always easy and is sometimes impossible. Instead, the performance of the fitted model can be assessed using the current data $y$. This method for calculating predictive accuracy and comparing models is known as *within-sample predictive accuracy*.

The *log pointwise predictive density* (lppd) of the model fitted to the observed data, with unknown parameter $\theta$, is defined as

$$lppd = \log \prod_{i=1}^{n} p(y_i \mid y) = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y)\, d\theta \tag{30}$$

In general, the expected predictive accuracy of a model for new data is poorer than the expected predictive accuracy of the same model for the observed data. The *computed lppd* (clppd) evaluates the expression using draws $\theta^s$, $s = 1, \dots, S$, from $p(\theta \mid y)$ obtained with MCMC, provided sufficiently many draws are used:

$$clppd = \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\left( y_i \mid \theta^{s} \right) \right) \tag{31}$$

The clppd of the observed data *y* is an overestimate of the clppd for future data.
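The following R sketch evaluates Eq. (31) given an $S \times n$ matrix of pointwise log-likelihoods, $\log p(y_i \mid \theta^s)$; such a matrix can typically be extracted from a *brms* fit with `log_lik(fit)`. The matrix below is simulated only to keep the example self-contained, and the helper `log_mean_exp` is an illustrative name, not a library function.

```r
# clppd (Eq. 31) from an S x n matrix of pointwise log-likelihoods.
set.seed(1)
S <- 1000; n <- 50
log_lik <- matrix(dnorm(rnorm(S * n), log = TRUE), nrow = S, ncol = n)  # placeholder draws

# log of the Monte Carlo average of p(y_i | theta^s), computed stably on the log scale
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))

clppd <- sum(apply(log_lik, 2, log_mean_exp))
clppd
```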

A second method to assess expected predictive accuracy is the *adjusted within-sample predictive accuracy*, which consists of a bias correction of the lppd estimated using information criteria such as the Akaike information criterion, the deviance information criterion, or the Watanabe–Akaike information criterion.
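For example, the Watanabe–Akaike (widely applicable) information criterion can be obtained directly from a fitted *brms* model; reusing the hypothetical `fit` object from the posterior predictive check above:

```r
# Reports elpd_waic, p_waic (effective number of parameters), and waic.
waic(fit)
```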

A third method to assess expected predictive accuracy is *cross-validation*, which captures the out-of-sample predictive error by fitting the model to training data and assessing the predictive fit on holdout data [9]. In model comparison, the best model is the one with the lowest predictive error. Let us explain this method in detail:

*Leave-one-out cross-validation* (LOO-CV) works with $n$ partitions in which each holdout set contains only one observation. This generates $n$ different posterior inferences, $p_{post(-i)}$, each obtained through $S$ posterior simulations, $\theta^{is}$.

The Bayesian LOO-CV estimate of the out-of-sample predictive fit is

$$lppd_{loo\text{-}cv} = \sum_{i=1}^{n} \log p_{post(-i)}(y_i) \approx \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\left( y_i \mid \theta^{is} \right) \right) \tag{32}$$

Each prediction is conditioned on $n - 1$ data points, which underestimates the predictive fit. For large $n$, the difference is negligible; however, for small $n$, a first-order bias correction $b = lppd - \overline{lppd}_{-i}$ can be used, where

$$\overline{lppd}_{-i} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \log p_{post(-i)}\left( y_j \right) \approx \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\left( y_j \mid \theta^{is} \right) \right) \tag{33}$$

The bias-corrected Bayesian LOO-CV is

$$lppd_{cloo\text{-}cv} = lppd_{loo\text{-}cv} + b \tag{34}$$

An estimate of the effective number of parameters is

$$p_{loo\text{-}cv} = lppd - lppd_{loo\text{-}cv} \tag{35}$$

When comparing two fitted models, we can estimate the difference in their expected predictive accuracy by the difference in $elppd_{loo\text{-}cv}$. The standard error of the difference can be computed using a paired estimate, which takes advantage of the fact that the same set of $n$ data points is used to fit both models.

Suppose we are comparing models I and II, with corresponding fit measures $elpd^{I}_{loo\text{-}cv}$ and $elpd^{II}_{loo\text{-}cv}$; then the difference and its standard error are

$$\begin{aligned} \text{elpd\_diff} &= elpd^{I}_{loo\text{-}cv} - elpd^{II}_{loo\text{-}cv}\\ \text{se\_diff} &= se\left( elpd^{I}_{loo\text{-}cv} - elpd^{II}_{loo\text{-}cv} \right) = \sqrt{n\, V_{i=1}^{n} \left( elpd^{I}_{loo,i} - elpd^{II}_{loo,i} \right)} \end{aligned} \tag{36}$$

When two models are compared using the LOO-CV criterion, the one with the larger $elpd_{loo\text{-}cv}$ (equivalently, the lower LOO information criterion) is declared the best model. When the comparison is carried out with the loo_compare function of the *brms* library [6], the models are ordered from best to worst, and the elpd_diff relative to the best model is reported for each model, accompanied by its se_diff. There is more evidence of the superiority of one model over another when the absolute value of elpd_diff is large relative to se_diff.
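As an illustration, the following R sketch compares two hypothetical *brms* models with LOO-CV; the formulas and data are placeholders, not models from this chapter.

```r
# Minimal sketch: LOO-CV model comparison with brms.
library(brms)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)

fit1 <- brm(y ~ 1, data = d, refresh = 0)  # model I: intercept only
fit2 <- brm(y ~ x, data = d, refresh = 0)  # model II: adds the predictor x

loo1 <- loo(fit1)        # PSIS-LOO estimate of elpd_loo for model I
loo2 <- loo(fit2)        # ... and for model II
loo_compare(loo1, loo2)  # best model first; elpd_diff and se_diff relative to it
```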
