**3. Bayesian inference**

Bayesian inference is a more attractive alternative to frequentist maximum likelihood estimation when: (1) we have information about the parameters in the model, (2) the frequentist estimation method does not converge, (3) the sample size is small at the highest level of the data, or (4) nonlinear functions of the parameters are to be estimated. With this motivation, let us define some concepts.

The heart of Bayesian inference is the posterior distribution of $\theta$, $p(\theta|\mathbf{y})$, which is obtained by conditioning the joint distribution of the observed data $\mathbf{y}$ and the parameter $\theta$, $p(\mathbf{y}, \theta) = p(\mathbf{y}|\theta)p(\theta)$, on the known value of $\mathbf{y}$, whose marginal distribution is $p(\mathbf{y}) = \int p(\theta)p(\mathbf{y}|\theta)\,d\theta$. Using Bayes' theorem, we obtain

$$p(\theta|\mathbf{y}) = \frac{p(\theta)p(\mathbf{y}|\theta)}{p(\mathbf{y})} \tag{24}$$

where $p(\mathbf{y}|\theta)$ is the *likelihood* of the data $\mathbf{y}$, and $p(\theta)$ is the *prior distribution* of $\theta$. Since $\mathbf{y}$ is fixed, $p(\mathbf{y}) = \int p(\theta)p(\mathbf{y}|\theta)\,d\theta$ is a normalization constant that does not depend on $\theta$. So, an equivalent equation to (24) is

$$p(\theta|\mathbf{y}) \propto p(\theta)p(\mathbf{y}|\theta) \tag{25}$$

Prior distributions can be *informative* or *non-informative*. When the researcher has a high degree of certainty about $\theta$, the prior distribution will have a small variance and will therefore be informative. Conversely, when the researcher has a low degree of certainty about $\theta$, the prior distribution will have a large variance and will be non-informative. Since the prior distribution is a factor of the posterior distribution, an informative prior can have a great impact on the posterior, so the researcher must be careful when using one.
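The proportionality in Eq. (25) can be illustrated numerically with a grid approximation. The following sketch uses a hypothetical binomial model with an assumed Beta(2, 2) prior on the success probability $\theta$; the data values are illustrative, not from the chapter.

```python
import numpy as np

# Hypothetical example of Eq. (25) on a grid: binomial data with an assumed
# Beta(2, 2) prior on the success probability theta; all numbers illustrative.
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter space
y, n = 7, 10                              # assumed data: 7 successes in 10 trials

prior = theta * (1 - theta)               # Beta(2, 2) kernel, up to a constant
likelihood = theta**y * (1 - theta)**(n - y)

unnormalized = prior * likelihood         # p(theta) * p(y | theta), Eq. (25)
dt = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dt)  # divide by p(y), Eq. (24)

print(theta[np.argmax(posterior)])        # posterior mode, near 8/12
```

Note that the normalizing sum plays the role of $p(\mathbf{y})$ in Eq. (24): it rescales the product of prior and likelihood into a proper density.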

Bayesian estimators are simply the mean or median vector of the posterior distribution; for the mean, $\hat{\theta} = \int \theta\,\frac{p(\theta)p(\mathbf{y}|\theta)}{p(\mathbf{y})}\,d\theta$. However, if $\theta$ is high-dimensional, this requires multiple integrals that usually have no closed-form solution. Sometimes $\theta = (\theta_a, \theta_b)$, where $\theta_b$ is a vector of nuisance parameters that must be ignored. The solution is to integrate the posterior distribution with respect to the nuisance parameters, but again, this multiple integral may have no closed-form solution.
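Both operations, marginalizing a nuisance parameter and taking a posterior mean, can be approximated numerically when no closed form exists. A minimal sketch, assuming a hypothetical two-parameter posterior with a correlated bivariate normal kernel (not a model from the chapter):

```python
import numpy as np

# Sketch: a two-parameter posterior p(theta_a, theta_b | y) evaluated on a
# grid; theta_b is a nuisance parameter. We integrate it out numerically and
# then take the posterior mean of theta_a. The kernel below is an assumption.
a = np.linspace(-3, 3, 201)
b = np.linspace(-3, 3, 201)
A, B = np.meshgrid(a, b, indexing="ij")

unnorm = np.exp(-0.5 * (A**2 - A * B + B**2))  # assumed unnormalized posterior

marg_a = unnorm.sum(axis=1)                    # integrate theta_b out
marg_a /= marg_a.sum() * (a[1] - a[0])         # normalize to a grid density
post_mean_a = np.sum(a * marg_a) * (a[1] - a[0])
print(round(post_mean_a, 3))                   # near 0 by symmetry here
```

Grid approximations like this break down as the dimension of $\theta$ grows, which is exactly why the sampling methods of the next section are needed.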

Markov chain Monte Carlo (MCMC) is the most widely used method to obtain means, medians, and quantiles of the posterior distribution.

### **3.1 Markov chain Monte Carlo**

### *3.1.1 Markov chain*

A discrete-time *Markov chain* is a sequence of random variables $X_n$, $n \geq 0$, taking values in a finite or countable set $\Omega$, that satisfies

*Bayesian Multilevel Modeling in Dental Research DOI: http://dx.doi.org/10.5772/intechopen.108442*

$$p(X_{n+1} = j \mid X_0 = i_0, \dots, X_n = i_n) = p(X_{n+1} = j \mid X_n = i_n) \tag{26}$$

for all $n$ and any states $i_0, \dots, i_n, j$ in $\Omega$. Under regularity conditions, the chain will gradually forget its initial state $i_0$, and the distribution of the chain at time $t$, $p_t(\cdot \mid X_0 = i_0)$, will converge to a unique stationary (invariant) distribution $\phi(\cdot)$ that does not depend on $t$ or $i_0$.
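This convergence is easy to see in a small example. The sketch below iterates the distribution of a two-state chain with an assumed transition matrix; after many steps it matches the stationary distribution $\phi$ satisfying $\phi P = \phi$, regardless of the starting state.

```python
import numpy as np

# A two-state chain forgets its initial state: iterating p <- p P drives the
# distribution of X_t toward the stationary phi with phi P = phi.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # assumed transition probabilities

p = np.array([1.0, 0.0])            # start deterministically in state 0
for _ in range(100):
    p = p @ P                       # distribution of X_{t+1} from that of X_t

# Solving phi P = phi for this P gives phi = (0.8, 0.2).
phi = np.array([0.8, 0.2])
print(np.allclose(p, phi))
```

Starting instead from `p = np.array([0.0, 1.0])` gives the same limit, which is the "does not depend on $i_0$" part of the claim.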

As the number of sampled points $\{X_t\}$ increases, they will look more and more like dependent samples from $\phi(\cdot)$. The *burn-in* of an MCMC run is the number of initial iterations, $m$, to discard so that the remaining draws behave like dependent samples from the stationary distribution $\phi(\cdot)$ [4]. When the number of burn-in samples is $m$, an estimator of the expectation of $f(\mathbf{X})$ is

$$\hat{E}[f(\mathbf{X})] = \frac{1}{n-m} \sum_{t=m+1}^{n} f(\mathbf{X}_t) \tag{27}$$
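The estimator in Eq. (27) can be sketched with a simple random-walk Metropolis chain. The target here is a standard normal, the starting point is deliberately far from the target's support, and the burn-in length $m$ is an illustrative choice, not a recommendation.

```python
import numpy as np

# Random-walk Metropolis chain targeting N(0, 1), illustrating Eq. (27):
# discard the first m draws, then average f(X_t) over the rest (f = identity).
rng = np.random.default_rng(0)
n, m = 20000, 2000                  # total iterations and burn-in (assumed)
x = 10.0                            # deliberately bad starting point
chain = np.empty(n)
for t in range(n):
    prop = x + rng.normal(scale=1.0)
    # accept with probability min(1, pi(prop)/pi(x)) for pi = N(0, 1)
    if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
        x = prop
    chain[t] = x

estimate = chain[m:].mean()          # Eq. (27) with f(x) = x
print(round(estimate, 2))            # should be close to the true mean, 0
```

Averaging over the full chain instead of `chain[m:]` would bias the estimate toward the starting value 10, which is what discarding the burn-in prevents.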

### *3.1.2 Hamiltonian Monte Carlo*

Gibbs sampling and random-walk Metropolis are methods whose sampling distributions converge to the target distribution; however, complex models with a large number of parameters may require an unacceptably long time to converge. This problem is largely caused by the inefficient random-walk exploration of the parameter space.

The Hamiltonian Monte Carlo (HMC) algorithm, or hybrid Monte Carlo algorithm, eliminates random walks by introducing momentum variables that transform the target-distribution sampling problem into a Hamiltonian dynamics simulation problem. The Störmer–Verlet "leapfrog" integrator is used to simulate the time evolution of this system. Given a step size $\varepsilon$ and a number of steps $L$, at each iteration $m$ the HMC algorithm resamples the momentum variables $r$ from a standard multivariate normal distribution (which can be considered a Gibbs sampling update) and then applies $L$ leapfrog updates to the position and momentum variables ($\theta$ and $r$) to generate a proposed pair $(\tilde{\theta}, \tilde{r})$, which is accepted or rejected according to the Metropolis algorithm; if accepted, $\theta^m = \tilde{\theta}$ and $r^m = \tilde{r}$. For more details, see [5]. In general, specifying the step size $\varepsilon$ and the number of steps $L$ is quite difficult: poor choices yield trajectories that are too short, too long, or too straight.
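The steps above can be sketched for the simplest possible case, a standard normal target, where the potential energy is $U(\theta) = \theta^2/2$. The values of $\varepsilon$ and $L$ below are assumptions chosen to work for this one-dimensional target, not tuned recommendations.

```python
import numpy as np

# Sketch of HMC for a standard normal target: U(theta) = theta^2 / 2, so
# grad U(theta) = theta. Momentum is refreshed each iteration, L leapfrog
# steps of size eps generate the proposal, and a Metropolis test decides.
def grad_U(theta):
    return theta                       # gradient of U for the N(0, 1) target

def leapfrog(theta, r, eps, L):
    r = r - 0.5 * eps * grad_U(theta)  # initial half step for momentum
    for _ in range(L - 1):
        theta = theta + eps * r        # full step for position
        r = r - eps * grad_U(theta)    # full step for momentum
    theta = theta + eps * r
    r = r - 0.5 * eps * grad_U(theta)  # final half step for momentum
    return theta, r

rng = np.random.default_rng(1)
theta, draws = 0.0, []
for _ in range(5000):
    r = rng.normal()                   # momentum refresh (Gibbs-like update)
    theta_p, r_p = leapfrog(theta, r, eps=0.3, L=10)
    # Hamiltonian H = U(theta) + r^2 / 2; accept with prob min(1, exp(-dH))
    dH = (theta_p**2 + r_p**2 - theta**2 - r**2) / 2.0
    if np.log(rng.uniform()) < -dH:
        theta = theta_p
    draws.append(theta)

print(round(np.mean(draws), 2))        # samples should center near 0
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance rate stays high even though each proposal travels far from the current point, which is precisely how HMC avoids random-walk behavior.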

This method for generating MCMC samples is implemented in the brms package [6], which performs Bayesian estimation of multilevel models.
