**2. Bayesian inference with intractable distributions**

Bayesian inference assumes that prior knowledge is available and should be exploited to improve the estimation accuracy. Such knowledge is usually represented as a prior distribution over the parameters to be estimated; it can be provided by domain expertise or otherwise be selected subjectively. As already mentioned, nonlinear and high-dimensional models are rarely tractable analytically. This section presents four basic strategies for dealing with intractable distributions in Bayesian inference. The sampling methods define a proposal distribution that is much easier to sample from; most of these methods form a Markov chain of samples, which gradually converges to the desired distribution. The filtering methods update the estimates iteratively, so they are particularly suitable for streaming data; they are also more efficient in higher dimensions than the pure sampling methods. The approximation methods strive to improve the efficiency, even at the cost of estimation accuracy. The last group comprises the likelihood-free methods, which are particularly suited to parameter estimation using simulations.

#### **2.1 Sampling methods**

Consider, for example, the expectation (3). The integral (3) may be not only mathematically but also numerically intractable, provided that the distribution $p(\mathbf{x})$ is difficult or even impossible to sample from. A common strategy is then to choose and sample from another, simpler distribution, $q(\mathbf{x})$, such that $q(\mathbf{x}) > 0$ whenever $f(\mathbf{x})\,p(\mathbf{x}) \neq 0$. The resulting method is known as importance sampling (IS), and $q(\mathbf{x})$ is referred to as the proposal distribution. Eq. (3) is then rewritten as

$$\int f(\mathbf{x}) \, p(\mathbf{x})\, d\mathbf{x} = \int f(\mathbf{x}) \frac{p(\mathbf{x})}{q(\mathbf{x})}\, q(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N} \sum\_{i=1}^{N} f\left(\mathbf{x}^{(i)}\right) \underbrace{\frac{p\left(\mathbf{x}^{(i)}\right)}{q\left(\mathbf{x}^{(i)}\right)}}\_{w^{(i)}} \tag{5}$$

where the sum is an unbiased and consistent estimator of the expectation due to the law of large numbers.

Although the estimate (5) is unbiased and consistent, the number of samples required grows exponentially with the number of dimensions. It is also desirable to make the weights, $w^{(i)}$, more uniform; that is, $q(\mathbf{x})$ should approximate $p(\mathbf{x})$, in order to reduce the variance of the estimator in (5). Alternatively, the distribution $p(\mathbf{x})$ can be approximated by a histogram, whose bins can be optimized for maximum efficiency; such a histogram can be sampled directly using a uniform random number generator.
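As a minimal sketch (not from the chapter), the IS estimator (5) can be demonstrated for an illustrative standard normal target $p$, a wider normal proposal $q$, and the test function $f(x) = x^2$, whose exact expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Illustrative target p(x) = N(0, 1); we estimate E[x^2] = 1 without sampling p.
# Proposal q(x) = N(0, 2^2) is positive wherever f(x) p(x) is non-zero.
x = rng.normal(0.0, 2.0, size=N)                        # draw from the proposal q
log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)           # log N(x; 0, 1)
log_q = -0.5 * (x / 2.0)**2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)
w = np.exp(log_p - log_q)                               # importance weights w^(i)

estimate = np.mean(w * x**2)                            # Eq. (5): (1/N) sum f(x_i) w^(i)
```

Since the weights average to one in expectation, `w.mean()` is a quick diagnostic of how well $q$ covers $p$.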

The sampling efficiency of IS methods, particularly in high dimensions, can be improved by accounting for the likelihood of the samples. The Markov chain Monte Carlo (MCMC) strategy represents a broad variety of sampling methods in which the current sample is affected by the location of the previous sample. It allows learning complex target distributions in multiple dimensions. However, MCMC sampling requires a burn-in period to achieve convergence. Furthermore, the samples may get stuck in areas of small probability (requiring a random restart), and the correlation between consecutive samples increases the estimator variance.

Metropolis–Hastings sampling is the most commonly used MCMC algorithm. It can be succinctly described as the following three steps:

1. Given the current sample, $\mathbf{x}^{(t)}$, generate a candidate sample, $\mathbf{x}^{*} \sim q\left(\mathbf{x}|\mathbf{x}^{(t)}\right)$;
2. Compute the acceptance probability $\alpha = \min\left(1, \frac{\pi(\mathbf{x}^{*})\, q\left(\mathbf{x}^{(t)}|\mathbf{x}^{*}\right)}{\pi\left(\mathbf{x}^{(t)}\right) q\left(\mathbf{x}^{*}|\mathbf{x}^{(t)}\right)}\right)$;
3. With probability $\alpha$, set $\mathbf{x}^{(t+1)} = \mathbf{x}^{*}$; otherwise, keep $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}$.
The challenge is to find a good proposal, $q$, with a bounded support to achieve fast convergence to the target distribution, $\pi(\mathbf{x})$, which only needs to be known up to a proportionality constant. The generated samples tend to concentrate in areas of large probability, and they occasionally contain subsequences of the same value. More importantly, as with any other sampling method, there is an exploration–exploitation trade-off. A large acceptance rate means slow exploration of the target distribution, $\pi$, as well as large correlations between the samples; a small acceptance rate, on the other hand, means large jumps across the support of $\pi$. The acceptance rate can be controlled by assuming a parameterized proposal, $q\_{\phi}$. An acceptance rate of about 50% is recommended for distributions in a few dimensions, and it should be reduced to about 25% for distributions in many dimensions.
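The Metropolis–Hastings steps can be sketched for an illustrative random-walk proposal and an unnormalized Gaussian target (both are arbitrary choices for this example; only the log-target up to a constant is needed):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Unnormalized log-density of N(3, 1); the normalizing constant is irrelevant.
    return -0.5 * (x - 3.0)**2

samples = np.empty(20_000)
x, accepted = 0.0, 0
for t in range(samples.size):
    x_new = x + rng.normal(0.0, 1.0)               # 1. propose from symmetric q(.|x)
    log_alpha = log_target(x_new) - log_target(x)  # 2. acceptance ratio (q cancels)
    if np.log(rng.uniform()) < log_alpha:          # 3. accept, or keep current sample
        x, accepted = x_new, accepted + 1
    samples[t] = x

posterior_mean = samples[5_000:].mean()            # discard the burn-in period
acceptance_rate = accepted / samples.size
```

The proposal step size (here 1.0) directly controls the acceptance rate and hence the exploration–exploitation trade-off discussed above.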

In the literature, many variants of the Metropolis–Hastings sampling algorithm have been proposed, combining, for example, random-walk sampling, adaptive rejection sampling, the Langevin algorithm, and augmented estimators.

Gibbs sampling iteratively samples from the conditional densities

$$\begin{array}{lll} X\_1^{(t)} & \sim & p\_1 \left( \mathbf{x}\_1 | \mathbf{x}\_2^{(t-1)}, \mathbf{x}\_3^{(t-1)}, \dots, \mathbf{x}\_N^{(t-1)} \right) \\\\ X\_2^{(t)} & \sim & p\_2 \left( \mathbf{x}\_2 | \mathbf{x}\_1^{(t)}, \mathbf{x}\_3^{(t-1)}, \dots, \mathbf{x}\_N^{(t-1)} \right) \\\\ & \vdots \\\\ X\_N^{(t)} & \sim & p\_N \left( \mathbf{x}\_N | \mathbf{x}\_1^{(t)}, \mathbf{x}\_2^{(t)}, \dots, \mathbf{x}\_{N-1}^{(t)} \right) . \end{array} \tag{6}$$

It can be represented as a set of $N$ univariate Metropolis–Hastings samplers. The sampling dimensions can be chosen systematically or at random. Even though Gibbs sampling has a 100% acceptance rate, it can still converge slowly to the target distribution, so it is often combined with other sampling methods.
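The scheme (6) can be sketched on an illustrative bivariate normal target with correlation $\rho$, whose full conditionals are the univariate normals $x\_1|x\_2 \sim N(\rho x\_2, 1-\rho^2)$ (the target and $\rho$ are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                       # correlation of the zero-mean bivariate normal target
T = 20_000
chain = np.empty((T, 2))
x1 = x2 = 0.0
for t in range(T):
    # Sample each coordinate from its full conditional, given the latest values:
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    chain[t] = x1, x2

sample_corr = np.corrcoef(chain[5_000:].T)[0, 1]   # should approach rho
```

Note that strong correlation between the coordinates is exactly what slows the convergence of Gibbs sampling.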

Gibbs sampling motivates the Rao–Blackwell estimators

$$E[h(\mathbf{x}\_1)] \approx \frac{1}{T} \sum\_{t=1}^{T} E\left[h(\mathbf{x}\_1) \, | \, \mathbf{x}\_2^{(t)}, \, \dots, \mathbf{x}\_N^{(t)}\right] \tag{7}$$

$$p\_i(\mathbf{x}\_i) \approx \frac{1}{T} \sum\_{t=1}^{T} p\_i\left(\mathbf{x}\_i | \mathbf{x}\_j^{(t)}, j \neq i\right) \tag{8}$$

of the function expectation and of the marginal distribution, respectively. These estimators are unbiased and have a lower variance, since

$$\mathrm{var}\left[E\left[h(\mathbf{x}\_1)|\mathbf{x}\_2^{(t)},\dots,\mathbf{x}\_N^{(t)}\right]\right] \le \mathrm{var}[h(\mathbf{x}\_1)].\tag{9}$$
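The variance reduction (9) can be checked numerically on the same illustrative bivariate normal (with $E[x\_1|x\_2] = \rho x\_2$ known in closed form; all constants are example choices), comparing the naive average (7) against its Rao–Blackwellized counterpart over repeated experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8
T, R = 100, 2_000          # samples per experiment, number of repeated experiments

plain, rao_black = np.empty(R), np.empty(R)
for r in range(R):
    # Exact joint draws (x1, x2) from the zero-mean bivariate normal target:
    x2 = rng.normal(size=T)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    plain[r] = x1.mean()                  # naive estimator of E[h(x1)] with h(x)=x
    rao_black[r] = (rho * x2).mean()      # Rao-Blackwell: average of E[x1 | x2]

var_plain, var_rb = plain.var(), rao_black.var()   # var_rb should be smaller
```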

An alternative strategy combines importance sampling with MCMC sampling, assuming the independent proposals $\prod\_{i=1}^{N} q\_i(\mathbf{x}\_i)$ or $\prod\_{i=1}^{N} q\_t\left(\mathbf{x}\_i | \mathbf{x}\_i^{(t-1)}\right)$.

An interesting problem is how to design MCMC sampling when the target distribution is not stationary. Adaptive MCMC sampling can be trained online as new data are generated. However, using past samples for adaptation invalidates the Markov assumption, so the convergence to the target distribution may be problematic, unless the adaptation ceases after the burn-in period.

Hamiltonian MCMC sampling tracks the movement of a hypothetical ball under potential and kinetic energy constraints. The new sample, representing the updated ball location, is obtained by integrating the ball's velocity; the integration can be approximated, for example, by discretization. More importantly, the interval of integration trades off the acceptance rate against the sampling rate, and thus also the amount of correlation between subsequent samples and the waiting time. The main advantage of the Hamiltonian MCMC method is fast mixing; that is, the samples converge quickly to the target distribution.
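A minimal Hamiltonian MCMC sketch for an illustrative standard normal target (potential $U(x)=x^2/2$), using the common leapfrog discretization of the ball's dynamics; the step size and trajectory length are example choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def leapfrog(x, p, eps, L):
    # Discretized Hamiltonian dynamics for U(x) = x^2 / 2, i.e. dU/dx = x.
    grad = lambda y: y
    p = p - 0.5 * eps * grad(x)        # initial half-step for the momentum
    for _ in range(L - 1):
        x = x + eps * p                # full position step
        p = p - eps * grad(x)          # full momentum step
    x = x + eps * p
    p = p - 0.5 * eps * grad(x)        # final half-step
    return x, p

samples = np.empty(5_000)
x = 0.0
for t in range(samples.size):
    p0 = rng.normal()                                # resample the momentum ("kick")
    x_new, p_new = leapfrog(x, p0, eps=0.2, L=10)    # simulate the ball's movement
    # Accept with probability exp(H_old - H_new), compensating discretization error:
    h_old = 0.5 * x**2 + 0.5 * p0**2
    h_new = 0.5 * x_new**2 + 0.5 * p_new**2
    if np.log(rng.uniform()) < h_old - h_new:
        x = x_new
    samples[t] = x

hmc_mean, hmc_var = samples.mean(), samples.var()
```

Longer trajectories (larger `eps * L`) decorrelate consecutive samples but cost more gradient evaluations per sample, which is the trade-off noted above.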

Another sampling method motivated by models in statistical physics is the restricted Boltzmann machine (RBM). This method learns the target distribution as a configuration, $h$, of a bipartite graph. The target distribution is $p(h) = \exp(-E(h))/Z$, where $E(h)$ is the energy of the configuration $h$, and $Z = \sum\_{h} \exp(-E(h))$ is the partition (scaling) function.

Reversible-jump MCMC sampling can be used for distributions having an uncertain number of parameters (the model order). The idea is to propose, with certain probabilities, moves that change the number of parameters every time a sample is taken from the proposal distribution.

The common pitfall of all sampling methods is recognizing that the generated samples are not converging to the target distribution, for example, because the assumptions were violated or the method was not implemented correctly. Such issues may not be easily detected by simply inspecting the samples or the simulation outputs.

#### **2.2 Filtering methods**

Time-series and streaming data must often be processed recursively. The corresponding data model can usually be represented as a dynamic system, where the current state, $\mathbf{x}\_t$, is updated from the previous state, $\mathbf{x}\_{t-1}$, by an innovation random process, and the states, $\mathbf{x}\_t$, are observed as values, $\mathbf{z}\_t$. The states form a Markov chain, that is, $p(\mathbf{x}\_t | \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_1) = p(\mathbf{x}\_t | \mathbf{x}\_{t-1})$, and the observations are assumed to be conditionally independent, that is, $p(\mathbf{z}\_i, \mathbf{z}\_j | \mathbf{x}\_t) = p(\mathbf{z}\_i | \mathbf{x}\_t)\, p(\mathbf{z}\_j | \mathbf{x}\_t)$. By Bayes' theorem,

$$p(\mathbf{x}\_{1:t}|\mathbf{z}\_{1:t}) = \frac{p(\mathbf{z}\_t|\mathbf{x}\_t)\,p(\mathbf{x}\_{1:t}|\mathbf{z}\_{1:t-1})}{p(\mathbf{z}\_t|\mathbf{z}\_{1:t-1})},\tag{10}$$

the states can be estimated recursively from the incoming observations as a two-step procedure. The first step predicts the (distribution of) current state as

$$p(\mathbf{x}\_t|\mathbf{z}\_{1:t-1}) = \int p(\mathbf{x}\_t|\mathbf{x}\_{t-1}) p(\mathbf{x}\_{t-1}|\mathbf{z}\_{1:t-1}) \, d\mathbf{x}\_{t-1}.\tag{11}$$

The predicted state is corrected in the second step after receiving the latest observation, *zt*, as

$$p(\mathbf{x}\_{t}|\mathbf{z}\_{1:t}) = \frac{p(\mathbf{z}\_{t}|\mathbf{x}\_{t})p(\mathbf{x}\_{t}|\mathbf{z}\_{1:t-1})}{\int p(\mathbf{z}\_{t}|\mathbf{x}\_{t})p(\mathbf{x}\_{t}|\mathbf{z}\_{1:t-1})d\mathbf{x}\_{t}}.\tag{12}$$

For linear systems with Gaussian inputs, the processes remain Gaussian, so only their mean values, $E[\mathbf{x}\_{1:t}|\mathbf{z}\_{1:t}]$, and covariances, $\mathrm{cov}[\mathbf{x}\_{1:t}|\mathbf{z}\_{1:t}]$, need to be tracked. In such a case, the recursive estimator implementing (11) and (12) is known as the Kalman filter. Kalman filtering can be succinctly described as

$$\hat{\mathbf{x}}\_t = \left(\text{prediction of } \mathbf{x}\_t\right) + \mathbf{K}\_t \left[\mathbf{z}\_t - \left(\text{prediction of } \mathbf{z}\_t\right)\right] \tag{13}$$

where $\mathbf{K}\_t$ is the Kalman gain at time $t$. It can be shown that the Kalman filter is unbiased and optimum in the sense of having the smallest possible variance; it belongs to the class of best linear unbiased estimators (BLUE). Practical implementations of the Kalman filter are usually concerned with numerical stability and process observability. The posterior distribution of the model states can be estimated using kernel and other smoothing methods.
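The predict–correct recursion (11)–(13) can be sketched for an illustrative scalar random-walk state observed in noise (all noise variances are example choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Scalar model: x_t = x_{t-1} + w_t with w ~ N(0, q_var); z_t = x_t + v_t, v ~ N(0, r_var).
q_var, r_var = 0.01, 1.0
T = 200
true_x = 5.0 + np.cumsum(rng.normal(0.0, np.sqrt(q_var), T))
z = true_x + rng.normal(0.0, np.sqrt(r_var), T)

x_hat, P = 0.0, 10.0               # initial state estimate and its variance
estimates = np.empty(T)
for t in range(T):
    # Predict (Eq. 11): the random walk keeps the mean, grows the variance.
    P = P + q_var
    # Correct (Eqs. 12-13): blend the prediction with the latest observation.
    K = P / (P + r_var)            # Kalman gain K_t
    x_hat = x_hat + K * (z[t] - x_hat)
    P = (1 - K) * P
    estimates[t] = x_hat

mse_filter = np.mean((estimates[50:] - true_x[50:])**2)
mse_raw = np.mean((z[50:] - true_x[50:])**2)
```

After the transient, the filtered estimate has a much smaller error than the raw observations, reflecting the BLUE property.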

The Kalman filter can be modified to work with non-linear models. In particular, the extended Kalman filter performs local linearization at each iteration, assuming a first-order Taylor expansion; this can, however, cause large estimation errors and stability issues. The unscented Kalman filter defines a set of sigma points, which are tracked through the non-linearity; the weighted and transformed sigma points are used to reconstruct the mean and the covariance of a Gaussian distribution, so the canonical Kalman filter can then be used. The ensemble Kalman filter has been adopted for large but sparse systems. Expectation maximization (EM) filtering is used when there are also unknown model parameters; see Eqs. (16) and (17) below.

In practice, having an approximate solution for a complex model is often much more useful than having the exact solution of a simplified model. This has motivated the development of particle filters to estimate random signals under non-linear and non-Gaussian conditions. The key idea is to represent the posterior distribution by a set of weighted random samples (called particles), $\tilde{\mathbf{x}}\_i$, that is,

$$p(\mathbf{x}) \approx \sum\_{i=1}^{N} w\_i \; \delta(\mathbf{x} - \mathbf{\tilde{x}}\_i) \quad \Rightarrow \quad E[f(\mathbf{x})] \approx \sum\_{i=1}^{N} w\_i f(\mathbf{\tilde{x}}\_i) \tag{14}$$

where $\delta$ denotes the Dirac delta function. However, the number of required particles grows quickly with the number of dimensions.

The particles should be chosen so that high-density regions are represented by many samples with large weights. When the particles are generated using MCMC sampling strategies, the resulting recursive Bayesian estimator is referred to as a sequential MC (SMC) method. Since, after only a few iterations, most particles often have negligible weights (the so-called particle degeneracy problem), particle resampling (with replacement and proportionally to the weights) is necessary in order to recover a good representation of the underlying distribution while maintaining a constant number of particles.
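A bootstrap particle filter with resampling at every step can be sketched for an illustrative linear-Gaussian model (chosen only so that its behavior is easy to verify; the model constants are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative model: x_t = 0.9 x_{t-1} + w_t, z_t = x_t + v_t, with w, v ~ N(0, 1).
T, N = 200, 1_000
true_x = np.zeros(T)
for t in range(1, T):
    true_x[t] = 0.9 * true_x[t - 1] + rng.normal()
z = true_x + rng.normal(size=T)

particles = rng.normal(size=N)
means = np.empty(T)
for t in range(T):
    particles = 0.9 * particles + rng.normal(size=N)   # propagate via p(x_t | x_{t-1})
    log_w = -0.5 * (z[t] - particles)**2               # weight by likelihood p(z_t | x_t)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    means[t] = np.sum(w * particles)                   # Eq. (14) with f(x) = x
    # Resample with replacement, proportionally to the weights (degeneracy fix):
    particles = particles[rng.choice(N, size=N, p=w)]

pf_mse = np.mean((means - true_x)**2)
obs_mse = np.mean((z - true_x)**2)
```

In practice, resampling is often triggered only when the effective sample size drops below a threshold, rather than at every step as in this sketch.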

*Bayesian Methods and Monte Carlo Simulations DOI: http://dx.doi.org/10.5772/intechopen.108699*

The expectation maximization (EM) method has been developed to deal with missing data and to filter data under parameter uncertainty. Given the observed but incomplete data, $\mathbf{x}$, and the missing data, $\mathbf{y}$, the data likelihood conditioned on an unknown parameter, $\theta$, can be computed as

$$p(\mathbf{x}|\theta) = \int p(\mathbf{x}, \mathbf{y}|\theta)\,d\mathbf{y} = \int p(\mathbf{x}|\mathbf{y}, \theta)\,p(\mathbf{y}|\theta)\,d\mathbf{y}.\tag{15}$$

The integral (15) is evaluated iteratively by first computing the Q-function (the E-step) as

$$Q(\theta, \theta\_{n-1}) = E\left[\log p(\mathbf{x}, \mathbf{y} | \theta)\,|\,\mathbf{x}, \theta\_{n-1}\right] \tag{16}$$

followed by estimating the parameter *θ* (the M-step) as

$$\theta\_n = \operatorname{argmax}\_{\theta} Q(\theta, \theta\_{n-1}).\tag{17}$$

This procedure guarantees that the likelihood is non-decreasing, since always $p(\mathbf{x}|\theta\_n) \geq p(\mathbf{x}|\theta\_{n-1})$. However, Eqs. (16) and (17) are normally analytically intractable, so they must be evaluated numerically.
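The E-step (16) and M-step (17) can be sketched for a classic example: a two-component Gaussian mixture whose component labels play the role of the missing data $\mathbf{y}$ (equal component weights and unit variances are assumptions of this sketch, matching the simulated data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Observed data x from a two-component mixture; the labels y are "missing".
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-1.0, 1.0])          # initial guesses for the component means
for _ in range(50):
    # E-step: responsibilities E[y | x, theta_{n-1}] under equal weights, unit variances.
    d = -0.5 * (x[:, None] - mu[None, :])**2
    r = np.exp(d - d.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximizing Q(theta, theta_{n-1}) gives responsibility-weighted means.
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

mu = np.sort(mu)                    # recovered means, close to (-2, 3)
```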

#### **2.3 Approximation methods**

The sampling methods for performing Bayesian inference, which mostly involve MCMC sampling, are asymptotically exact. However, these methods are also numerically expensive, especially for high-dimensional models. When working with big data, or when there is a need to choose among many plausible models, a less accurate but numerically cheaper solution is highly desirable. Variational inference deterministically approximates the log-evidence $\log p(\mathbf{x})$ in Bayes's theorem (1) as

$$\log p(\mathbf{x}) = \mathcal{L} + \text{KL}[q(\theta) \| p(\theta|\mathbf{x})] \ge \mathcal{L} \tag{18}$$

where $\mathcal{L}$ is the evidence lower bound (ELBO), since the Kullback–Leibler (KL) divergence is always non-negative. Therefore, given the evidence $p(\mathbf{x})$, minimizing the KL divergence between the approximation, $q(\theta)$, and the posterior, $p(\theta|\mathbf{x})$, is equivalent to maximizing the ELBO, $\mathcal{L}$. Alternatively, it is possible to approximate the prior, $p(\theta)$, instead of the posterior. These are functional optimization problems defined as $\frac{d}{dq}\mathcal{L}(q) = 0$, s.t. $\int q(\theta)\,d\theta = 1$, with the optimum solutions $q^{*}(\theta) = p(\theta|\mathbf{x})$ or $q^{*}(\theta) = p(\theta)$.

Any function, $q(\theta)$, can be assumed, but some choices are better than others. The mean-field approximation assumes that the dimensions are independent, that is, $q(\theta) = \prod\_{i=1}^{M} q\_i(\theta\_i)$. Each $q\_i(\theta\_i)$ is then optimized in turn until convergence, indicated by no further increase in the ELBO value. There is a performance penalty if the independence assumption is not satisfied. Another popular choice is to assume a class of distributions, $q\_{\phi}(\theta)$, parameterized by $\phi$.
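The parameterized-family idea behind (18) can be sketched on an illustrative conjugate toy model, where the exact posterior is known and can confirm that maximizing the ELBO recovers it; the model (prior $\theta \sim N(0,1)$, likelihood $z|\theta \sim N(\theta,1)$, observation $z=2$) and the grid search standing in for gradient ascent are assumptions of this example:

```python
import numpy as np

# Toy model with a known answer: the exact posterior is N(1, 1/2).
z = 2.0

def elbo(m, s):
    # Closed-form ELBO for a Gaussian approximation q(theta) = N(m, s^2):
    # ELBO = E_q[log p(z|theta)] + E_q[log p(theta)] + H[q].
    e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((z - m)**2 + s**2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return e_loglik + e_logprior + entropy

# Grid search over the variational parameters (a stand-in for gradient ascent):
ms = np.linspace(-1.0, 3.0, 401)
ss = np.linspace(0.1, 2.0, 191)
M, S = np.meshgrid(ms, ss)
i, j = np.unravel_index(np.argmax(elbo(M, S)), M.shape)
m_best, s_best = M[i, j], S[i, j]      # should approach the exact N(1, 1/2)
```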

The mean-field assumption is implemented in the coordinate ascent variational inference (CAVI) algorithm, which is related to Gibbs sampling and message passing. However, this method is not computationally efficient and only works for distributions that are conditionally conjugate to the prior. Stochastic variational inference (SVI) improves the efficiency as well as the accuracy of CAVI by assuming a stochastic approximation of the ELBO gradient and by optimizing the parameters using only a subset of the data in each iteration (similarly to the EM method). Unfortunately, both CAVI and SVI require deriving the approximation components, $q\_i$, analytically, which is often impractical.

Motivated by the limitations of conventional variational inference methods, the black box variational inference (BBVI) algorithm samples from a family of distributions, $q\_{\phi}(\theta)$, assuming that the joint distribution $p(\mathbf{x}|\theta)\,p(\theta) = p(\mathbf{x}, \theta)$ is known. The objective is to avoid the need for analytical derivations. The BBVI method can be used to estimate a wide range of linear and non-linear models. Since it is also the most efficient method for performing variational inference, it has become available as a library in many programming languages.

The divergence between the prior or posterior and its approximation cannot be minimized directly but only by maximizing the ELBO. Although the ELBO is guaranteed not to decrease at each iteration, the convergence can be slow; nevertheless, the efficiency of variational methods is still better than that of the sampling and EM methods. The KL divergence can be replaced with alternative measures such as the $\alpha$-divergence, the $f$-divergence, and mutual information. In addition, there are only a few theoretical guarantees: the overall estimation accuracy of variational methods is crucially affected by not knowing how well the posterior (or prior) has been approximated, and approximating asymmetric distributions is more challenging.

#### **2.4 Likelihood-free inference**

In many practical scenarios, a mathematical model may be available only as a simulation. It is often easy to simulate the model for different parameter values. Thus, by generating and comparing many simulation outputs for different parameter values, the simulation run that best matches the observed data can be identified, and the parameter values of this run are declared to be the estimate. The inference methods implementing this simple idea are known as likelihood-free. They can be found in the literature as indirect inference, bootstrap filtering, synthetic likelihood, and other methods. However, the most popular among these methods is approximate Bayesian computation (ABC).

The ABC method only requires that a generative model for the data is available. There is otherwise no need to assume, know, or calculate the model posterior, parameter likelihoods, or importance weights. The basic ABC rejection sampling is performed by repeating the following three steps:

1. Generate a parameter sample, $\theta$, from the prior, $p\_0(\theta)$;
2. Simulate the data, $\mathbf{x}$, from the generative model assuming $\theta$;
3. Accept $\theta$ if the simulated data are within a distance $\varepsilon$ of the observed data.
The posterior distribution, $\pi\_{\varepsilon}(\theta|\mathbf{x})$, can be estimated after enough samples, $\theta$ and $\mathbf{x}$, have been collected. It is clear that the number of samples required to accurately estimate $\pi\_{\varepsilon}(\theta|\mathbf{x})$ grows quickly with the number of dimensions. The exact inference is obtained in the limit $\lim\_{\varepsilon \to 0} \pi\_{\varepsilon}(\theta) = p(\theta|\mathbf{x})$. On the other hand, no learning occurs if $\lim\_{\varepsilon \to \infty} \pi\_{\varepsilon}(\theta) = p\_0(\theta)$, the prior.
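ABC rejection sampling can be sketched on an illustrative simulator (Gaussian data with unknown mean, uniform prior, and the sample mean as the comparison statistic; all of these are assumptions of this example, and only forward simulation is used, never the likelihood):

```python
import numpy as np

rng = np.random.default_rng(8)

# "Black-box" simulator: data are N(theta, 1) with n observations.
theta_true, n = 1.5, 50
x_obs = rng.normal(theta_true, 1.0, n)
s_obs = x_obs.mean()                      # statistic used to compare datasets

eps = 0.05
accepted = []
for _ in range(50_000):
    theta = rng.uniform(-5.0, 5.0)        # 1. draw theta from the prior p_0
    x_sim = rng.normal(theta, 1.0, n)     # 2. simulate data for this theta
    if abs(x_sim.mean() - s_obs) < eps:   # 3. accept if the outputs are close
        accepted.append(theta)

accepted = np.array(accepted)
abc_mean = accepted.mean()                # approximates the posterior mean
```

Shrinking `eps` sharpens the approximate posterior but rejects more simulations, which is the trade-off governed by the limits above.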


The efficiency of the ABC method can be improved by comparing summary statistics, $S(\mathbf{x})$, rather than the data, $\mathbf{x}$, directly. If $S(\mathbf{x})$ is a sufficient statistic for estimating the parameter $\theta$, then $\pi\_{\varepsilon}(\theta|S(\mathbf{x})) = \pi\_{\varepsilon}(\theta|\mathbf{x})$; in practice, however, a merely informative statistic is often used.

Even though the ABC method can be readily parallelized, it can still be very inefficient, especially in high dimensions or if the informative statistic has been poorly chosen. Since the default ABC method does not exploit information about the accepted and rejected values of $\theta$, the efficiency can be improved by employing MCMC sampling. The resulting ABC-MCMC algorithm performs the following four steps:

1. Augment the posterior distribution as $p(\tilde{\theta}, \tilde{\mathbf{x}}|\mathbf{x})$;

2. At the current state, $\tilde{\theta}$, generate a new sample, $\theta \sim q(\theta|\tilde{\theta})$;

3. Simulate the data, $\mathbf{x}$, for the new state, $\theta$;

4. Accept the new state, $\theta$, with some probability, $\alpha(\theta, \mathbf{x})$.
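The four steps can be sketched on the same illustrative Gaussian-mean simulator as plain ABC rejection; a flat prior and a symmetric random-walk proposal are assumed, so the acceptance probability $\alpha$ reduces to an indicator on the simulated summary:

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative simulator: data are N(theta, 1) with n observations.
theta_true, n, eps = 1.5, 50, 0.1
x_obs = rng.normal(theta_true, 1.0, n)
s_obs = x_obs.mean()

theta = 0.0                                   # current state theta~
chain = np.empty(30_000)
for t in range(chain.size):
    theta_new = theta + rng.normal(0.0, 1.0)  # 2. propose from q(theta | theta~)
    x_sim = rng.normal(theta_new, 1.0, n)     # 3. simulate data for the new state
    # 4. accept only if the simulated summary is close enough (flat prior and
    #    symmetric proposal make the Metropolis ratio an indicator):
    if abs(x_sim.mean() - s_obs) < eps:
        theta = theta_new
    chain[t] = theta

abc_mcmc_mean = chain[5_000:].mean()
```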

This algorithm can be fine-tuned by gradually reducing $\varepsilon$ in order to balance rejections against the estimation accuracy, while the convergence requires $\varepsilon$ to be sufficiently small. Unlike pure MCMC sampling, the ABC-MCMC method requires a small acceptance rate in order to achieve good accuracy.
