**3. Other topics in Bayesian analysis**

Bayesian frameworks find many real-world applications in data analysis and in data-driven planning and decision making. Bayesian experiment design is concerned with choosing experiments and models that provide the largest information gain or utility. Bayesian hypothesis testing incorporates prior knowledge to bias the acceptance or rejection of a hypothesis, and the same framework can be used to make statistically informed decisions. Bayesian machine learning, which has become increasingly popular in recent years, treats data labels as random variables; both supervised and semi-supervised strategies are discussed below. Bayesian optimization shows remarkable promise for optimizing complex, difficult-to-evaluate systems by treating the observed system responses as samples of a random process.

#### **3.1 Bayesian experiment design**

In Bayesian experiment design, the task is to choose the optimum experiment and possibly also a data model [9]. This can be probabilistically represented as the augmented posterior (1), that is,

$$p(\theta|\mathbf{x},d,m) = \frac{p(\mathbf{x}|\theta,d,m)p(\theta|m)}{p(\mathbf{x}|d,m)}\tag{19}$$

where *d* and *m*, respectively, represent the experiment and the model choice. More specifically, *θ* denotes uncontrolled inputs to the experiment as well as unknown model parameters, whereas *d* denotes the experiment inputs that can be controlled, that is, designed by selection. The objective is to specify the optimum design, *d*, in order to facilitate the estimation of *θ* as well as the selection of the best model, *m*, for the observations, **x**.

For every configuration of the experiment, including the model selection, let *U*(*d*, **x**, *m*, *θ*) be the perceived utility, for instance, the amount of information that can be gained from the experiment. Since the observations are random and the data model is unknown, the optimum experiment maximizes the average utility,

$$d^\* = \text{argmax}\_{d \in \mathcal{D}} \overline{U}(d) = \text{argmax}\_{d \in \mathcal{D}} E\_{\mathbf{x}, m, \theta}[U(d, \mathbf{x}, m, \theta)]. \tag{20}$$
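As a toy illustration of Eq. (20), the expectation can be approximated by Monte Carlo sampling over the prior and the observation model. The sketch below rests on illustrative assumptions not taken from the text: a Beta-Bernoulli model, a utility defined as the reduction in posterior variance of *θ*, and a two-element design space where the design only sets the number of trials.

```python
import random

def posterior_var(a, b):
    # variance of a Beta(a, b) distribution
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_utility(n_trials, n_mc=20000, seed=0):
    """Monte Carlo estimate of E_{x,theta}[U(d, x, theta)], where the
    utility is the reduction in posterior variance of theta under a
    Beta(1,1) prior and the design d fixes the number of Bernoulli trials."""
    rng = random.Random(seed)
    prior_var = posterior_var(1, 1)
    total = 0.0
    for _ in range(n_mc):
        theta = rng.random()  # theta ~ Uniform(0,1), i.e., the Beta(1,1) prior
        k = sum(rng.random() < theta for _ in range(n_trials))  # x ~ Binomial
        total += prior_var - posterior_var(1 + k, 1 + n_trials - k)
    return total / n_mc

# hypothetical design space: how many trials each candidate experiment allows
designs = {"cheap": 2, "precise": 10}
best = max(designs, key=lambda d: expected_utility(designs[d]))
```

As expected, the design with more trials yields the larger average variance reduction and is selected as `best`.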

Alternatively, the model selection can be formulated as a hypothesis-testing problem. The observed data and models are combined to obtain and then sample the predictive distribution for each model candidate. Additional experiments can be performed as needed in order to maximize reduction in the model uncertainty. A mismatch between the supports of the model prior and the model likelihood indicates that the model has either too few or too many parameters, and there is not enough evidence to make the model selection with a high confidence [10].

Instead of choosing one best model, the predictions from multiple models can be combined. The joint prediction has potentially much larger discriminatory power than individual predictions [10]. Moreover, how well the model can describe the majority of outcomes from many random experiments under different experimental conditions can be as important as the model likelihood.

The information value of an experiment can be quantified as the Fisher information matrix with the entries

$$[\mathcal{I}(\theta)]\_{ij} = E\left[ \left( \frac{\partial}{\partial \theta\_i} \log p(X; \theta) \right) \left( \frac{\partial}{\partial \theta\_j} \log p(X; \theta) \right) \Big| \theta \right]. \tag{21}$$
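Eq. (21) can be checked numerically as a Monte Carlo average of the squared score function. A minimal sketch, assuming a Gaussian likelihood with known standard deviation (an illustrative choice), for which the analytic Fisher information of the mean is 1/*σ*<sup>2</sup>:

```python
import random

def fisher_info_gaussian_mean(mu, sigma, n_mc=100000, seed=1):
    """Monte Carlo estimate of Eq. (21) for a scalar Gaussian with known
    sigma: the score is d/dmu log p(x; mu) = (x - mu)/sigma^2, and its
    second moment equals the analytic value 1/sigma^2."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_mc):
        x = rng.gauss(mu, sigma)
        score = (x - mu) / sigma**2
        acc += score * score
    return acc / n_mc

est = fisher_info_gaussian_mean(mu=0.0, sigma=2.0)  # analytic value: 0.25
```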

Since performing experiments is often costly, the following optimality criteria have been defined in the literature for linear models:

* *A-optimality*: minimize the trace of the inverse information matrix, that is, the average variance of the parameter estimates;
* *D-optimality*: maximize the determinant of the information matrix, that is, minimize the volume of the confidence ellipsoid;
* *E-optimality*: maximize the minimum eigenvalue of the information matrix;
* *G-optimality*: minimize the maximum variance of the predicted values.

These optimality objectives may include prior distributions of the model and experiment parameters and can be combined with other objectives for model selection.

The main challenge in evaluating the experiment design (20) is the computational complexity involved in searching the whole design space. The search strategies include linearization, local search, discretization, enumeration, approximation by regression or surrogate models, random (e.g., MCMC) sampling, genetic algorithms, as well as using Bayesian optimization methods, as will be discussed in Section 3.4.

*Bayesian Methods and Monte Carlo Simulations DOI: http://dx.doi.org/10.5772/intechopen.108699*

Furthermore, the design (20) yields the single best experiment. In batch-optimum experiment design, *N* experiments are performed simultaneously. The data from different experiments are conditionally independent given the parameters, *θ*. The expected utility from these *N* experiments will be, in general, different from the sum of utilities of individual experiments. A simpler objective is to minimize the predicted variance of observations, *x*, when the experiment designed as a single optimum is repeated *N* times.

In sequential experiment design, the posterior, *p*(*θ*|**x**<sub>*t*</sub>, *d*<sub>*t*</sub>), from the *t*th experiment is used as the prior for the (*t* + 1)th experiment, and then the experiment conditions, *d*<sub>*t*+1</sub>, are optimized. This is a greedy (sub-optimum) approach; however, it may still outperform the batch design due to an inherent adaptation to **x**<sub>*t*</sub> and *d*<sub>*t*</sub>. The optimum sequential experiment design is more complex, and it leads to a dynamic programming problem.
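The posterior-as-prior chaining can be illustrated with a conjugate Beta-Bernoulli model (an illustrative choice, not the chapter's example). The sketch below also verifies that sequential updating reproduces the batch posterior when the experiments are conditionally independent given *θ*:

```python
import random

def update(prior, data):
    """Conjugate Beta-Bernoulli update: the posterior of one experiment
    becomes the prior of the next, as in sequential experiment design."""
    a, b = prior
    return (a + sum(data), b + len(data) - sum(data))

rng = random.Random(42)
theta = 0.7  # unknown parameter generating the (simulated) observations
experiments = [[1 if rng.random() < theta else 0 for _ in range(20)]
               for _ in range(5)]

# sequential chain: p(theta | x_t, d_t) serves as the prior for step t+1
belief = (1, 1)  # Beta(1,1), i.e., a uniform prior
for x_t in experiments:
    belief = update(belief, x_t)

# the chained posterior matches the batch posterior over all data
batch = update((1, 1), [x for exp in experiments for x in exp])
```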

#### **3.2 Bayesian hypothesis testing**

Let *θ*<sub>0</sub> be the critical parameter value, so that the null hypothesis, H<sub>0</sub>, is accepted if the parameter value *θ* < *θ*<sub>0</sub>, and rejected in favor of the alternative hypothesis, H<sub>*a*</sub>, otherwise. Denoting the loss functions as *L*(*θ*; H<sub>0</sub>) and *L*(*θ*; H<sub>*a*</sub>) under the respective hypotheses, the Bayesian risk for an observation, **x**, is computed as

$$R(\mathbf{x}; \theta\_0) = \int\_{\theta < \theta\_0} L(\theta; \mathcal{H}\_0)\, p(\mathbf{x}|\theta) p(\theta)\, d\theta + \int\_{\theta \ge \theta\_0} L(\theta; \mathcal{H}\_a)\, p(\mathbf{x}|\theta) p(\theta)\, d\theta. \tag{22}$$

The optimum decision threshold is then obtained by minimizing the average risk, that is, *θ*<sub>0</sub><sup>∗</sup> = argmin<sub>*θ*<sub>0</sub></sub> *E*<sub>**x**</sub>[*R*(**x**; *θ*<sub>0</sub>)]. The statistical power of the hypothesis test is determined by the sample size [11].
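A minimal numerical sketch of the threshold optimization, assuming a discretized parameter grid, a flat prior, and illustrative piecewise-linear losses (with **x** marginalized out of Eq. (22)); none of these modeling choices come from the text:

```python
def bayes_risk(theta0, grid, prior, loss_h0, loss_ha):
    """Discretized analogue of Eq. (22): average loss when H0 is accepted
    for theta < theta0 and rejected otherwise."""
    return sum(p * (loss_h0(t) if t < theta0 else loss_ha(t))
               for t, p in zip(grid, prior))

grid = [i / 100 for i in range(101)]   # candidate theta values
prior = [1 / len(grid)] * len(grid)    # flat prior (illustrative)

# accepting H0 is costly when theta is actually large, and vice versa;
# both losses vanish at the critical value 0.5
loss_h0 = lambda t: max(0.0, t - 0.5)
loss_ha = lambda t: max(0.0, 0.5 - t)

theta0_opt = min(grid, key=lambda t0: bayes_risk(t0, grid, prior,
                                                 loss_h0, loss_ha))
```

With these losses, the risk-minimizing threshold coincides with the point where the two loss functions cross.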

#### **3.3 Bayesian machine learning**

The general objective of machine learning is to learn how to label unseen data. In supervised learning, this objective is accomplished by training the model with labeled data. For instance, consider the minimum mean square estimation (MMSE) of label *Y* for data **X** [3], that is,

$$f^\*\left(\mathbf{x}\right) = \operatorname\*{argmin}\_f E\left[\left(Y - f(\mathbf{X} = \mathbf{x})\right)^2\right] = E[Y|\mathbf{X} = \mathbf{x}].\tag{23}$$

The training data, (**X**<sub>1</sub>, *Y*<sub>1</sub>), (**X**<sub>2</sub>, *Y*<sub>2</sub>), … , (**X**<sub>*n*</sub>, *Y*<sub>*n*</sub>), are samples from the distribution, *p*(*Y*, **X**) = *p*(**X**|*Y*)*p*(*Y*), assuming the likelihood, *p*(**X**|*Y*), and the prior, *p*(*Y*), of the data labels, *Y*. The MMSE expectation (23) can then be approximated as

$$E[Y|\mathbf{X}=\mathbf{x}] \approx \frac{1}{n} \sum\_{i=1}^{n} Y\_i \, \mathbb{I}(\mathbf{X}\_i=\mathbf{x}) \tag{24}$$

where 𝕀(·) denotes the indicator function.
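For discrete features, the indicator average in Eq. (24) can be sketched as follows. Note that this sketch normalizes by the number of matching samples rather than by *n*, a self-normalizing variant of the same indicator average (an illustrative adjustment so that the result is a conditional mean):

```python
def mmse_estimate(samples, x):
    """Approximate E[Y | X = x]: average the labels Y_i of the training
    pairs whose feature X_i matches the query x (indicator function),
    normalized by the number of matches."""
    matches = [y for xi, y in samples if xi == x]
    return sum(matches) / len(matches) if matches else None

# toy discrete training set of (X_i, Y_i) pairs (hypothetical data)
train = [(0, 1.0), (0, 3.0), (1, 10.0), (1, 14.0), (0, 2.0)]
```

For example, `mmse_estimate(train, 0)` averages the labels 1.0, 3.0, and 2.0 observed for the feature value 0.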

In semi-supervised learning, there are additional unlabeled data, **X**<sub>*n*+1</sub>, **X**<sub>*n*+2</sub>, … , which can be used to estimate the distribution, *p*(**X**) = ∫ *p*(**X**, *Y*) *dY*. This allows formulating and improving the estimates of the conditional expectation, *E*[*Y*|**X** = **x**], and casting it as a missing-labels problem. More generally, automated data labeling involves the problem of finding the discriminative data model, *p*(*Y*|**X**), or *p*(**X**|*Y*). However, the full generative model, *p*(*Y*|**X**)*p*(**X**) = *p*(**X**|*Y*)*p*(*Y*), may be required by some of the Bayesian inference methods outlined in the previous section.

Treating data labels as random variables with prior and posterior distributions is useful for tolerating label errors as well as missing labels. However, the challenge is that the assumed distributions may be biased or even incorrect. In addition to automated data labeling, Bayesian machine learning can generate more training data and improve the quality of training data by correcting the labels.

There is also an interesting connection to causal machine learning. In particular, the label can be assumed to be a cause of the observed data (features) representing the effects. Since the probability, Pr(effect), and the conditional probability, Pr(cause|effect), are not independent, the data samples can be used to estimate their distribution, *p*(**X**), which in turn facilitates anti-causal learning of the causes (i.e., the data labels, *Y*). On the other hand, the probabilities Pr(cause) and Pr(effect|cause) are independent, and so, intuitively, they cannot be exploited for causal learning of data from their labels.

#### **3.4 Bayesian optimization**

Consider the task of minimizing or maximizing a function *f*(*x*) over some high-dimensional feasible set, *A* ⊆ ℝ<sup>*n*</sup>. It is typically assumed that evaluating the function is numerically or otherwise very expensive, whereas testing whether *x* ∈ *A* is computationally cheap. Even though *f* is normally assumed to be continuous, its derivatives are not known. Furthermore, the problem is non-convex: the function has no special structure, it usually contains many local optima, and its observations may be noisy.

The basic idea for solving such a difficult optimization problem is to learn a surrogate approximation of *f* that is cheap to evaluate. The surrogate approximation is constructed in the context of Bayesian inference [12]. In particular, Bayesian optimization starts by evaluating the function *f* at a few randomly chosen points, *x*. The obtained values, *f*(*x*), are assumed to be samples of a random process. A Gaussian process (GP) is most commonly assumed, since it yields closed-form expressions for the extrapolated values, even though inverting large covariance matrices is often numerically problematic. Then, the following steps are repeated as many times as can be practically afforded.

1. Update the GP posterior distribution of *f* using all available samples.
2. Maximize an acquisition function over the feasible set to determine the next sampling point.
3. Evaluate *f* at the selected point and add the new sample to the existing set.

Assuming Bayesian regression over a Gaussian process, the *n* existing samples, *f*(*x*<sub>1</sub>), … , *f*(*x*<sub>*n*</sub>), are normally distributed and have the joint mean, *μ*<sub>0</sub>(*x*<sub>1:*n*</sub>), and the covariance matrix, Σ<sub>0</sub>(*x*<sub>1:*n*</sub>). The corresponding posterior distribution of a candidate sample, *f*(*x*), is also normal with the mean, *μ*<sub>*n*</sub>(*x*), and the variance, *σ*<sub>*n*</sub><sup>2</sup>(*x*).
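A minimal sketch of these posterior computations, assuming a zero prior mean and a squared-exponential covariance kernel (both illustrative choices). The posterior variance collapses at observed points and returns to the prior variance far away from them:

```python
import numpy as np

def gp_posterior(x_train, y_train, x_query, length=1.0, noise=1e-8):
    """Gaussian-process posterior mean mu_n(x) and variance sigma_n^2(x)
    at query points, for a zero prior mean and unit prior variance."""
    # squared-exponential kernel between two 1-D point sets
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = k(x_train, x_query)
    alpha = np.linalg.solve(K, y_train)
    mean = k_star.T @ alpha
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

x_tr = np.array([0.0, 1.0, 2.0])   # sample locations x_1, ..., x_n
y_tr = np.array([0.0, 1.0, 0.0])   # observed values f(x_1), ..., f(x_n)
mu, s2 = gp_posterior(x_tr, y_tr, np.array([1.0, 5.0]))
```

Here `mu[0]` interpolates the training point at *x* = 1, while at *x* = 5 the posterior mean reverts toward zero and the variance toward the prior value of one.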


The acquisition function predicts the value of a new sample after the values, *f*(*x*<sub>1</sub>), … , *f*(*x*<sub>*n*</sub>), are already known. The values of the acquisition function tend to be larger when the calculated credible intervals are wider and when the posterior mean is larger. Since the mean, *μ*<sub>*n*</sub>(*x*), is also a point estimate of *f*(*x*) after *n* observations, the value of *f*(*x*) is estimated by interpolating *μ*<sub>*n*</sub>(*x*); this corresponds to Gaussian regression. In other words, the measured values, *f*(*x*<sub>1</sub>), … , *f*(*x*<sub>*n*</sub>), are interpolated to estimate the mean, *μ*<sub>*n*</sub>(*x*). The covariance matrix, Σ<sub>*n*</sub>(*x*<sub>1:*n*</sub>), determines how fast the function values vary in between the already measured samples. It is usually required that the values de-correlate with their distance. The covariance parameters, *η*, can also be estimated from the existing samples as *η̂* = argmax<sub>*η*</sub> *p*(*f*(*x*<sub>1:*n*</sub>)|*η*) (maximum likelihood estimation, MLE) or as *η̂* = argmax<sub>*η*</sub> *p*(*f*(*x*<sub>1:*n*</sub>)|*η*)*p*(*η*) (maximum a posteriori estimation, MAP). It is also possible to treat some parameters, *η*, as nuisance parameters and marginalize them from the likelihood assuming their prior, *p*(*η*), before employing one of the Bayesian methods for intractable distributions.

Expected improvement (EI) is the most commonly assumed acquisition function. Provided that *f* is observed without measurement noise, the current best value is *f*<sub>*n*</sub><sup>∗</sup> = max{*f*(*x*<sub>1</sub>), … , *f*(*x*<sub>*n*</sub>)}. The expected improvement is then defined as

$$\text{EI}\_n(\mathbf{x}) = E\left[\left[f(\mathbf{x}) - f\_n^\*\right]\_+\right] \tag{25}$$

where [·]<sub>+</sub> = max(·, 0) ≥ 0, and the expectation is taken with respect to the posterior of *f*(*x*) conditioned on the values known so far. The next best choice for sampling the function *f* is

$$\mathbf{x}\_{n+1} = \text{argmax}\_{\mathbf{x}} \text{EI}\_n(\mathbf{x}). \tag{26}$$

Unlike *f*, the function EI<sub>*n*</sub>(*x*) is inexpensive to evaluate, and its derivatives can be used to efficiently solve Eq. (26). The best expected improvement occurs at points far away from the previously evaluated points (points having a large posterior variance) and at points having large posterior means; this represents an exploration–exploitation trade-off.
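Under a normal posterior with mean *μ*<sub>*n*</sub>(*x*) and standard deviation *σ*<sub>*n*</sub>(*x*), the expectation in Eq. (25) has a well-known closed form, EI = (*μ* − *f*<sup>∗</sup>)Φ(*z*) + *σφ*(*z*) with *z* = (*μ* − *f*<sup>∗</sup>)/*σ*, which can be sketched as:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI (Eq. (25)) when f(x) has a normal posterior with
    mean mu and standard deviation sigma, for noise-free maximization."""
    if sigma <= 0.0:
        return max(0.0, mu - f_best)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)      # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))             # Phi(z)
    return (mu - f_best) * cdf + sigma * pdf

# the exploration-exploitation trade-off: EI grows with both the
# posterior mean and the posterior uncertainty
ei_promising = expected_improvement(mu=1.5, sigma=0.1, f_best=1.0)
ei_uncertain = expected_improvement(mu=1.0, sigma=1.0, f_best=1.0)
```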

The knowledge gradient (KG) acquisition function selects the sampling point *x* yielding the largest posterior mean, *μ*<sub>*n*</sub><sup>∗</sup> = max<sub>*x*</sub> *μ*<sub>*n*</sub>(*x*). It considers the posterior across the full domain of *f* and how it is changed by a new sample. The knowledge gradient is computed as

$$\text{KG}\_n(\mathbf{x}) = E\left[ \mu\_{n+1}^\* - \mu\_n^\* \,|\, \mathbf{x}\_{n+1} = \mathbf{x} \right]. \tag{27}$$

The next best sample corresponds to the maximum, that is,

$$\mathbf{x}\_{n+1} = \text{argmax}\_{\mathbf{x}} \text{KG}\_n(\mathbf{x}). \tag{28}$$

An efficient practical implementation of KG can assume multi-start stochastic gradient descent (or ascent, when the target function is to be maximized). This leads to a two-step maximization procedure in which the maximum is first searched among a collection of candidate functions, and then the selected function is maximized separately, for example, by differentiation.

In general, the KG acquisition function tends to significantly outperform the EI method, especially when the function observations are noisy.

The entropy search (ES) acquisition function assumes that the global optimum, *x*<sup>∗</sup>, is a random variable implied by the Gaussian process, *f*(*x*). It searches for a new sample that yields the largest decrease in differential entropy, corresponding to the largest reduction of uncertainty about the global optimum. The ES acquisition function is defined as

$$\text{ES}\_n(\mathbf{x}) = H(P\_n(\mathbf{x}^\*)) - E\_{f(\mathbf{x})}\left[H(P\_n(\mathbf{x}^\*|\mathbf{x}, f(\mathbf{x})))\right] \tag{29}$$

where *H* denotes entropy, so that the posterior across a full domain of *f* and how it is changed by a new sample are again accounted for. This is useful when the observations are noisy. However, unlike the KG method, stochastic gradients cannot be obtained for the ES acquisition function.
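The entropy reduction in Eq. (29) can be illustrated in a fully discrete setting, where it reduces to the mutual information between the optimizer location and a new observation. The prior and likelihoods below are illustrative assumptions, not values from the text:

```python
import math

def entropy(p):
    # Shannon entropy in bits of a discrete distribution
    return -sum(q * math.log2(q) for q in p if q > 0)

def entropy_search_value(prior, lik):
    """Discrete analogue of Eq. (29): expected reduction in the entropy of
    the optimizer location x* from one binary observation y.
    prior[i] = P(x* = i); lik[i] = P(y = 1 | x* = i)."""
    p_y1 = sum(p * l for p, l in zip(prior, lik))
    post_y1 = [p * l / p_y1 for p, l in zip(prior, lik)]
    post_y0 = [p * (1 - l) / (1 - p_y1) for p, l in zip(prior, lik)]
    expected_post_entropy = (p_y1 * entropy(post_y1)
                             + (1 - p_y1) * entropy(post_y0))
    return entropy(prior) - expected_post_entropy

# an informative observation reduces uncertainty about x*; an
# uninformative one (identical likelihoods) reduces it by zero
gain_informative = entropy_search_value([1/3, 1/3, 1/3], [0.9, 0.5, 0.1])
gain_useless = entropy_search_value([1/3, 1/3, 1/3], [0.5, 0.5, 0.5])
```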

Finally, the predictive entropy search (PES) acquisition function rewrites Eq. (29) using mutual information. Conceptually, it is exactly the same as the ES method; however, the numerical properties of the two algorithms are different.

The basic optimization problem can be augmented by considering a set of objectives, *f*(*x*, *s*), indexed by "fidelity," *s*, such that lower values of *s* mean higher fidelity, and *f*(*x*, 0) ≡ *f*(*x*) represents the original objective. For example, fidelity can represent the model order or the modeling granularity. There is also a cost, *c*(*s*), associated with fidelity, *s*; the higher the required fidelity, the larger the cost. These problems are referred to as multi-fidelity source evaluation in the literature. The task is to maximize the function, *f*(*x*, 0), by observing a sequence of values of *f*(*x*, *s*) at *n* points, that is, (*x*<sub>1</sub>, *s*<sub>1</sub>), … , (*x*<sub>*n*</sub>, *s*<sub>*n*</sub>), subject to the total available budget, Σ<sub>*i*=1</sub><sup>*n*</sup> *c*(*s*<sub>*i*</sub>) ≤ *C*<sub>total</sub>.

This problem can be further generalized by assuming that neither *f*(*x*, *s*) nor *c*(*s*) is monotonic in *s*; for example, keeping *s* constant, the observed *f*(*x*, *s*) across different regions of *x* can have varying accuracy.

The problem of random environmental conditions is closely related to multi-task Bayesian optimization. It maximizes the function ∫ *f*(*x*, *w*)*p*(*w*)*dw* by evaluating *f*(*x*, *w*) for each *w* at multiple values of *x*. It is assumed that evaluating *f*(*x*, *w*) at both *x* and *w* is expensive, but evaluating *p*(*w*) is cheap. The values *w* represent random environmental conditions, and they act as noise in the observations of *f*(*x*, *w*). For example, *f* can represent the average performance of a machine learning model with *w*-fold cross-validation. Another example is Bayesian neural networks having random coefficients with a certain probability distribution.

Comparing Bayesian optimization with other optimization methods, the latter usually work well only for specific problems or under specific conditions, have a high computational cost, and suffer from the curse of dimensionality, leading to low sample efficiency and slow convergence. Bayesian optimization, on the other hand, can work very well for moderate dimensionality (a maximum dimension of about 20 is recommended in the literature), and its sampling decisions appear to be optimum despite being sequential. In addition, Bayesian optimization is a derivative-free, black-box method.

Despite good performance, there is a need to provide a better theoretical understanding of Bayesian optimization methods, define stopping rules, and improve the learning of surrogate models, especially for non-Gaussian processes and for systems containing non-linear transformations. Bayesian optimization can be used to efficiently find hyperparameters of machine learning models [13]. Recently, maximization of a composite function, *g*(*h*(*x*)), has been considered in the literature, where the inner function, *h*, is expensive to evaluate and a black box, whereas the outer function, *g*, is easy to evaluate.
