**Meet the editor**

Javier Prieto received his PhD in Information and Communications Technology, together with the Extraordinary Performance Award for Doctorate Studies, from the University of Valladolid (Spain) in 2012. Since 2015, he has been a lecturer and researcher at the University of Salamanca (Spain). Previously, he was with the University of Valladolid from 2009 to 2014 and with a Spanish technological center from 2007 to 2009. In 2010, he was a visiting researcher at the Massachusetts Institute of Technology (MIT), USA. Dr. Prieto serves as an associate editor for various journals. His research interests include artificial intelligence for smart cities, navigation for indoor environments, and Bayesian inference for dynamic systems.

## Contents

### **Preface XIII**

Hamid El Maroufy, El Houcine Hibbah, Abdelmajid Zyad and Taib Ziad

Chapter 8 **Bayesian Model Averaging and Compromising in Dose-Response Studies 167**
Steven B. Kim

Chapter 9 **Two Examples of Bayesian Evidence Synthesis with the Hierarchical Meta-Regression Approach 189**
Pablo Emilio Verde

Chapter 10 **Bayesian Modeling in Genetics and Genomics 207**
Hafedh Ben Zaabza, Abderrahmen Ben Gara and Boulbaba Rekik

Chapter 11 **Bayesian Two-Stage Robust Causal Modeling with Instrumental Variables using Student's t Distributions 221**
Dingjing Shi and Xin Tong

Chapter 12 **Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology and Social Sciences 235**
Alonso Ortega and Gorka Navarrete

**Section 3 Applications of Bayesian Inference in Engineering 255**

Chapter 13 **Bayesian Inference and Compressed Sensing 257**
Solomon A. Tesfamicael and Faraz Barzideh

Chapter 14 **Sparsity in Bayesian Signal Estimation 279**
Ishan Wickramasingha, Michael Sobhy and Sherif S. Sherif

Chapter 15 **Dynamic Bayesian Network for Time-Dependent Classification Problems in Robotics 299**
Cristiano Premebida, Francisco A. A. Souza and Diego R. Faria

**Section 4 Applications of Bayesian Inference in Economics 311**

Chapter 16 **A Bayesian Model for Investment Decisions in Early Ventures 313**
Anamaria Berea and Daniel Maxwell

Chapter 17 **Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete Information 325**
Claudia Ceci and Katia Colaneri

Chapter 18 **Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks 349**
Sien Chen, Wenqiang Huang, Mengxi Chen, Junjiang Zhong and Jie Cheng

## Preface

The range of Bayesian inference algorithms and their applications has expanded greatly since Stanley F. Schmidt's first implementation of a Kalman filter for the Apollo program. Extended Kalman filters, unscented Kalman filters, particle filters, and belief condensation filters are just some examples of these algorithms, which have been applied to logistics, medical services, search and rescue operations, and automotive safety, among other fields. The essence of these algorithms is to explain how we should update our existing beliefs in the light of new evidence. Stephen Senn defined a Bayesian as "one who, vaguely expecting a horse and catching a glimpse of a donkey, strongly concludes he has seen a mule."

From the Bayesian perspective, both the parameter to estimate and the observations are random variables, in contrast to the frequentist approach, where the parameter to estimate is an unknown deterministic value. This Bayesian point of view leads to a common resolution framework in which what we infer is a density function of the parameter conditioned on the observation. In this context, the task is to determine the posterior distribution of the desired state from the knowledge of the prior and the likelihood, using Bayes' rule. This setting can be modeled with a hidden Markov model (HMM), where the sequence of variables of interest is called the hidden states and the sequence from which one can obtain realizations is called the observations.

The Achilles' heel of Bayesian inference is the implementation of algorithms for solving nonlinear and/or non-Gaussian system models, where the traditional Kalman approach is inaccurate. Optimal algorithms exist in some restricted cases, while researchers have proposed many practical implementations of Bayesian inference that rely on approximations. In addition, the inclusion of relevant prior information has been deeply addressed in the literature, where there is no consensus about the existence of a truly non-informative prior. The choice of appropriate prior information has extensively nourished research on Bayesian inference, and it is still today an element of intense discussion.

This book takes a look at both the theoretical foundations of Bayesian inference and practical implementations in the fields of life sciences, engineering, and economics. The book is organized into four sections according to this twofold perspective.

**Section 1** is dedicated to the theoretical foundations of Bayesian inference.

In Chapter 1, the authors review the components of Bayesian inference and their application to game theory. The chapter includes definitions of the former (prior, likelihood, and posterior distributions) and comprises several examples of the latter (Bayesian and fuzzy games).

In Chapter 2, the authors review methods for checking the modeling assumptions at each node of a directed acyclic graph that represents the understanding of the underlying structure of a problem. The chapter shows how nodes in a graph correspond to data or parameters and directed edges between parameters correspond to conditional distributions.

In Chapter 3, the author presents a Bayesian classification method, the associated Bayes error, and its relationship with other measures, and proposes an algorithm to determine a prior probability that reduces the Bayes error. The chapter applies the proposed algorithm in three domains (biology, medicine, and economics) through specific problems.

In Chapter 4, the author introduces the problem of high-dimensional multiple hypothesis testing using the Bayesian approach. The chapter demonstrates the practical application of the Bayesian decision theoretic approach by means of a real example of directional hypothesis testing with skewed alternatives that uses gene expression data.

In Chapter 5, the author provides formal criteria to determine the adequate sample size in the design of experiments based on combined frequentist-Bayesian or fully Bayesian approaches. The chapter defines four power functions for sample size calculations, of which the Bayesian predictive power is the one that allows the most flexibility.

In Chapter 6, the author addresses the problem of converting graphical relationships into conditional probabilities in order to construct a simple Bayesian network from a graph. The chapter applies this research in a learning context in which a Bayesian network is used to assess students' knowledge.

**Section 2** is dedicated to the applications of Bayesian inference in life sciences.

In Chapter 7, the authors propose a first-order autoregressive hidden Markov model as a suitable model to characterize a marker of breast cancer disease progression. The chapter shows how this model captures the complexity and the dynamics of the evolution of breast cancer by introducing latent states, and how their transition probabilities make it possible to evaluate the efficacy of a treatment.

In Chapter 8, the author deals with the limitations of the Bayesian framework applied to dose-response studies relying on small samples and sparse data. The chapter addresses three practical issues in small-sample dose-response studies: model sensitivity, disagreement in prior knowledge, and conflicting perspectives in decision rules.

In Chapter 9, the author illustrates the application of Bayesian inference to two meta-analyses in medical research when indirect evidence is available for analysis. The chapter presents the hierarchical meta-regression method for meta-analysis, an integrated approach for evidence synthesis when a multiplicity of biases, coming from indirect and disparate evidence, has to be incorporated.

In Chapter 10, the authors provide a review of statistical methods applied to animal and plant breeding programs with a particular focus on Bayesian methods. The chapter illustrates the flexibility of the Bayesian approaches and their high accuracy in predicting breeding values.

In Chapter 11, the authors propose four types of two-stage least squares models with instrumental variables to model normal and non-normal data in causal inference research. The chapter evaluates the performance of the robust method using Student's t-distributions in the four distributional 2SLS models by means of Monte Carlo simulation.

In Chapter 12, the authors present Bayesian hypothesis testing and its application to psychology and the social sciences as an alternative to traditional frequentist null hypothesis significance testing (NHST). The chapter shows the advantages of this Bayesian approach over frequentist NHST, providing examples that support its use.

**Section 3** is dedicated to the applications of Bayesian inference in engineering.


In Chapter 13, the authors motivate the use of Bayesian inference in compressed sensing, a signal processing method. The chapter provides three use cases of its applicability: magnetic resonance imaging, remote sensing, and wireless communication systems, specifically multiple-input multiple-output (MIMO) systems.

In Chapter 14, the authors present different methods to estimate an unknown signal from its linear measurements where the number of measurements is less than the dimension of the unknown signal. The chapter introduces the concept of signal sparsity and describes how it could be used as prior information for either regularized least squares or Bayesian signal estimation.

In Chapter 15, the authors explore the use of dynamic Bayesian networks (DBN) for time-dependent classification problems in mobile robotics and present some experiments in semantic place recognition and daily activity classification. The chapter formulates the DBN as a time-dependent classification problem and gives a general expression for a DBN in terms of classifier priors and likelihoods through the time steps.

**Section 4** is dedicated to the applications of Bayesian inference in economics.

In Chapter 16, the authors propose a new Bayesian model to aid the investment decision in early-stage start-ups and ventures, informed by previous academic literature on entrepreneurship and venture capital investment practices. The chapter assesses this model in an anonymized experiment where reviewers with previous experience in entrepreneurship and/or investment scored a list of 20 anonymous real companies.

In Chapter 17, the authors present some results about nonlinear filtering for Markovian partially observable systems where the state and the observation processes are described by jump diffusions with correlated Brownian motions and common jump times. The chapter applies this theory to the financial problem of derivatives hedging for a trader who has limited information on the market.

In Chapter 18, the authors illustrate how a Bayesian belief network (BBN) can enable airlines to optimize customer loyalty based on dynamic recommendations of relevant content from predictions of passengers' choices. The chapter establishes BBN models using the use case of China Southern Airlines with real transaction data, including passengers' basic information, historical decision options, and purchase characteristics.

Overall, this book is intended as an introductory guide for the application of Bayesian inference in the fields of life sciences, engineering, and economics, as well as a source of fundamentals for intermediate Bayesian readers.

> **Dr. Javier Prieto Tejedor** University of Salamanca, Salamanca, Spain

**Theoretical Foundations of Bayesian Inference**

### **Chapter 1**

### **Bayesian Inference Application**

Wiyada Kumam, Plern Saipara and Poom Kumam

Additional information is available at the end of the chapter

DOI: 10.5772/intechopen.70530

**Abstract**

In this chapter, we introduce the concept of Bayesian inference and its application to real-world problems such as game theory (Bayesian games). The chapter is organized as follows. In Sections 2 and 3, we present model-based Bayesian inference and the components of Bayesian inference, respectively. The last section contains some applications of Bayesian inference.

Keywords: statistical inference, frequentist inference, Bayesian inference

### 1. Introduction

In statistical inference, there are two main interpretations of probability: frequentist (or classical) inference and Bayesian inference. They differ in the basic nature they assign to probability. Classical inference defines probability as the limit of an event's relative frequency in a large number of trials, and it applies only to well-defined random experiments. Bayesian inference, on the other hand, can assign probabilities to statements even when no random process is involved: in the Bayesian sense, probability is a way to express an individual's degree of belief in a statement. These different interpretations of probability lead to different approaches to inference. Bayes' theorem relates two conditional probabilities that are the reverse of each other. The name Bayes' theorem honors Reverend Thomas Bayes, and the result is also referred to as Bayes' law (see [1]). The theorem expresses the conditional (posterior) probability of an event A after B is observed in terms of the prior probability of A, the prior probability of B, and the conditional probability of B given A. It is valid in all interpretations of probability, and it prescribes how to revise probability statements using data. Bayes' law (or Bayes' rule) is

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.\tag{1}$$

By the definition of conditional probability,

$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A). \tag{2}$$

For example, suppose a die is thrown under a dice cup. Under the standard model, all outcomes have probability 1/6. Now the cup is lifted slightly, and a randomly chosen corner of the upper face becomes visible: it contains a dot. The new probability distribution of the outcomes is obtained as follows. Let Ai be the event that the throw shows i, for i = 1, 2, 3, 4, 5, 6, and let B be the event that the randomly chosen corner contains a dot. Then P(Ai) = 1/6 and P(B) = 2/3, and we obtain the following table:

| Ai | P(Ai) | P(B\|Ai) | P(Ai ∩ B) | P(Ai\|B) |
|----|-------|----------|-----------|----------|
| A1 | 1/6 | 0 | 0 | 0 |
| A2 | 1/6 | 1/2 | 1/12 | 1/8 |
| A3 | 1/6 | 1/2 | 1/12 | 1/8 |
| A4 | 1/6 | 1 | 1/6 | 1/4 |
| A5 | 1/6 | 1 | 1/6 | 1/4 |
| A6 | 1/6 | 1 | 1/6 | 1/4 |

The simplest way to construct the fourth column is to multiply P(Ai) by P(B|Ai); the fifth column follows by dividing each entry of the fourth column by their sum. This final step is called scaling and corresponds to the formula

$$\sum\_{i=1}^{6} P(B|A\_i)P(A\_i) = \sum\_{i=1}^{6} P(A\_i \cap B) = P(B).$$

A simpler argument is that P(Ai|B) must be a probability distribution and therefore sum to unity. Since the scaling operation is trivial, Bayes' rule is often written as

$$P(A|B) \propto P(A)P(B|A)$$

where P(A) is the prior (distribution), P(B|A) is the likelihood, and P(A|B) is the posterior (distribution).
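As a quick worked check (not part of the original chapter; the setup mirrors the dice-and-cup example above), Bayes' rule can be applied mechanically in Python, using exact fractions so the results match the table:

```python
from fractions import Fraction as F

# Prior: a fair die, P(A_i) = 1/6 for each outcome i.
prior = {i: F(1, 6) for i in range(1, 7)}

# Likelihood P(B | A_i): the chance that a randomly chosen corner of the
# upper face carries a dot (faces 2 and 3 have 2 dotted corners out of 4;
# faces 4, 5 and 6 have all 4 corners dotted; face 1 has none).
likelihood = {1: F(0), 2: F(1, 2), 3: F(1, 2), 4: F(1), 5: F(1), 6: F(1)}

# Joint terms P(A_i)P(B|A_i); their sum is the scaling constant P(B).
joint = {i: prior[i] * likelihood[i] for i in prior}
p_b = sum(joint.values())

# Posterior P(A_i | B) by Bayes' rule.
posterior = {i: joint[i] / p_b for i in joint}

print(p_b)                         # 2/3
print(posterior[2], posterior[4])  # 1/8 1/4
```

The "scaling" step of the text is the single division by `p_b`; the posterior sums to unity by construction.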

The main result of Bayesian statistics is that statistical inference may rest on the simple device posterior ∝ prior ∗ likelihood. The dice-throwing example is not controversial; the disputes concern the possibility of using Bayes' rule as

$$P(Truth|Data) = \frac{P(Data|Truth)P(Truth)}{P(Data)}.\tag{3}$$

So, we get

$$P(Truth) = \text{the } prior.\tag{4}$$

The second ingredient we need is data, plus a description of how the data relate to the truth, which is nothing but the classical concept of specifying a random relationship

$$P(Data|Truth) = \text{the likelihood} \tag{5}$$

for all relevant values of Truth. Note that P(Data|Truth) is not used as a probability distribution over different data, but as the probability of the given data for different values of Truth. Various authors use the term likelihood for P(Data|Truth) precisely to avoid this misunderstanding.

Now, writing T for Truth, the probability of the data, P(Data), can be written as

$$P(Data) = \int P(T)P(Data|T)dT\tag{6}$$

that is, as a function of P(T) and P(Data|T). It is then clear that the prior and the likelihood enable us, using (1), to construct a new probability statement about T given the data:

$$P(Truth|Data) = \text{the posterior}.\tag{7}$$
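The pieces in Eqs. (4)–(7) can be assembled numerically. The sketch below is a hypothetical coin-bias example (not from the original text): it approximates the integral in Eq. (6) with a Riemann sum on a grid and normalizes prior × likelihood into a posterior:

```python
import math

# Hypothetical setup: Truth T is the unknown bias of a coin, with a flat
# prior P(T) = 1 on [0, 1]; the Data are 7 heads in 10 tosses.
n, k = 10, 7

def prior(t):
    return 1.0                                        # Eq. (4): the prior

def lik(t):
    return math.comb(n, k) * t**k * (1 - t)**(n - k)  # Eq. (5): the likelihood

# Eq. (6): P(Data) = integral of P(T) P(Data|T) dT, via a midpoint Riemann sum.
m = 10_000
grid = [(i + 0.5) / m for i in range(m)]
p_data = sum(prior(t) * lik(t) for t in grid) / m

# Eq. (7): the posterior density at any t.
def post(t):
    return prior(t) * lik(t) / p_data

print(round(p_data, 4))  # ≈ 0.0909, i.e. 1/11, the exact value for a flat prior
```

Since the posterior is just prior × likelihood divided by `p_data`, it integrates to 1 on the same grid by construction.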

The purpose of this chapter is to introduce the concept of Bayesian inference and its application to real-world problems such as game theory (Bayesian games). The chapter is organized as follows. In Sections 2 and 3, we present model-based Bayesian inference and the components of Bayesian inference, respectively. The last section contains some applications of Bayesian inference.

### 2. Model-based Bayesian inference

Bayesian inference is built on Bayes' theorem. Replacing, in (1), B with the observations y, A with the parameter set Θ, and probabilities P with densities p gives

$$p(\Theta|y) = \frac{p(y|\Theta)p(\Theta)}{p(y)}\tag{8}$$

where p(y) is the marginal likelihood of y, p(Θ) is the prior distribution of the parameter set Θ before y is observed, p(y|Θ) is the likelihood of y under a model, and p(Θ|y) is the joint posterior distribution of Θ, which expresses uncertainty about the parameter set Θ after taking both the prior and the data into account. Because there are often multiple parameters, Θ denotes a set of j parameters:

$$
\Theta = \{\theta_1, \theta_2, \dots, \theta_j\}.
$$

The term


$$p(y) = \int p(y|\Theta)p(\Theta)d\Theta\tag{9}$$

defines the marginal likelihood (or prior predictive distribution) of y, which was introduced by Jeffreys [2] and may be set to c, where c is an unknown constant. This distribution shows what y should look like given the model, before y has been observed. Only the prior probabilities and the model's likelihood function are used to compute p(y). The presence of p(y) normalizes the joint posterior distribution p(Θ|y), guaranteeing that it is a proper distribution that integrates to 1. Replacing p(y) with a constant of proportionality c, Bayes' theorem becomes

$$p(\Theta|y) = \frac{p(y|\Theta)p(\Theta)}{c}.\tag{10}$$

We get

$$p(\Theta|y) \propto p(y|\Theta)p(\Theta) \tag{11}$$

where ∝ means "is proportional to".

Formulation (11) states that the unnormalized joint posterior is proportional to the likelihood multiplied by the prior. However, the aim of modeling is often not the unnormalized joint posterior distribution itself, but the marginal distributions of the parameters. The set of all Θ can be partitioned as

$$\Theta = \{\Phi, \Lambda\}\tag{12}$$

where Φ denotes the sub-vector of interest and Λ the complementary sub-vector of Θ, usually referred to as a vector of nuisance parameters. From a Bayesian standpoint, the presence of nuisance parameters does not pose any formal, theoretical problems. A nuisance parameter is a parameter that exists in the joint posterior distribution of a model but is not a parameter of interest. The marginal posterior distribution of φ, the parameter of interest, can be written as

$$p(\phi|y) = \int p(\phi, \Lambda|y)d\Lambda. \tag{13}$$

In model-based Bayesian inference, Bayes' theorem is applied to obtain the unnormalized joint posterior distribution, and finally the user can evaluate the model and make inferences from the marginal posterior distributions.
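Equation (13) can be illustrated with a small grid computation. The sketch below uses hypothetical data and flat priors on a bounded grid (assumptions not in the original text): the mean μ of normal data plays the role of the parameter of interest φ, and the standard deviation σ is the nuisance parameter Λ that gets integrated out:

```python
import math

# Hypothetical data y ~ N(mu, sigma^2); mu is of interest, sigma is nuisance.
y = [4.8, 5.1, 5.6, 4.9, 5.3]   # sample mean = 5.14

def log_lik(mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

# Flat priors on a bounded grid (an assumption made for illustration).
mus = [3 + i * 0.01 for i in range(400)]        # mu in [3, 7)
sigmas = [0.05 + j * 0.01 for j in range(300)]  # sigma in [0.05, 3.05)

# Unnormalized joint posterior p(mu, sigma | y) on the grid (Eq. 11).
joint = [[math.exp(log_lik(m, s)) for s in sigmas] for m in mus]

# Eq. (13): marginal posterior of mu by summing (integrating) over sigma.
marg = [sum(row) for row in joint]
z = sum(marg)
marg = [v / z for v in marg]                    # normalize over the grid

mu_hat = mus[marg.index(max(marg))]
print(round(mu_hat, 2))                         # mode lands at the sample mean
```

The marginal posterior of μ is symmetric about the sample mean here, so its mode coincides with the grid point closest to 5.14.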

### 3. The components of Bayesian inference

In this section, we present the components of Bayesian inference, namely the prior distribution, the likelihood (or likelihood function), and the joint posterior distribution.


### 3.1. Prior distribution


The prior distribution is a central concept of Bayesian statistics: it represents the information about an uncertain parameter set Θ that is combined with the probability distribution of new data to yield the posterior distribution, which in turn is used for future inferences and decisions about Θ. The existence of a prior distribution for any problem can be justified by axioms of decision theory; here we focus on how to set up a prior distribution for a given application. In general, Θ is a vector, but for simplicity we write p(Θ).

With well-identified models and large sample sizes, reasonable alternative choices of p(Θ) have minor effects on posterior inferences. This statement may look circular, but in practice one can check the dependence on p(Θ) by a sensitivity analysis: comparing posterior inferences under different reasonable choices of p(Θ).
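Such a sensitivity analysis is easy to sketch with a conjugate model (a hypothetical illustration, not from the text): with a Beta(a, b) prior and k successes in n Bernoulli trials, the posterior is Beta(a + k, b + n − k), so posterior summaries under different priors can be compared directly:

```python
# Posterior mean of a Bernoulli parameter under several Beta priors,
# for a small and a large sample with the same success rate (0.7).
priors = {
    "flat Beta(1, 1)":          (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)":  (0.5, 0.5),
    "informative Beta(20, 20)": (20.0, 20.0),
}

def posterior_mean(a, b, k, n):
    # Conjugate update: Beta(a, b) prior -> Beta(a + k, b + n - k) posterior.
    return (a + k) / (a + b + n)

small = {name: posterior_mean(a, b, 7, 10) for name, (a, b) in priors.items()}
large = {name: posterior_mean(a, b, 700, 1000) for name, (a, b) in priors.items()}

for name in priors:
    print(f"{name}: n=10 -> {small[name]:.3f}, n=1000 -> {large[name]:.3f}")

# With n = 10 the prior choice moves the posterior mean noticeably;
# with n = 1000 all three priors give nearly the same answer.
spread_small = max(small.values()) - min(small.values())
spread_large = max(large.values()) - min(large.values())
print(round(spread_small, 3), round(spread_large, 3))
```

The shrinking spread across priors as n grows is exactly the large-sample robustness claimed in the text.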

If the sample size is small, or the available data provide only indirect information about the parameters of interest, then p(Θ) becomes more important. In many cases, however, models can be set up hierarchically, so that clusters of parameters share a common p(Θ) that can itself be estimated from data. Prior probability distributions have traditionally been divided into two kinds: informative and uninformative priors. In this section, four kinds of priors (informative, weakly informative, least informative, and uninformative) are presented according to the information available and the aim in using the prior.

#### 3.1.1. Informative prior

If prior information about Θ is available, it should be included in p(Θ). If the current model is analogous to a previous model, and the current model is intended to be an updated version based on more recent data, then the posterior distribution of Θ from the previous model may be used as p(Θ) for the current model.

In this way, each version of a model does not start from scratch, based only on the current data; instead, the cumulative effect of all data, past and current, can be taken into account. To ensure the current data do not dominate the prior, Ibrahim and Chen [3] introduced in 2000 the power prior, a class of informative prior distributions that takes earlier data and results into account. If the current data are very similar to the previous data, then the precision of the posterior distribution increases when more information from previous models is included. If the current data differ dramatically, then the posterior distribution of Θ may lie in the tails of the prior distribution for Θ, so that p(Θ) contributes less density in its tails.
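A minimal sketch of the power prior in the conjugate Bernoulli case (the numbers are hypothetical; the construction follows the general form p(θ|D0, a0) ∝ L(θ|D0)^a0 p0(θ), where a0 ∈ [0, 1] discounts the historical data):

```python
# Power prior for a Bernoulli parameter with a Beta(1, 1) initial prior:
# historical data enter through the likelihood raised to a power a0,
# which keeps the model conjugate (Binomial likelihood, Beta prior).
a, b = 1.0, 1.0      # initial prior Beta(a, b)
k0, n0 = 60, 100     # historical data: 60 successes in 100 trials
k, n = 8, 20         # current data: 8 successes in 20 trials

def posterior_params(a0):
    """Beta parameters of the posterior when history is discounted by a0."""
    return (a + a0 * k0 + k, b + a0 * (n0 - k0) + (n - k))

means = {}
for a0 in (0.0, 0.5, 1.0):
    ap, bp = posterior_params(a0)
    means[a0] = ap / (ap + bp)
    print(f"a0 = {a0}: posterior mean = {means[a0]:.3f}")

# a0 = 0 ignores the historical data entirely; a0 = 1 pools them fully,
# pulling the posterior mean from the current rate (0.4) toward the
# historical rate (0.6).
```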

Sometimes informative prior knowledge is not ready to be used, for example when it resides with another person, such as an expert. In that case, the person's beliefs about the probability of the event must be elicited in the form of a suitable probability density function; this process is called prior elicitation.

#### 3.1.2. Weakly informative prior

A weakly informative prior (WIP) uses prior information for regularization and stabilization, providing enough prior information to prevent results that contradict our knowledge, such as an algorithmic failure to explore the state space. Another aim of WIPs is to use less prior information than is actually available. A WIP should provide some of the benefits of prior information while avoiding some of the risk of using information that does not exist. WIPs are the most common priors in practice and are favored by subjective Bayesians.

Selecting a WIP can be tricky. A WIP distribution should change with the sample size, because the model should have enough prior information to proceed, but the prior information must also be weak enough for the model to learn from the data.

The following is an example of a WIP in practice. It is common, for good reasons, to center and scale all continuous predictors [4]. Though centering and scaling predictors is not discussed here, it should be clear that the potential range of the posterior distribution of θ for a centered and scaled predictor should be small. A common WIP for a centered and scaled predictor is θ ∼ N(0, 10,000), where θ is normally distributed with a mean of 0 and a variance of 10,000. Here, the density for θ is nearly flat. Nonetheless, the fact that it is not perfectly flat yields good properties for numerical estimation algorithms. In both Bayesian and frequentist inference, numerical estimation algorithms can become stuck in regions of flat density, which become more common as sample size decreases or model complexity increases. Numerical estimation algorithms in frequentist inference behave as though a flat prior were used, so they become stuck more frequently than their Bayesian counterparts. Prior distributions that are not completely flat provide enough information for the numerical estimation algorithm to continue to explore the target density, the posterior distribution.
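A minimal sketch of how such a nearly flat prior behaves in a conjugate normal-mean model with known data variance (the sample statistics below are illustrative, not from the chapter):

```python
# Conjugate update for y_i ~ N(theta, sigma2) with theta ~ N(prior_mean, prior_var).
# With the weakly informative prior N(0, 10000), the posterior is pulled toward 0
# only negligibly, yet the prior is not perfectly flat.
def posterior_normal_mean(ybar, n, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of theta (standard precision-weighted formulas)."""
    precision = 1.0 / prior_var + n / sigma2
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + n * ybar / sigma2)
    return post_mean, post_var

# Illustrative data summary: sample mean 2.5, n = 50, known data variance 4.
wip_mean, wip_var = posterior_normal_mean(
    ybar=2.5, n=50, sigma2=4.0, prior_mean=0.0, prior_var=10_000.0
)
print(wip_mean, wip_var)
```

The posterior mean is shrunk toward 0 by far less than one part in a thousand, illustrating that the WIP regularizes without overwhelming the data.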

After updating a model in which WIPs were used, the user should investigate the posterior. If the posterior contradicts knowledge, then the WIPs must be revised by including information that makes the posterior consistent with knowledge [4]. A common criticism of WIPs is that there is no precise mathematical form for deriving the optimal WIP for a given model and data.

#### 3.1.2.1. Vague priors

A vague prior, also called a diffuse prior, is difficult to define after considering WIPs. In 2005, Lambert, Sutton, Burton, Abrams and Jones introduced the first formal move away from vague priors toward WIPs. After conjugate priors were introduced by Raiffa and Schlaifer in 1961, most applied Bayesians used vague priors, parameterized to approximate the idea of an uninformative prior.

Typically, a vague prior is a conjugate prior with a large scale parameter. However, if the sample size is small, then vague priors can be problematic. Most problems with vague priors and small sample sizes involve scale rather than location, and the problem can be particularly acute in random-effects models, a term used rather loosely here to cover exchangeable, hierarchical and multilevel structures. A vague prior is thus commonly a conjugate prior intended to approximate an uninformative prior, without the two goals of regularization and stabilization.
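A small numerical sketch (with made-up numbers) of why vague scale priors are touchy with small samples: in the conjugate inverse-gamma model for a normal variance, two supposedly "vague" hyperparameter choices give visibly different posteriors when n is tiny, but essentially identical posteriors when n is large.

```python
# y_i ~ N(0, sigma2) with conjugate prior sigma2 ~ Inverse-Gamma(a, b).
# The posterior is Inverse-Gamma(a + n/2, b + ss/2), where ss = sum of y_i^2.
def post_mean_sigma2(ss, n, a, b):
    """Posterior mean of sigma2 (exists when a + n/2 > 1)."""
    return (b + ss / 2) / (a + n / 2 - 1)

ss_small, n_small = 3.0, 3      # tiny sample, empirical variance 1
ss_large, n_large = 100.0, 100  # larger sample, same empirical variance

# Two 'vague' priors that are supposedly interchangeable:
small_gap = abs(post_mean_sigma2(ss_small, n_small, 0.001, 0.001)
                - post_mean_sigma2(ss_small, n_small, 0.1, 0.1))
large_gap = abs(post_mean_sigma2(ss_large, n_large, 0.001, 0.001)
                - post_mean_sigma2(ss_large, n_large, 0.1, 0.1))
print(small_gap, large_gap)
```

With n = 3 the two "vague" choices disagree noticeably about the posterior mean of the variance, whereas with n = 100 the disagreement is negligible: the data, not the prior, dominate.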

### 3.1.3. Least informative prior


The term least informative prior (LIP) is used here to describe a class of priors in which the aim is to minimize the amount of subjective information content and to use a prior that is determined only by the model and the observed data. The rationale for using LIPs is often described as letting the data speak for themselves. LIPs are preferred by objective Bayesians. They include flat priors [12], hierarchical priors [4], Jeffreys priors [2], maximum-entropy (MAXENT) priors [5] and reference priors [6–8].

### 3.1.4. Uninformative prior

Traditionally, many of the prior distributions described above were classified as uninformative priors. However, uninformative priors do not really exist (see [9]), and all priors are informative in some way. Moreover, various names have been associated with uninformative priors, including diffuse, minimal, non-informative, objective, reference, uniform, vague and, perhaps, weakly informative.

### 3.1.5. Proper and improper priors

It is important for the prior distribution to be proper. A prior distribution p(θ) is improper when ∫ p(θ)dθ = ∞.

As noted above, an unbounded uniform prior distribution is an improper prior distribution, since p(θ) ∝ 1 for θ ∈ (−∞, ∞). An improper prior distribution may lead to an improper posterior distribution, and if the posterior distribution is improper, then inferences are invalid.

To determine the propriety of a joint posterior distribution, check that the marginal likelihood is finite for all y. Recall that the marginal likelihood is p(y) = ∫ p(y|Θ)p(Θ)dΘ. Although improper prior distributions can sometimes be used, it is good practice to avoid them.
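This check can be sketched numerically: with a proper N(0, 100) prior on the mean of N(θ, 1) data, the marginal likelihood computed by simple quadrature is finite and positive, so the posterior is proper. The data values below are illustrative.

```python
# Numerical sketch: the marginal likelihood p(y) = ∫ p(y|θ) p(θ) dθ for
# y_i ~ N(θ, 1) with the proper prior θ ~ N(0, 100).
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

y = [1.2, 0.7, 1.9, 1.1]  # illustrative data

def integrand(theta):
    like = 1.0
    for yi in y:                                   # likelihood: product over observations
        like *= normal_pdf(yi, theta, 1.0)
    return like * normal_pdf(theta, 0.0, 100.0)    # times the (proper) prior density

# Simple Riemann sum over a wide grid covering essentially all the mass.
grid = [-50 + 0.01 * k for k in range(10001)]
marginal = sum(integrand(t) for t in grid) * 0.01
print(marginal)
```

Since the integral is finite and positive, p(Θ|y) = p(y|Θ)p(Θ)/p(y) is a proper density.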

### 3.2. Likelihood

For a Bayesian model to be complete, both the prior distributions and the likelihood must be estimated or fully specified. The likelihood, p(y|Θ), contains the available information provided by the sample. For conditionally independent observations it factors as

$$p(y \mid \Theta) = \prod_{i=1}^{n} p(y_i \mid \Theta).$$

The data y affect the posterior distribution p(Θ|y) only through the likelihood p(y|Θ). In this way, Bayesian inference obeys the likelihood principle, which states that, for a given sample of data, any two probability models that have the same likelihood yield the same inference for Θ.
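A classic sketch of the likelihood principle: binomial sampling (n fixed, successes counted) and negative-binomial sampling (stop after a fixed number of failures) with the same observed counts have likelihoods proportional to θ^s(1−θ)^f, so under the same prior they produce identical posteriors. The counts here are illustrative.

```python
# Binomial(12, theta) with s = 9 successes vs. negative-binomial sampling that
# stops after f = 3 failures: the likelihoods differ only by constant factors,
# so normalizing (likelihood x prior) on a grid gives the same posterior.
from math import comb

s, f = 9, 3
grid = [(k + 0.5) / 1000 for k in range(1000)]     # midpoint grid on (0, 1)
prior = [1.0 for _ in grid]                        # Beta(1, 1) prior

lik_binom = [comb(s + f, s) * t**s * (1 - t)**f for t in grid]
lik_negbin = [comb(s + f - 1, s) * t**s * (1 - t)**f for t in grid]

def normalize(ws):
    z = sum(ws)
    return [w / z for w in ws]

post1 = normalize([l * p for l, p in zip(lik_binom, prior)])
post2 = normalize([l * p for l, p in zip(lik_negbin, prior)])
assert max(abs(p - q) for p, q in zip(post1, post2)) < 1e-12
print(sum(t * p for t, p in zip(grid, post1)))     # posterior mean
```

The stopping rule, which changes the constant in front of the likelihood, is irrelevant to the Bayesian inference about θ.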

### 3.3. Posterior distribution

For recent theoretical and applied overviews of Bayesian statistics, including many examples and uses of posterior distributions, see [10–12]. Posterior distributions for decision-making about home radon exposure are discussed in [13].

The posterior distribution summarizes the current state of knowledge about all the uncertain quantities in a Bayesian analysis. Analytically, the posterior density is the product of the prior density and the likelihood. In a complicated analysis, the joint posterior distribution can be summarized by a set of L simulation draws of the vector of uncertain quantities w1, w2, …, wJ, as illustrated in the following matrix:

| l | w1 | w2 | … | wJ |
|---|----|----|---|----|
| 1 | ·  | ·  | … | ·  |
| 2 | ·  | ·  | … | ·  |
| ⋮ | ⋮  | ⋮  |   | ⋮  |
| L | ·  | ·  | … | ·  |
The marginal posterior distribution for any unknown quantity wl can be summarized by its column of L simulation draws. In many examples it is not necessary to construct the entire table ahead of time; rather, one creates the L vectors of posterior simulations for the parameters of the model and then uses these to construct posterior simulations for other unknown quantities of interest, as necessary.
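A minimal sketch of this workflow (the draws here are random stand-ins for real posterior output; all names are illustrative):

```python
# Summarize a joint posterior via L simulation draws of J quantities.
import random

random.seed(1)
L, J = 1000, 3
# L x J matrix of draws; gauss(j, 1) stands in for real posterior simulations.
draws = [[random.gauss(j, 1.0) for j in range(J)] for _ in range(L)]

# Marginal posterior summary of a quantity: use its column of L draws.
def column(j):
    return [row[j] for row in draws]

means = [sum(column(j)) / L for j in range(J)]

# A derived quantity needs no extra table: transform the draws directly.
diff_draws = [row[2] - row[0] for row in draws]   # posterior draws of w3 - w1
diff_mean = sum(diff_draws) / L
print(means, diff_mean)
```

Posterior simulations for a new quantity of interest are obtained by applying the transformation draw-by-draw, exactly as the text describes.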

### 4. Application to games theory

In this section, we present an application of Bayesian inference to real-world problems, namely the Bayesian game, as follows.

### 4.1. The classical games

The basic content of the n-person game was presented by John Forbes Nash [14] in 1950. He was also the first to show the existence of equilibrium for this model when the players' preferences are representable by continuous quasi-concave utilities and the strategy sets are simplices. The definition of an n-person game can be written as follows.

### Definition 4.1


The normal form of an n-person game is (Xi, ri), i = 1, 2, …, n, where, for each i ∈ {1, 2, …, n}, the set of individual strategies of player i is a non-empty set Xi, and ri is the preference relation of player i on X ≔ ∏i∈I Xi.

The individual preferences ri are usually represented by utility functions, i.e. for each i ∈ {1, 2, …, n} there exists a real-valued function ui : X ≔ ∏i∈I Xi → ℝ such that:

x ri y ⇔ ui(x) ≥ ui(y), ∀x, y ∈ X.

Then the normal form of the n-person game becomes (Xi, ui), i = 1, 2, …, n.

The solution concept for this game, the Nash equilibrium, is defined below.

### Definition 4.2

A Nash equilibrium of the game (Xi, ui), i = 1, 2, …, n, is a point x∗ ∈ X which satisfies, for each i ∈ {1, 2, …, n}: ui(x∗) ≥ ui(x∗−i, xi) for each xi ∈ Xi.

The following theorem offers sufficient conditions for the existence of Nash equilibrium.

### Theorem 4.3

Let Γ = (Xi, ui), i = 1, 2, …, n, be an n-person game and denote by f the real-valued function on X × X defined by f(x, y) = ∑i ui(x−i, yi), the sum taken over i = 1, …, n. Let us assume that


Then, Γ has an equilibrium.

Proof. See [34].

Next, we present some examples of Nash equilibria for two-person games.

### Example 4.4

The battle of the sexes game has two Nash equilibria, (MT, FT) and (MS, FS), with payoffs (3, 2) and (2, 3) respectively, where "male prefers playing tennis" is denoted by MT, "male prefers shopping" by MS, "female prefers playing tennis" by FT and "female prefers shopping" by FS; see Figure 1.
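The two equilibria can be checked by brute force. The matched payoffs (3, 2) and (2, 3) are from the text; the mismatch payoffs (1, 1) are an assumption made here for illustration, since Figure 1 is not reproduced.

```python
# Brute-force search for pure Nash equilibria in a bimatrix game.
# Mismatch payoffs (1, 1) are assumed; matched payoffs are from the example.
payoffs = {  # (row strategy, col strategy) -> (row payoff, col payoff)
    ("MT", "FT"): (3, 2), ("MT", "FS"): (1, 1),
    ("MS", "FT"): (1, 1), ("MS", "FS"): (2, 3),
}
rows, cols = ["MT", "MS"], ["FT", "FS"]

def is_pure_nash(r, c):
    ur, uc = payoffs[(r, c)]
    row_ok = all(payoffs[(r2, c)][0] <= ur for r2 in rows)  # no profitable row deviation
    col_ok = all(payoffs[(r, c2)][1] <= uc for c2 in cols)  # no profitable col deviation
    return row_ok and col_ok

equilibria = [(r, c) for r in rows for c in cols if is_pure_nash(r, c)]
print(equilibria)
```

Under these assumed payoffs, exactly the two coordination outcomes survive as pure Nash equilibria.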

### Example 4.5

The oligopoly behavior game has a unique Nash equilibrium, (Aa, Ba), where "coffee shop A does not advertise" is denoted by Ad, "coffee shop A advertises" by Aa, "coffee shop B does not advertise" by Bd, and "coffee shop B advertises" by Ba; see Figure 2.

Figure 1. The battle of the sexes game.

Figure 2. The oligopoly behavior game.

#### 4.2. The Bayesian games

So far, we have assumed that everything in the game was common knowledge for all players. However, real players may have private information about their own payoffs, their type or their preferences. The way to model this situation of asymmetric information is to use the concept defined by Harsanyi in 1967. The key is to introduce a move by nature, which converts the problem of asymmetric information into one of imperfect information. The idea is that nature moves by determining the players' types, a concept that collects all the private information relevant to them (i.e. payoffs, preferences, beliefs about the other players, etc.).

#### Definition 4.6

The normal form of a Bayesian game with incomplete information includes:

1. the players i ∈ {1, 2, …, I};
2. the finite set of actions for each player, ai ∈ Ai;
3. the finite set of types for each player, θi ∈ Θi;
4. the beliefs of each player about the other players' types, p(θ−i | θi);
5. the utility function of each player, ui(ai, a−i, θi, θ−i).
It is important to discuss some parts of this definition. A player's type comprises all relevant information about that player's private characteristics. The type θi is observed only by player i, who uses this information both to make decisions and to update beliefs about the likelihood of the opponents' types.

Combining actions and types for each player, it is possible to construct the strategies. Strategies are given by si : Θi → Ai, with elements si(θi), where Θi is the type space and Ai is the action space. A strategy may assign different actions to different types. Lastly, each player computes utilities by taking expectations over types using its own conditional beliefs about the opponents' types. Hence, if player i uses the pure strategy si, the other players use the strategies s−i and player i's type is θi, the expected utility can be written as follows:

$$Eu_i(s_i \mid s_{-i}, \theta_i) = \sum_{\theta_{-i} \in \Theta_{-i}} u_i\big(s_i, s_{-i}(\theta_{-i}), \theta_i, \theta_{-i}\big)\, p(\theta_{-i} \mid \theta_i).$$

A Bayesian Nash equilibrium (BNE) is essentially the same concept as a Nash equilibrium, with the addition that players take expectations over their opponents' types, as follows.

### Definition 4.7

A Bayesian Nash equilibrium is a Nash equilibrium of a Bayesian game, i.e. Eui(si | s−i, θi) ≥ Eui(s′i | s−i, θi) for all s′i ∈ Si and for all types θi occurring with positive probability.

The following theorem establishes the existence of a Bayesian Nash equilibrium.

#### Theorem 4.8

Every finite Bayesian game has a Bayesian Nash equilibrium.

#### Example 4.9

Consider the Bayesian game given by the following two payoff matrices, one of which is selected by nature:

**Matrix I:**

|   | L      | R      |
|---|--------|--------|
| U | (1, 1) | (0, 0) |
| D | (0, 0) | (0, 0) |

**Matrix II:**

|   | L      | R      |
|---|--------|--------|
| U | (0, 0) | (0, 0) |
| D | (0, 0) | (2, 2) |
For this Bayesian game, we will find all BNE, first in pure strategies and then in mixed behavioral strategies.
#### 4.2.1. Pure strategy BNE

First, we reformulate the incomplete-information problem as a static extended game containing all possible strategies, Γb. Following Harsanyi, it can be shown that the Nash equilibria of Γb coincide with the equilibria of the imperfect-information game presented above. The idea is to build a game Γb in which every way the game can unfold is considered.

The first step is to define the strategies of each player.

Since COL does not know in which matrix the game is played, COL has only two strategies: L and R.

ROW knows in which matrix the game occurs, so ROW's strategies are UU, UD, DU and DD, where, for instance, UD means playing U in Matrix I and D in Matrix II.

Nature places the game in either matrix with probability 1/2. The new extended game Γb can be shown as:

|    | L          | R      |
|----|------------|--------|
| UU | (1/2, 1/2) | (0, 0) |
| UD | (1/2, 1/2) | (1, 1) |
| DU | (0, 0)     | (0, 0) |
| DD | (0, 0)     | (1, 1) |
Note that DU is a dominated strategy for ROW. After eliminating that possibility, the game has three pure Nash equilibria: {(UU, L); (UD, R); (DD, R)}.
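This elimination argument can be verified by brute force. The code below builds the extended game from the two matrices of this example (entries consistent with the best-response calculations used later, with nature choosing each matrix with probability 1/2) and enumerates the pure Nash equilibria.

```python
# Enumerate pure Nash equilibria of the extended game Γb: ROW's strategies are
# contingent plans (action in Matrix I, action in Matrix II); payoffs are the
# 50/50 expectation over nature's choice of matrix.
m1 = {("U", "L"): (1, 1), ("U", "R"): (0, 0), ("D", "L"): (0, 0), ("D", "R"): (0, 0)}
m2 = {("U", "L"): (0, 0), ("U", "R"): (0, 0), ("D", "L"): (0, 0), ("D", "R"): (2, 2)}

row_strats = ["UU", "UD", "DU", "DD"]   # first letter: Matrix I, second: Matrix II
col_strats = ["L", "R"]

def expected(rs, c):
    u1, u2 = m1[(rs[0], c)], m2[(rs[1], c)]
    return tuple(0.5 * a + 0.5 * b for a, b in zip(u1, u2))

def is_pure_bne(rs, c):
    ur, uc = expected(rs, c)
    row_ok = all(expected(alt, c)[0] <= ur for alt in row_strats)
    col_ok = all(expected(rs, alt)[1] <= uc for alt in col_strats)
    return row_ok and col_ok

bne = [(rs, c) for rs in row_strats for c in col_strats if is_pure_bne(rs, c)]
print(bne)
```

The enumeration recovers exactly the three pure equilibria stated above, and DU never appears in any of them.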

#### 4.2.2. Mixed strategy BNE

Next, to obtain the mixed strategies, we carry out a different kind of analysis and recover the three pure BNE obtained above.

Suppose that COL plays L with probability y, that ROW plays U with probability x if the game is in Matrix I, and that ROW plays U with probability z if the game is in Matrix II.

### 4.2.3. Players' best responses


• In Matrix I: ROW would play U (x = 1) if 1·y + 0·(1 − y) > 0, i.e. if y > 0, which can be summarized as:

a. if y > 0, then x = 1;
b. if y = 0, then x ∈ [0, 1].
• In Matrix II: ROW would play D (z = 0) if 0 < 2(1 − y), i.e. if y < 1, which can be summarized as:

c. if y < 1, then z = 0;
d. if y = 1, then z ∈ [0, 1].
COL would play L, y = 1 if

$$\frac{1}{2}\big[1x + 0(1-x)\big] + \frac{1}{2}\big[0z + 0(1-z)\big] > \frac{1}{2}\big[0x + 0(1-x)\big] + \frac{1}{2}\big[0z + 2(1-z)\big],$$

i.e. x > 2(1 − z), which can be summarized as:

e. if x = 2(1 − z), then y ∈ [0, 1];
f. if x > 2(1 − z), then y = 1;
g. if x < 2(1 − z), then y = 0.

Next, we check each of the possibilities in order to find the Nash equilibria, i.e. the strategy profiles that are stable for all players. Let us start by checking COL's strategies, since there are fewer combinations.

### 4.2.4. Mixed equilibrium

Case 1:

If y = 0, we have, by b., x ∈ [0, 1] and, by c., z = 0. We want to check that this is an equilibrium from COL's point of view. By g., if z = 0 then the condition x < 2(1 − z) = 2 always holds, which confirms y = 0.

Thus, we obtain the Nash equilibria y = 0, x ∈ [0, 1] and z = 0.

This family of equilibria supports two of the three pure BNE found before: (DD, R), which corresponds to y = 0, x = 0 and z = 0, and (UD, R), which corresponds to y = 0, x = 1 and z = 0.

There are many BNE in which column plays R and row plays xU + (1 − x)D with x ∈ [0, 1] if Matrix I occurs, and D if Matrix II occurs.

### Case 2:

If y = 1, we have d. z ∈ [0, 1] and, from a., x = 1.

From f., we can see that when x = 1, it must be that z ≥ 1/2 in order for y = 1 to hold. Hence, these BNE are restricted to y = 1, z ∈ [1/2, 1] and x = 1.

This BNE supports the third pure Nash Equilibrium found before: (UU, L), which is the same as y = 1, x = 1 and z = 1.

There are many BNE in which column plays L and row plays U if Matrix I occurs, and zU + (1 − z)D with z ∈ [1/2, 1] if Matrix II occurs.

Case 3:

If y ∈ (0, 1), we have a. x = 1 and c. z = 0. By e., in order for y to lie strictly inside [0, 1], it must be the case that x = 2(1 − z). However, this equality cannot hold when both z = 0 and x = 1.

Therefore, the case y ∈ (0, 1) does not yield a Bayesian Nash equilibrium.
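The three cases can be verified numerically using the expected payoffs derived above (U earns y in Matrix I, D earns 2(1 − y) in Matrix II, and L and R earn x/2 and 1 − z for COL):

```python
# Check whether a profile (x, y, z) of mixing probabilities is a BNE: each
# mixing probability must be a best response (pure when one action is strictly
# better, arbitrary when the player is indifferent).
def supports(prob, u_yes, u_no, tol=1e-9):
    """Is putting probability `prob` on the first action a best response?"""
    if u_yes > u_no + tol:
        return abs(prob - 1.0) < tol
    if u_no > u_yes + tol:
        return abs(prob) < tol
    return True                                # indifferent: any mix works

def is_bne(x, y, z):
    row1 = supports(x, y, 0.0)                 # Matrix I: U earns y, D earns 0
    row2 = supports(z, 0.0, 2 * (1 - y))       # Matrix II: U earns 0, D earns 2(1-y)
    col = supports(y, x / 2, 1 - z)            # COL: L earns x/2, R earns 1-z
    return row1 and row2 and col

print(is_bne(0.4, 0.0, 0.0))   # Case 1 family: y = 0, any x, z = 0
print(is_bne(1.0, 1.0, 0.75))  # Case 2 family: y = 1, x = 1, z in [1/2, 1]
print(is_bne(1.0, 0.5, 0.0))   # Case 3: interior y cannot be supported
```

The check confirms the two equilibrium families and rejects any profile with y strictly between 0 and 1.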

#### 4.3. Abstract economy model

Later, the existence of a social equilibrium was proved by Debreu [15], and Arrow and Debreu [16] proved the existence of a Walrasian equilibrium. The classical abstract economy game, introduced by Shafer and Sonnenschein [17] and Borglin and Keiding [18], consists of a finite set of agents, each characterized by certain constraints and preferences described by correspondences. Following these ideas, many authors studied the existence of equilibrium for generalized games (see, for example, [19–27] and the references therein). We now give some definitions of an abstract economy model and of its equilibrium. Let the set of agents be the finite set {1, 2, …, n}, and for each i ∈ {1, 2, …, n} let Xi be a non-empty set.

### Definition 4.10

An abstract economy Γ = (Xi, Ai, Pi), i = 1, 2, …, n, is defined as a family of n ordered triplets (Xi, Ai, Pi), where for each i ∈ I:

1. Ai : ∏i∈I Xi → 2^Xi is the constraint correspondence;
2. Pi : ∏i∈I Xi → 2^Xi is the preference correspondence.
### Definition 4.11

An equilibrium for Γ is a point x∗∈ Qi∈IXi which satisfies for each i∈ {1, 2, … , n}:

1. x∗i ∈ Ai(x∗);

2. Ai(x∗) ∩ Pi(x∗) = ∅.
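On finite strategy sets, these two equilibrium conditions can be checked exhaustively. The toy instance below (sets and correspondences are hypothetical, chosen only to exercise the definition) searches for all equilibria of a two-agent abstract economy.

```python
# Toy finite instance of Definition 4.11: find all x* such that, for each agent i,
# x*_i lies in A_i(x*) and A_i(x*) ∩ P_i(x*) is empty.
from itertools import product

X = [("a", "b"), ("c", "d")]   # hypothetical strategy sets X_1, X_2

def A(i, x):                   # constraint correspondence (illustrative)
    return set(X[i])           # here every strategy is feasible at every state

def P(i, x):                   # preference correspondence: strictly preferred choices
    better = {0: {("b", "c"): {"a"}}, 1: {}}   # agent 0 prefers 'a' at state (b, c)
    return better[i].get(x, set())

def is_equilibrium(x):
    return all(x[i] in A(i, x) and not (A(i, x) & P(i, x)) for i in range(2))

equilibria = [x for x in product(*X) if is_equilibrium(x)]
print(equilibria)
```

State (b, c) fails because agent 0's feasible set contains a strictly preferred alternative there; every other state satisfies both conditions.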

#### Theorem 4.12

Let Γ = (Xi, Ai, Pi), i = 1, 2, …, n, be an abstract economy which satisfies, for each i ∈ {1, 2, …, n}:


Then, Γ has an equilibrium.

Proof. See [34].


### 4.4. Fuzzy games

The concept of a fuzzy set was introduced by Zadeh [28] in 1965. Fuzzy set theory has proved to be a useful tool for describing situations in which the data are imprecise or vague, and it has become a well-established framework for studying fuzzy equilibrium existence results for abstract fuzzy economies. The first study of a fuzzy abstract economy (or fuzzy game) is due to Kim and Lee [29], who showed the existence of equilibrium for a 1-person fuzzy game, as well as the existence of equilibrium for generalized games whose constraints or preferences are vague due to the agents' behavior. In 2009, Patriche [30] studied the Bayesian abstract economy game and proved the existence of equilibrium for an abstract economy game with differential information and a measure space of agents. However, the existence of a random fuzzy equilibrium for fuzzy games had not been studied before. In 2013, Patriche [31] defined the Bayesian abstract economy game and proved the existence of the Bayesian fuzzy equilibrium for this game. Patriche [32] also defined a new Bayesian abstract fuzzy economy game, characterized by a private information set, an action fuzzy mapping, a random fuzzy constraint mapping and a random fuzzy preference mapping, and proved the existence of the Bayesian fuzzy equilibrium for it. Recently, Patriche [33] defined fuzzy games with applications to systems of generalized quasi-variational inequality problems. The Bayesian fuzzy equilibrium concept is an extension of the deterministic equilibrium; it generalizes and extends the earlier deterministic models introduced by Debreu [15], Shafer and Sonnenschein [17] and Patriche [34]. Very recently, Saipara and Kumam [35] introduced the model of a general Bayesian abstract fuzzy economy for product measurable spaces, and proved the existence of a Bayesian fuzzy equilibrium for this model, as follows.

For each i ∈ I, let (Ωi, Zi) be a measurable space, let (Ω, Z) be the product measurable space, where Ω ≔ ∏i∈I Ωi and Z ≔ ⊗i∈I Zi, and let μ be a probability measure on (Ω, Z). Let Y denote the strategy or commodity space, where Y is a separable Banach space.

Let I be a non-empty finite set (the set of agents). For each i ∈ I, let Xi : Ωi → F(Y) be a fuzzy mapping, and let zi ∈ (0, 1].

Let LXi = {xi ∈ S(Xi(·))zi : xi is Σi-measurable}. Denote LX = ∏i∈I LXi and LX−i = ∏j≠i LXj. An element xi of LXi is called a strategy for agent i; the typical element of LXi is denoted by xi and that of (Xi(ωi))zi by xi(ωi) (or xi). We can define a general Bayesian abstract fuzzy economy model of product measurable spaces as follows.

### Definition 4.13

A general Bayesian abstract fuzzy economy model of product measurable spaces is defined as follows:

$$\Gamma = \left( \left( (\Omega_i, \mathcal{Z}_i)_{i \in I}, \mu \right), \left( X_i, \Sigma_i, (A_i, a_i), (P_i, p_i), z_i \right)_{i \in I} \right),$$

where I is a non-empty finite set (the set of agents) and, for each i ∈ I:

a. Xi : Ωi → F(Y) is the action fuzzy mapping of agent i;

b. Σi is a sub σ-algebra of Z = ⊗i∈I Zi, which denotes the private information of agent i;

c. for each ωi ∈ Ωi, Ai(ωi, ·) : LX → F(Y) is the random fuzzy constraint mapping of agent i;

d. for each ωi ∈ Ωi, Pi(ωi, ·) : LX → F(Y) is the random fuzzy preference mapping of agent i;

e. ai : LX → (0, 1] is a random fuzzy constraint function, and pi : LX → (0, 1] is a random fuzzy preference function of agent i;

f. zi ∈ (0, 1] is such that for all (ωi, x̃) ∈ Ωi × LX, (Ai(ωi, x̃))ai(x̃) ⊂ (Xi(ωi))zi and (Pi(ωi, x̃))pi(x̃) ⊂ (Xi(ωi))zi.
The Bayesian fuzzy equilibrium for a general Bayesian abstract fuzzy economy model of product measurable spaces is defined as follows.

### Definition 4.14

A Bayesian fuzzy equilibrium for Γ is a strategy profile x̃∗ ∈ LX such that for all i ∈ I:

i. x̃∗(ωi) ∈ cl (Ai(ωi, x̃∗))ai(x̃∗) μ-a.e.;

ii. (Ai(ωi, x̃∗))ai(x̃∗) ∩ (Pi(ωi, x̃∗))pi(x̃∗) = ∅ μ-a.e.
#### Theorem 4.15

Let I be a non-empty finite set, and let the family

$$\Gamma = \left( \left( (\Omega_i, \mathcal{Z}_i)_{i \in I}, \mu \right), \left( X_i, \Sigma_i, (A_i, a_i), (P_i, p_i), z_i \right)_{i \in I} \right)$$

be a general Bayesian abstract fuzzy economy model of product measurable spaces satisfying, for each i ∈ I, conditions (a)–(j). Then there exists a Bayesian fuzzy equilibrium for Γ.


Proof. See [35].

18 Bayesian Inference

Moreover, since Fichera and Stampacchia first introduced the variational inequality problem in 1960, this issue has been widely studied. The basic concept of variational inequalities for fuzzy mappings was first introduced by Chang and Zhu [36] in 1989, and many mathematicians have since studied this topic (see, for example, [37, 38]). In 1993, the concept of a random variational inequality was introduced by Noor and Elsanousi [39]. Recently, Patriche [31] used the Bayesian abstract fuzzy economy model to prove the existence of solutions for two types of random quasi-variational inequalities with random fuzzy mappings.

### 5. Conclusion

The main objective of this chapter was to introduce the concept of Bayesian inference and its application to some real-world problems. We presented the basic concepts of Bayesian inference and showed how they can be applied to Bayesian games and to a general Bayesian abstract fuzzy economy game (fuzzy game). For the Bayesian game, we derived the Bayesian Nash equilibrium (BNE) with examples. Finally, we showed the existence of a Bayesian fuzzy equilibrium for a fuzzy game.

### Acknowledgements

This project was supported by the Theoretical and Computation Science (TaCS) Center under Computational and Applied Science for Smart Innovation Cluster (CLASSIC), Faculty of Science, KMUTT. Moreover, Poom Kumam was supported by the Thailand Research Fund (TRF) and the King Mongkut's University of Technology Thonburi (KMUTT) under the TRF Research Scholar Award (Grant No. RSA6080047).

### Author details

Wiyada Kumam<sup>1</sup>, Plern Saipara<sup>2</sup> and Poom Kumam<sup>3</sup>\*

\*Address all correspondence to: poom.kum@kmutt.ac.th
1 Rajamangala University of Technology Thanyaburi (RMUTT), Thailand

2 Rajamangala University of Technology Lanna Nan (RMUTL), Thailand

3 King Mongkut's University of Technology Thonburi (KMUTT), Thailand


### References


were shown the solution of Bayesian Nash Equilibrium (BNE) for a Bayesian game with examples. Finally, we were shown the existence of Bayesian fuzzy equilibrium for a fuzzy

This project was supported by the Theoretical and Computation Science (TaCS) Center under Computational and Applied Science for Smart Innovation Cluster (CLASSIC), Faculty of Science, KMUTT. Moreover, Poom Kumam was supported by the Thailand Research Fund (TRF) and the King Mongkut's University of Technology Thonburi (KMUTT) under the TRF

\*

[1] Stigler S. Who discovered Bayes's theorem. The American Statistician. 1983;37(4):290-296

[2] Jeffreys H. Theory of Probability. 3rd ed. Oxford, England: Oxford University Press; 1961

[3] Ibrahim J, Chen M. Power prior distributions for regression models. Statistical Science. 2000;15:46-60

[4] Gelman A. Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine. 2008;27:2865-2873

[5] Jaynes E. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics. 1968;4(3):227-241

[6] Berger J, Bernardo J, Dongchu S. The formal definition of reference priors. Annals of Statistics. 2009;37(2):905-938

[7] Bernardo J. Reference posterior distributions for Bayesian inference (with discussion). Journal of the Royal Statistical Society, B. 1979;41:113-147

[8] Bernardo J. Reference analysis. In: Dey D, Rao C, editors. Handbook of Statistics. Vol. 25. Amsterdam: Elsevier; 2005. p. 17-90

[26] Ding XP, Feng HL. Fixed point theorems and existence of equilibrium points of noncompact abstract economies for L<sup>∗</sup>F-majorized mappings in FC-spaces. Nonlinear Analysis: Theory, Methods & Applications. 2010;72:65-76

[27] Wang L, Cho YJ, Huang NJ. The robustness of generalized abstract fuzzy economies in generalized convex spaces. Fuzzy Sets and Systems. 2011;176:56-63

[28] Zadeh LA. Fuzzy sets. Information and Control. 1965;8:338-353

[29] Kim WK, Lee KH. Fuzzy fixed point and existence of equilibria of fuzzy games. Journal of Fuzzy Mathematics. 1998;6:193-202

[30] Patriche M. Bayesian abstract economy with a measure space of agents. Abstract and Applied Analysis. 2009;2009:1-11

[31] Patriche M. Equilibrium of Bayesian fuzzy economies and quasi-variational inequalities with random fuzzy mappings. Journal of Inequalities and Applications. 2013;374. Article ID 58E35

[32] Patriche M. Existence of equilibrium for an abstract economy with private information and a countable space of actions. Mathematical Reports. 2013;15(65)(3):233-242

[33] Patriche M. Fuzzy games with a countable space of actions and applications to systems of generalized quasi-variational inequalities. Fixed Point Theory and Applications. 2014;2014(124)

[34] Patriche M. Equilibrium in Games and Competitive Economies. Bucharest: The Publishing House of the Romanian Academy; 2011

[35] Saipara P, Kumam P. Fuzzy games for a general Bayesian abstract fuzzy economy model of product measurable spaces. Mathematical Methods in the Applied Sciences. 2015;39(16):4810-4819

[36] Chang SS, Zhu YG. On variational inequalities for fuzzy mappings. Fuzzy Sets and Systems. 1989;32:359-367

[37] Noor MA. Variational inequalities for fuzzy mappings III. Fuzzy Sets and Systems. 2000;110:101-108

[38] Park JY, Lee SY, Jeong JU. Completely generalized strongly quasivariational inequalities for fuzzy mapping. Fuzzy Sets and Systems. 2000;110:91-99

[39] Noor MA, Elsanousi SA. Iterative algorithms for random variational inequalities. Panamerican Mathematical Journal. 1993;3:39-50



### **Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs**

Jørund I. Gåsemyr and Bent Natvig

DOI: 10.5772/intechopen.70058

Additional information is available at the end of the chapter

### Abstract

Over the last decades, Bayesian hierarchical models defined by means of directed, acyclic graphs have become an essential and widely used methodology in the analysis of complex data. Simulation-based model criticism in such models can be based on conflict measures constructed by contrasting separate local information sources about each node in the graph. An initial suggestion of such a measure was not well calibrated. This shortcoming has, however, to a large extent been rectified by subsequently proposed alternative mutually similar tail probability-based measures, which have been proved to be uniformly distributed under the assumed model under various circumstances, and in particular, in quite general normal models with known covariance matrices. An advantage of this is that computationally costly precalibration schemes needed for some other suggested methods can be avoided. Another advantage is that noninformative prior distributions can be used when performing model criticism. In this chapter, we describe the basic framework and review the main uniformity results.

Keywords: cross-validation, data splitting, information contribution, MCMC, model criticism, pivotal quantity, preexperimental distribution, p-value

### 1. Introduction

Over the last decades, Bayesian hierarchical models have become an essential and widely used methodology in the analysis of complex data. Computational techniques such as Markov chain Monte Carlo (MCMC) methods make it possible to treat very complex models and data structures. Analysis of such models gives intuitively appealing Bayesian inference based on posterior probability distributions for the parameters.

In the construction of such models, an understanding of the underlying structure of the problem can be represented by means of directed acyclic graphs (DAGs), with nodes in the graph corresponding to data or parameters, and directed edges between parameters representing conditional distributions. However, a perfect understanding of the underlying structure is usually an unachievable goal, and there is always a danger of constructing inadequate models. Box [1] suggests a pattern for the model building process where an initial candidate model is assessed for adequacy, and if necessary modified and elaborated on, leading to a new candidate that again is checked for adequacy, and so on. As a tool in this model criticism process, Ref. [1] suggests using the prior predictive distribution of some checking function or test statistic as a reference for the observed value of this checking function, resulting in a prior predictive p-value. This requires an informative and realistic prior distribution, which is not always available or even desirable. Indeed, as pointed out in Ref. [2], in an early phase of the model building process, it is often convenient to use noninformative or even improper priors and thus avoid costly and time-consuming elicitation of prior information. Noninformative priors may be used also for the inference because relevant prior information is unavailable.

© The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

There exist many other methods for checking the overall fit of the model or an aspect of the model of special interest, based on locating a test statistic or a discrepancy measure in some kind of a reference distribution. The posterior predictive p-value (ppp) of Ref. [3] uses the posterior distribution as reference and does not require informative priors. But this method uses data twice and can as a result be very conservative [2, 4–6]. Hjort et al. [5] suggest remedying this by using the ppp value as a test statistic in a prior predictive test. The computation of the resulting calibrated cppp-value is, however, very computer intensive in the general case, and again realistic, informative priors are needed. A node-level discrepancy measure suggested in Ref. [7] is subject to the same limitations. The partial posterior predictive p-value of Ref. [4] avoids double use of data and allows noninformative priors but may be difficult to compute and interpret in hierarchical models.

Comparison with other candidate models through a technique for model comparison or model choice, such as predictive methods, maximum posterior probability, Bayes factors or an information criterion, can also serve as tools for checking model adequacy indirectly when alternative candidate models exist.

In this chapter, we will, however, focus on methods for criticizing models in the absence of any particular alternatives. We will review methods for checking the modeling assumptions at each node of the DAG. The aim is to identify parts or building blocks of the model that are in discordance with reality, which may be in need of adjustment or further elaboration. O'Hagan [8] regards any node in the graph as receiving information from two disjoint subsets of the neighboring nodes. This information is represented as a conditional probability density or a likelihood or as a combination of these two kinds of information sources. Adopting the same basic perspective, our aim is to check for inconsistency between such subsets. The suggestion in Ref. [8] is to normalize these information sources to have equal height 1 and to regard the height of the curves at the point of intersection as a measure of conflict. However, as shown in Ref. [2], this measure tends to be quite conservative. Dahl et al. [9] demonstrated that it is also poorly calibrated, with false warning probabilities that vary substantially between models. Dahl et al. [9] also identified the different sources of inaccuracy and modified the measure of Ref. [8] to an approximately χ<sup>2</sup> -distributed quantity under the assumed model by instead normalizing the information sources to probability densities. In Ref. [10], these densities were instead used to define tail probability-based conflict measures. Gåsemyr and Natvig [10] showed that these measures are uniformly distributed in quite general hierarchical normal models with fixed variances/covariances. In Ref. [11], such uniformity results were proved in various situations involving nonnormal and nonsymmetric distributions. These uniformity results indicate that the measures of Refs. [9] and [10] have comparable interpretations across different models. 
Therefore, they can be used without computationally costly precalibration schemes, such as the one suggested in Ref. [5]. Gåsemyr [12] focuses on some situations where the conflict measure approach can be directly compared to the calibration method of Ref. [5] and shows that the less computer-intensive conflict measure approach performs at least as well in these situations. Moreover, the conflict measure approach can be applied in models using noninformative prior distributions.

Focusing on the special problem of identifying outliers among the second-level parameters in a random-effects model, Ref. [13] defines similar conflict measures. In this setting, the group-specific means are the nodes of interest. In some models, there exist sufficient statistics for these means. Then, outlier detection at the group level can also be based on cross-validation, measuring the tail probability beyond the observed value of the statistic in the posterior predictive distribution given data from the other groups. In this context, the conflict measure approach can be viewed as an extension of cross-validation to situations where sufficient statistics do not exist. Ref. [13] also gives applications to the examination of exceptionally high hospital mortality rates and to results from a vaccination program. In Ref. [14], this methodology is used to check for inconsistency in multiple treatment comparison of randomized clinical trials. Presanis et al. [15] apply these conflict measures in complex cases of medical evidence synthesis.

### 2. Directed acyclic graphs and node-specific conflict


### 2.1. Directed acyclic graphs and Bayesian hierarchical models

An example of a DAG discussed extensively in Ref. [8] is the random-effects model with normal random effects and normal error terms defined by

$$Y\_{i,j} \sim \mathcal{N}(\lambda\_i, \sigma^2), \lambda\_i \sim \mathcal{N}(\mu, \tau^2), j = 1, \dots, n\_i, i = 1, \dots, m. \tag{1}$$
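As a concrete illustration, model (1) can be simulated in a few lines. The parameter values below (m, n, μ, τ, σ) are illustrative choices, not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for the quantities in Eq. (1).
m, n = 8, 5          # m groups, n_i = n observations per group
mu, tau, sigma = 0.0, 1.0, 0.5

lam = rng.normal(mu, tau, size=m)            # lambda_i ~ N(mu, tau^2)
y = rng.normal(lam[:, None], sigma, (m, n))  # Y_ij ~ N(lambda_i, sigma^2)

print(y.shape)
```

Each row of `y` holds one group's observations, so group-level summaries such as the sample means `y.mean(axis=1)` are directly available.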

In general, we identify the nodes or vertices of the graph with the unknown parameters θ and the observed data y, the latter appearing as bottom nodes and being the realizations of the random vector Y. In the Bayesian model, the parameters, the components of θ, are also considered as random variables. In general, if there is a directed edge from node a to node b, then a is a parent of b, and b is a child of a. We denote by Ch(a) the set of child nodes of a, and by Pa(b) the set of parent nodes of b. More generally, b is a descendant of a if there is a directed path from a to b. The set of descendants of a is denoted by Desc(a) and, for convenience, is defined to contain a itself. The directed edges encode conditional independence assumptions, indicating that, given its parents, a node is assumed to be independent of all other nondescendants. Hence, writing θ = (ν, μ), with μ representing the vector of top-level nodes, the joint density of (Y, θ) = (Y, ν, μ) is

$$p(\mathbf{y}, \boldsymbol{\nu}, \boldsymbol{\mu}) = \prod\_{y \in \mathbf{y}} p(y|\text{Pa}(y)) \prod\_{\nu \in \boldsymbol{\nu}} p(\nu|\text{Pa}(\nu)) \, \pi(\boldsymbol{\mu}), \tag{2}$$

where π(μ) is the prior distribution of μ. The posterior distribution π(θ|y) is the basis for the inference.
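The factorization (2) translates directly into code: the joint log-density is a sum of log p(node | Pa(node)) terms over the DAG. The sketch below uses a toy two-group version of model (1); the node names, parameter values, and dictionary representation are all illustrative assumptions, not part of the chapter:

```python
import math

def log_normal(x, mean, var):
    # log density of N(mean, var) at x
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

sigma2, tau2 = 0.25, 1.0

# Each node maps to (parents, log p(value | parent values)).
dag = {
    "mu":   ([],       lambda v, pa: 0.0),  # improper flat prior: constant
    "lam1": (["mu"],   lambda v, pa: log_normal(v, pa[0], tau2)),
    "lam2": (["mu"],   lambda v, pa: log_normal(v, pa[0], tau2)),
    "y1":   (["lam1"], lambda v, pa: log_normal(v, pa[0], sigma2)),
    "y2":   (["lam2"], lambda v, pa: log_normal(v, pa[0], sigma2)),
}

def joint_log_density(values):
    # Sum of log p(node | Pa(node)) over all nodes, mirroring Eq. (2).
    total = 0.0
    for node, (parents, logpdf) in dag.items():
        total += logpdf(values[node], [values[p] for p in parents])
    return total

vals = {"mu": 0.0, "lam1": 0.3, "lam2": -0.2, "y1": 0.5, "y2": 0.0}
print(joint_log_density(vals))
```

The conditional independence assumptions are what make this a simple sum: each node contributes one term that depends only on its parents.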

This setup can be generalized in various directions. The nodes may be allowed to represent vectors, at both the parameter and the data levels [10]. Instead of DAGs, one may consider chain graphs, as described in Ref. [16], with undirected edges representing mutual dependence as in Markov random fields. Scheel et al. [17] introduce a graphical diagnostic for model criticism in such models.

#### 2.2. Information contributions

The representation of a Bayesian hierarchical model in terms of a DAG is often meant to reflect an understanding of the underlying structure of the problem. By looking for a conflict associated with the different nodes in the DAG, we may therefore put our understanding of this structure to test. We may also identify parts of the model that need adjustment.

The idea put forward in Ref. [8] is that for each node λ in a DAG one may in general think of each neighboring node as providing information about λ and that it is of interest to consider the possibility of conflict between different sources of information. For instance, one may want to contrast the local prior information provided by the factor p(λ|Pa(λ)) with the likelihood information source formed by multiplying the factors p(γ|Pa(γ)) for all child nodes γ ∈ Ch(λ). The full conditional distribution of λ given all the observed and unobserved variables in the DAG, i.e.,

$$\pi(\lambda|(\mathbf{y},\boldsymbol{\theta})\_{-\lambda}) \propto p(\lambda|\mathbf{Pa}(\lambda)) \prod\_{\boldsymbol{\gamma} \in \mathrm{Ch}(\lambda)} p(\boldsymbol{\gamma}|\mathbf{Pa}(\boldsymbol{\gamma})),\tag{3}$$

is determined by these two types of factors. Here, (y, θ)<sub>−λ</sub> denotes the vector of all components of (y, θ) except for λ.
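For model (1), the full conditional (3) of a group mean combines the parent factor N(μ, τ²) with the child factors N(y<sub>ij</sub>; λ<sub>i</sub>, σ²), and by normal conjugacy it is again normal. A minimal sketch with illustrative numbers:

```python
# Full conditional of lambda_i in model (1), by normal conjugacy.
# All numerical values below are illustrative assumptions.
mu, tau2, sigma2 = 0.0, 1.0, 0.5
y = [0.8, 1.1, 0.6, 1.3]          # hypothetical observations in group i
n = len(y)

prec = 1 / tau2 + n / sigma2      # posterior precision: prior + n likelihood terms
post_var = 1 / prec
post_mean = post_var * (mu / tau2 + sum(y) / sigma2)

print(post_mean, post_var)
```

The full conditional is N(post_mean, post_var): a precision-weighted compromise between the parent information (μ) and the child information (the group data).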

Dahl et al. [9] normalize the product ∏<sub>γ∈Ch(λ)</sub> p(γ|Pa(γ)) to a probability density function denoted by fc(λ), the likelihood or child node information contribution, whereas the local prior density is denoted by fp(λ) and called the prior or parent node information contribution. These information contributions are integrated with respect to posterior distributions for the unknown nuisance parameters to form integrated information contribution (iic) densities denoted by gc and gp. In this construction, a key to avoiding the conservatism of the measure suggested in Ref. [8] is to prevent dependence between the two information sources by introducing a suitable data splitting Y = (Yp, Yc) and conditioning the parameters of fp on y<sub>p</sub> and the parameters of fc on y<sub>c</sub>.

Definition 1 For a given parameter node λ, denote by β<sup>p</sup> the vector whose components are Pa(λ), and by β<sup>c</sup> the vector whose components are


$$\cup\_{\gamma \in \text{Ch}(\lambda)} (\{\gamma\} \cup \text{Pa}(\gamma)) - \{\lambda\} = \text{Ch}(\lambda) \cup [\text{Pa}(\text{Ch}(\lambda)) - \{\lambda\}] \tag{4}$$

Let Y = (Yp, Yc) be a splitting of the data Y. Define the densities fp and fc, the prior and likelihood information contributions, respectively, by

$$f\_p(\lambda; \beta\_p) = p(\lambda | \beta\_p), \quad f\_c(\lambda; \beta\_c) \propto \prod\_{\gamma \in \text{Ch}(\lambda)} p(\gamma | \text{Pa}(\gamma)) \tag{5}$$

Define the integrated information contribution densities gp, gc by


$$g\_p(\lambda) = \int f\_p(\lambda; \beta\_p)\, \pi(\beta\_p|\mathbf{y}\_p)\, d\beta\_p, \quad g\_c(\lambda) = \int f\_c(\lambda; \beta\_c)\, \pi(\beta\_c|\mathbf{y}\_c)\, d\beta\_c \tag{6}$$

and denote by Gp, Gc the corresponding cumulative distribution functions.

Note that β<sup>c</sup> may contain data nodes. The second integral in Eq. (6) is then taken only with respect to the random components of βc, i.e., the parameters in βc. If β<sup>c</sup> contains no parameters, then gc and fc coincide. Definition 1 may also be extended to the case when λ is a vector, corresponding to a subset of parameter nodes.
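The integrals in Eq. (6) are typically approximated by Monte Carlo, averaging fp(λ; βp) over posterior draws of βp. A minimal sketch, in which the "posterior" draws of βp = μ are simply simulated from an assumed normal rather than produced by an actual MCMC run, and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Stand-in posterior draws of beta_p = mu given y_p; in practice these
# would come from an MCMC sampler conditioned on y_p.
mu_draws = rng.normal(0.1, 0.4, size=5000)

tau2 = 1.0
lam_grid = np.linspace(-4, 4, 201)

# g_p(lambda) = average of f_p(lambda; mu) over the draws, as in Eq. (6).
g_p = normal_pdf(lam_grid[:, None], mu_draws[None, :], tau2).mean(axis=1)

# Sanity check: the estimated density has total mass close to 1.
mass = (g_p * (lam_grid[1] - lam_grid[0])).sum()
print(round(mass, 3))
```

The same recipe gives gc(λ) from draws of βc; when βc contains no parameters, gc and fc coincide and no averaging is needed.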

Combining the set of information sources linked to a specific node in different ways leads to a modification of Definition 1 where β<sup>c</sup> does not contain all child nodes of λ, the others being instead included in β<sup>p</sup> together with their parent nodes. In this way, different types of conflict about the node may be revealed. This is natural, e.g., in the context of outlier detection among independent observations with a common mean. Note that β<sup>p</sup> and β<sup>c</sup> may then be overlapping, containing common coparents with λ. The setup is illustrated in Figure 1 in the case when the set of common components, by abuse of notation denoted by β<sup>p</sup> ∩ βc, is empty. For the general setup, Definition 1 is modified as follows.

Figure 1. Part of a DAG showing information sources about λ.

Definition 2 Let γ be a vector whose components are a subset of Ch(λ), and define β<sup>c</sup> as in Eq. (4). Denote by γ<sup>1</sup> the rest of the child nodes of λ, and let β<sup>p</sup> consist of γ<sup>1</sup> together with its parent nodes in the same way as in Eq. (4), as well as Pa(λ). The information contributions are then given by

$$f\_p(\lambda; \beta\_p) \propto p(\gamma\_1|\text{Pa}(\gamma\_1))\, p(\lambda|\text{Pa}(\lambda)), \tag{7}$$

$$f\_c(\lambda; \beta\_c) \propto p(\gamma|\text{Pa}(\gamma)). \tag{8}$$

In Eq. (7), p(λ|Pa(λ)) is replaced by the prior density π(λ) if λ is a top-level parameter. The corresponding iic densities are defined by Eq. (6) as before.

#### 2.3. Node-specific conflict measures

The conflict measure c<sup>2</sup><sub>λ</sub> of Ref. [9] is defined as

$$c\_{\lambda}^{2} = \left(E^{G\_p}(\lambda) - E^{G\_c}(\lambda)\right)^{2} / \left(\text{var}^{G\_p}(\lambda) + \text{var}^{G\_c}(\lambda)\right) \tag{9}$$

The χ<sup>2</sup><sub>1</sub>-distribution is the reference distribution for this measure. For the conflict measures of Ref. [10], the uniform distribution on [0, 1] is the reference distribution. They focus on tail behavior but are based on the same iic distributions. The general distribution of information sources given in Definition 2 is also introduced in Ref. [10]. For a given pair Gp, Gc of iic distributions, let λ<sup>∗</sup><sub>p</sub> and λ<sup>∗</sup><sub>c</sub> be independent samples from Gp and Gc, respectively. Let G be the cumulative distribution function for δ = λ<sup>∗</sup><sub>p</sub> − λ<sup>∗</sup><sub>c</sub>. Define

$$
c\_{\lambda}^{3+} = G(0), \quad c\_{\lambda}^{3-} = \overline{G}(0) \stackrel{\text{def}}{=} 1 - G(0) \tag{10}
$$

and

$$
c\_{\lambda}^{3} = 1 - 2\min(G(0), \overline{G}(0)) = 2|G(0) - 1/2| \,. \tag{11}
$$

The c<sup>3+</sup><sub>λ</sub>-measure and the P<sup>conf</sup><sub>λ</sub> measure of Ref. [13] are very similar. The latter measure is aimed at detecting outlying groups or units in a three-level hierarchical model, with the second-level parameters being location parameters for group-specific data. However, the measure is interpreted as a p-value, with small values indicative of conflict. Gåsemyr and Natvig [10] also define a measure based on defining a tail area in terms of the density g of G, namely

$$c\_{\lambda}^{4} = P^{G}(g(\delta) > g(0)), \tag{12}$$

applicable also when λ is a vector.
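In practice, all of the measures in Eqs. (9)–(12) can be estimated directly from Monte Carlo samples of λ*<sub>p</sub> and λ*<sub>c</sub>. The following sketch is our own illustration, not part of the chapter: it assumes scalar samples, pairs them to form δ, and uses a simple Gaussian kernel density estimate for the c<sup>4</sup> tail area.

```python
import numpy as np

def _gauss_kde(x, pts):
    """Simple Gaussian kernel density estimate of the density of x at pts."""
    x = np.asarray(x, dtype=float)
    pts = np.atleast_1d(np.asarray(pts, dtype=float))
    bw = 1.06 * x.std() * len(x) ** (-0.2)  # Silverman-type bandwidth
    z = (pts[:, None] - x[None, :]) / bw
    return np.exp(-0.5 * z * z).mean(axis=1) / (bw * np.sqrt(2.0 * np.pi))

def conflict_measures(lam_p, lam_c):
    """Monte Carlo versions of c2 (Eq. 9), c3+/c3- (Eq. 10), c3 (Eq. 11)
    and c4 (Eq. 12) from independent samples of G_p and G_c."""
    lam_p = np.asarray(lam_p, dtype=float)
    lam_c = np.asarray(lam_c, dtype=float)
    # Eq. (9): squared mean difference over the summed variances
    c2 = (lam_p.mean() - lam_c.mean()) ** 2 / (lam_p.var() + lam_c.var())
    n = min(len(lam_p), len(lam_c))
    delta = lam_p[:n] - lam_c[:n]        # samples of delta = lam*_p - lam*_c
    G0 = np.mean(delta <= 0.0)           # empirical G(0)
    c3 = 2.0 * abs(G0 - 0.5)             # Eq. (11)
    g = _gauss_kde(delta, np.append(delta, 0.0))
    c4 = np.mean(g[:-1] > g[-1])         # Eq. (12): P^G(g(delta) > g(0))
    return {"c2": c2, "c3+": G0, "c3-": 1.0 - G0, "c3": c3, "c4": c4}
```

Two well-separated sample clouds give c<sup>3</sup> and c<sup>4</sup> values near 1, while overlapping clouds give values near 0.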

Example 1. To illustrate the theory, consider the random-effects model (1), with the variance parameters σ<sup>2</sup>, τ<sup>2</sup> assumed known and with μ having the improper prior π(μ) = 1. For simplicity, assume n<sub>i</sub> = n for all i. Suspecting the mth group of representing an outlier, let λ = λ<sub>m</sub> be the node of interest. Define the data splitting Y<sub>p</sub>, Y<sub>c</sub> by letting Y<sub>c</sub> = Y<sub>m</sub> = (Y<sub>m,1</sub>, …, Y<sub>m,n</sub>), and let β<sub>c</sub> = y<sub>c</sub>, β<sub>p</sub> = μ. Denoting the normal density function by φ, it is easy to see that

$$g\_c(\lambda) = f\_c(\lambda) = \varphi(\lambda; \bar{y}\_c, \sigma^2/n).$$

Furthermore, f<sub>p</sub>(λ; μ) = φ(λ; μ, τ<sup>2</sup>). Given y<sub>p</sub>, μ has the density

$$\pi(\mu|y\_p) = \varphi\Big(\mu; \sum\_{i=1}^{m-1} \bar{y}\_i/(m-1), \; (1/(m-1))\tau^2 + (1/(n(m-1)))\sigma^2\Big).$$

By a standard argument,

$$g\_p(\lambda) = \int f\_p(\lambda; \mu)\,\pi(\mu|y\_p)\,d\mu = \varphi\Big(\lambda; \sum\_{i=1}^{m-1} \bar{y}\_i/(m-1), \; (1 + 1/(m-1))\tau^2 + (1/(n(m-1)))\sigma^2\Big).$$

It follows that

$$g(\delta) = \varphi\Big(\delta; \sum\_{i=1}^{m-1} \bar{y}\_i/(m-1) - \bar{y}\_c, \; (m/(m-1))(\tau^2 + \sigma^2/n)\Big).$$

The conflict measures (Eqs. (9), (10), (11), and (12)) can hence be calculated analytically, with no simulation needed in this case.
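As a concrete illustration of this closed form, the sketch below (ours, not from the chapter) evaluates c<sup>2</sup><sub>λ</sub> and c<sup>3</sup><sub>λ</sub> for Example 1 from the group means, using the normal density of δ directly.

```python
import math
import numpy as np

def example1_conflict(y, sigma2, tau2):
    """Closed-form c2 (Eq. 9) and c3 (Eq. 11) for Example 1: normal
    random-effects model with known variances, last group suspected
    as an outlier.  y: (m, n) array of observations."""
    y = np.asarray(y, dtype=float)
    m, n = y.shape
    ybar = y.mean(axis=1)                     # group means
    mean_delta = ybar[:-1].mean() - ybar[-1]  # E(delta)
    var_delta = (m / (m - 1)) * (tau2 + sigma2 / n)
    # var(delta) = var_{G_p} + var_{G_c}, so Eq. (9) uses the same quantity
    c2 = mean_delta ** 2 / var_delta
    # G(0) = Phi(-E(delta)/sd(delta)) since g(delta) is normal
    G0 = 0.5 * (1.0 + math.erf(-mean_delta / math.sqrt(2.0 * var_delta)))
    c3 = 2.0 * abs(G0 - 0.5)
    return c2, c3
```

A strongly shifted last group yields c<sup>3</sup> near 1, while identical groups yield c<sup>2</sup> = c<sup>3</sup> = 0.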

In a simulation study of the c<sup>2</sup><sub>λ</sub>-measure in Ref. [9] using a warning level equal to the 95% quantile of the χ<sup>2</sup><sub>1</sub>-distribution, a false warning probability close to 5% is obtained for a normal random-effects model with unknown variance parameters as in Eq. (1), and also in similar random-effects models with heavy-tailed t- and uniformly distributed random effects. Also with respect to detection power, this measure performs well when compared to a calibrated version of the measure given in Ref. [8], if an optimal data splitting is used. Refs. [10] and [11] prove preexperimental uniformity of the conflict measures in various situations, i.e., their distributions as functions of a Y which is distributed according to the assumed model are uniform, regardless of the true value of the basic parameter. Another way of stating this is that we obtain a proper p-value by subtracting these measures from 1. These results are reviewed in Section 5 of the present chapter.

#### 2.4. Integrated information contributions as posterior distributions

In most cases, the conflict measures of Refs. [9] and [10] are based on simulated samples from G<sub>p</sub> and G<sub>c</sub>. Definitions 1 and 2 suggest obtaining such samples by running an MCMC algorithm to generate posterior samples of the unknown parameters in β<sub>p</sub> and β<sub>c</sub> and then generating samples λ*<sub>p</sub> and λ*<sub>c</sub> from the respective information contributions for each such parameter sample. If the information contributions are standard probability densities, this procedure is straightforward. If not, one may instead often use the fact that, under certain conditions on the data splitting, the distributions G<sub>p</sub> and G<sub>c</sub> are posterior distributions conditional on y<sub>p</sub> and y<sub>c</sub>, respectively, the latter based on the improper prior π(λ) = 1, independently of the coparents.

Theorem 1 Suppose that the data splitting satisfies


$$Y\_c = Y \cap \Big[ \bigcup\_{\gamma \in Ch(\lambda) \cap \beta\_c} Desc(\gamma) \Big], \quad Y\_p = Y - Y\_c, \tag{13}$$

the latter expression by abuse of notation meaning the components of Y not present in Y<sub>c</sub>. Assume λ and the coparents Pa(Ch(λ) ∩ β<sub>p</sub>) − λ are independent. We then have

$$g\_p(\lambda) = \pi(\lambda|y\_p)$$

and, specifying as prior density

$$\begin{aligned} \pi(\lambda | Pa(\text{Ch}(\lambda) \cap \beta\_c) - \lambda) &= 1, \\ g\_c(\lambda) &= \pi(\lambda|y\_c). \end{aligned} \tag{14}$$

The proof is given in Appendix A in the online supporting information for Ref. [11]. Specializing to the standard setup of Definition 1, where Ch(λ) ⊆ β<sub>c</sub>, we see that the requirement for Eq. (13) to hold is that Y<sub>c</sub> consists of all data descendant nodes of λ. In Ref. [9], this splitting was compared with two other splittings for c<sup>2</sup><sub>λ</sub> and found to be optimal with respect to detection power. This measure was also found to be well calibrated under this splitting.

### 3. Noninvariance and reparametrizations

The iic distributions and the corresponding conflict measures are parametrization dependent. Based on experience so far, the conflict measures seem to be fairly robust to changes in parametrization. However, this noninvariance can be handled in a theoretically satisfactory way under certain circumstances.

Let φ be the parameter, in a standard parametrization, corresponding to a specific node in the DAG. Suppose for simplicity that Y<sub>c</sub> = Ch(φ). Assume that there exists a sufficient statistic Y<sub>c</sub> and an alternative parametrization λ, being a strictly monotonic function λ(φ), such that Y<sub>c</sub> − λ is a pivotal quantity, i.e., the density for Y<sub>c</sub> given λ is of the form

$$p(y\_c|\lambda) = f\_{Y\_c}(y\_c|\lambda) = f\_0(y\_c - \lambda) \tag{15}$$

for some known density function f0. Such a parametrization will be considered as a canonical or reference parametrization if it exists, as opposed to the standard parametrization involving φ. Accordingly, the conflict measures given in Eqs. (9)–(12) are preferably based on this reference parametrization.

By Theorem 1, samples λ*<sub>c</sub> from G<sub>c</sub> may be obtained by MCMC as posterior samples from π(λ|y<sub>c</sub>) when the splitting satisfies Eq. (13) and the prior for λ satisfies Eq. (14), i.e., equals 1. According to an argument given in Section 1.3 of Ref. [18], such a prior expresses noninformativity for likelihoods of the form of Eq. (15). Computationally, we may, however, use the standard parametrization. When generating φ*<sub>c</sub> as posterior samples from π(φ|y<sub>c</sub>), the prior density |dλ/dφ| for φ must be used. Then, we may calculate λ*<sub>c</sub> = λ(φ*<sub>c</sub>). To represent the iic distribution G<sub>p</sub>(λ), we may calculate λ*<sub>p</sub> = λ(φ*<sub>p</sub>) for samples φ*<sub>p</sub> from π(φ|y<sub>p</sub>) according to the given model. Now, the c<sup>4</sup><sub>λ</sub>-measure can be estimated from Eq. (12), using a kernel density estimate of g(δ) based on corresponding samples δ* = λ*<sub>p</sub> − λ*<sub>c</sub>. However, if we limit attention to the c<sup>3</sup><sub>λ</sub>-measure (Eq. (11)) and its one-sided versions (Eq. (10)), we may use the samples from π(φ|y<sub>c</sub>) and π(φ|y<sub>p</sub>) directly. To see this, note that the condition λ*<sub>p</sub> ≥ λ*<sub>c</sub> is equivalent to the condition φ*<sub>p</sub> ≥ φ*<sub>c</sub> (assuming that λ is increasing as a function of φ). Hence, the probability G(0) that λ*<sub>p</sub> − λ*<sub>c</sub> ≤ 0 can be estimated as the proportion of sample values for which φ*<sub>p</sub> ≤ φ*<sub>c</sub>.
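In other words, for the c<sup>3</sup> family the reference parametrization never has to be constructed explicitly, because ordering is preserved under any strictly increasing λ(φ). A minimal sketch (ours, with hypothetical sample vectors):

```python
import numpy as np

def c3_from_standard_samples(phi_p, phi_c):
    """Estimate c3 (Eq. 11) from posterior samples of the standard
    parameter phi: for strictly increasing lambda(phi),
    G(0) = P(lambda*_p <= lambda*_c) = P(phi*_p <= phi*_c)."""
    phi_p = np.asarray(phi_p, dtype=float)
    phi_c = np.asarray(phi_c, dtype=float)
    n = min(len(phi_p), len(phi_c))
    G0 = np.mean(phi_p[:n] <= phi_c[:n])  # empirical G(0)
    return 2.0 * abs(G0 - 0.5)
```

Applying any strictly increasing transform (e.g. the logarithm) to both sample vectors leaves the estimate unchanged, which is exactly the invariance exploited above.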

### 4. Extensions to deterministic nodes: Relation to cross-validation, prediction and hypothesis testing

#### 4.1. Cross-validation and data node conflict


The model variables Y are represented by the bottom nodes in the DAG describing the hierarchical model. The framework can be extended to also cover conflict concerning these nodes. In this way, cross-validation can be viewed as a special case of the conflict measure approach.

Let Y<sub>c</sub> be an element in the vector Y of observable random variables. We define the prior iic density g<sub>p</sub>(y<sub>c</sub>) exactly as in Eq. (6), with λ replaced by y<sub>c</sub>. The Dirac measure at the observed value y<sub>c</sub> represents a degenerate iic information contribution about Y<sub>c</sub>. This leads to the following definitions:

$$c\_{y\_c}^{3+} = G\_p(y\_c), \quad c\_{y\_c}^{3-} = \overline{G}\_p(y\_c), \tag{16}$$

$$c\_{y\_c}^{3} = 1 - 2\min(G\_p(y\_c), \overline{G}\_p(y\_c)), \tag{17}$$

$$c\_{y\_c}^{4} = P^{G\_p}(g\_p(Y\_c) \geq g\_p(y\_c)). \tag{18}$$

The measures (Eqs. (16)–(18)) are called data node conflict measures. To see that these definitions are consistent with Eqs. (10)–(12), note that λ*<sub>p</sub> corresponds to Y<sub>c</sub>, and λ*<sub>c</sub> is deterministic and corresponds to y<sub>c</sub>. We define X = Y<sub>c</sub> − y<sub>c</sub>, corresponding to δ. We then have g(x) = g<sub>p</sub>(x + y<sub>c</sub>). Hence,

$$G(0) = \int\_{-\infty}^{0} g(x)dx = \int\_{-\infty}^{y\_c} g\_p(y)dy = G\_p(y\_c),$$

and accordingly, Ḡ(0) = Ḡ<sub>p</sub>(y<sub>c</sub>). It follows that Eqs. (16) and (17) are special cases of Eqs. (10) and (11). Moreover,

$$P^{G}(g(X) \geq g(0)) = P^{G\_p}(g\_p(Y\_c) \geq g\_p(y\_c)),$$

showing that Eq. (18) is a special case of Eq. (12).
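Computationally, the data node measures reduce to comparing the observed y<sub>c</sub> with draws from its iic distribution based on the remaining data, e.g. posterior predictive draws. A minimal sketch with our own (hypothetical) naming:

```python
import numpy as np

def data_node_conflict(yc_draws, yc_obs):
    """Empirical data node conflict measures (Eqs. 16 and 17):
    yc_draws are samples from G_p, yc_obs is the observed value y_c."""
    yc_draws = np.asarray(yc_draws, dtype=float)
    Gp = np.mean(yc_draws <= yc_obs)  # empirical G_p(y_c)
    return {"c3+": Gp, "c3-": 1.0 - Gp,
            "c3": 1.0 - 2.0 * min(Gp, 1.0 - Gp)}
```

An observation in the middle of the draws gives c<sup>3</sup> near 0; an observation beyond all draws gives c<sup>3</sup> = 1.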

Furthermore, this correspondence between the data node conflict measures (Eqs. (16) and (17)) and the parameter node conflict measures (Eqs. (10) and (11)) can be used to motivate these latter measures. We will treat the c<sup>3+</sup> measure as an example. Consider again a parameter node λ. If λ were actually observable and known to take the value λ<sub>c</sub>, the data node version of the c<sup>3+</sup> measure could be used to measure deviations toward the right tail of G<sub>p</sub> as

$$G\_p(\lambda\_c) = \int\_{-\infty}^{\lambda\_c} g\_p(\lambda) d\lambda = \int\_{-\infty}^{0} g\_p(\delta + \lambda\_c) d\delta.$$

Now λ is in reality not known, but we can take the expectation of this conflict with respect to the distribution G<sub>c</sub>, which reflects the uncertainty about λ when influence from data y<sub>p</sub> is removed. The result is the following theorem:

#### Theorem 2

$$E^{G\_c}(G\_p(\lambda)) = c\_{\lambda}^{3+}.$$

Proof:

$$\begin{aligned} E^{G\_c}(G\_p(\lambda)) &= \int\_{-\infty}^{\infty} g\_c(\lambda) \left( \int\_{-\infty}^{0} g\_p(\delta + \lambda) d\delta \right) d\lambda = \int\_{-\infty}^{0} \left( \int\_{-\infty}^{\infty} g\_p(\delta + \lambda) g\_c(\lambda) d\lambda \right) d\delta \\ &= \int\_{-\infty}^{0} g(\delta) d\delta = G(0) = c\_{\lambda}^{3+} \end{aligned}$$

by Eq. (10).
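Theorem 2 is easy to check numerically. With normal iic distributions G<sub>p</sub> = N(a, s<sup>2</sup>) and G<sub>c</sub> = N(b, t<sup>2</sup>), both sides equal Φ((b − a)/√(s<sup>2</sup> + t<sup>2</sup>)); the sketch below (ours, illustrative values) compares a Monte Carlo average of G<sub>p</sub>(λ) under G<sub>c</sub> with this closed form.

```python
import math
import numpy as np

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def c3_plus_normal(a, s2, b, t2):
    """c3+ = G(0) when G_p = N(a, s2) and G_c = N(b, t2)."""
    return phi((b - a) / math.sqrt(s2 + t2))

def expected_Gp_under_Gc(a, s2, b, t2, n=100_000, seed=0):
    """Monte Carlo estimate of E^{G_c}(G_p(lambda)) from Theorem 2."""
    lam = np.random.default_rng(seed).normal(b, math.sqrt(t2), n)
    return float(np.mean([phi(v) for v in (lam - a) / math.sqrt(s2)]))
```

The Monte Carlo average agrees with the closed-form c<sup>3+</sup> up to sampling error.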

#### 4.2. Cross-validation and sufficient statistics

Suppose the node λ of interest is the parent of the subvector **Y**<sub>c</sub> of **Y**. Suppose also that the scalar Y<sub>c</sub> is a sufficient statistic based on **Y**<sub>c</sub>. Evidently then, the measures c<sup>3+</sup><sub>λ</sub> and c<sup>3+</sup><sub>Y<sub>c</sub></sub> address the same kind of possible conflict in the model. The following theorem, proved in Ref. [11], states that the two measures agree under certain conditions. This is a generalization of a result in Ref. [13], which also unnecessarily assumed symmetry for the conditional density of Y<sub>c</sub>.

Theorem 3 Suppose the conditional density for the scalar variable Y<sub>c</sub> given the parameter λ is of the form f<sub>Y<sub>c</sub></sub>(y|λ) = f<sub>c,0</sub>(y − λ). Then,

$$c\_{Y\_c}^{3+} = c\_{\lambda}^{3+}.$$

When a sufficient statistic exists, the cross-validatory p-value is considered by Ref. [13] as the gold standard, and the aim of their construction is to provide a measure which is generally applicable and matches cross-validation when a sufficient statistic exists.

#### 4.3. Prediction

As mentioned in Section 2, the c<sup>4</sup> measure can be used to assess conflict concerning vectors of nodes. Applying this at the data node level, we may assess the quality of predictions of a subvector **Y**<sub>c</sub> of **Y** based on a complementary subvector y<sub>p</sub> of observations. The relevant measure is given by Eq. (18), with Y<sub>c</sub> replaced by the vector **Y**<sub>c</sub>. This is particularly well suited to models where data accumulate as time evolves. Such a conflict measure can be used to assess the overall quality of the model. It can also be used as a tool for model comparison and model choice.

#### 4.4. Hypothesis testing


Suppose the top-level nodes μ appearing in Eq. (2) are assumed fixed and known according to the model, so that π(μ) is a Dirac measure at these fixed values of the components of μ. Hence, the DAG has deterministic nodes both at the top and at the bottom, namely the vectors μ and y, respectively. We may then check for a conflict concerning a component λ of μ by introducing a random version λ̃ of λ and contrasting the corresponding g<sub>c</sub>(λ̃) with the fixed value λ. The random λ̃ has the same children and coparents as λ, and the vector β<sub>c</sub>, the information contribution f<sub>c</sub>(λ̃; β<sub>c</sub>) and the iic density g<sub>c</sub> are defined as in Eqs. (4), (5) and (6). The respective conflict measures are defined as in Eqs. (16)–(18) with y<sub>c</sub> replaced by λ and with G<sub>p</sub> and g<sub>p</sub> replaced by G<sub>c</sub> and g<sub>c</sub>. If the model is rejected when the conflict exceeds a certain predefined warning level, this corresponds to a formal Bayesian test of the hypothesis λ̃ = λ. Using the conflict measure (Eq. (18)), we may put the whole vector μ to the test in this way.

### 5. Preexperimental uniformity of the conflict measures

In this section, we review some results concerning the distribution of the conflict measures. If c is one of the measures (Eqs. (10), (11), (12), (16), (17) or (18)), then preexperimentally, i.e., prior to observing the data y, c is a random variable taking a value in [0, 1]. A large value of c indicates a possible conflict in the model, and uniformity of c corresponds to 1 – c being a proper p-value. This does not mean that we propose a formal hypothesis testing procedure for model criticism, possibly even adjusted for multiple testing, nor that we think that a fixed significance level represents an appropriate criterion signaling the need for changing the model. A relatively large value of c may be accepted if there are convincing arguments for believing in a particular modeling aspect, while a less extreme value of c may indicate a need for adjustments in modeling aspects that are considered questionable for other reasons. But the terms "relatively large" and "less extreme" must refer to a meaningful common scale. In our view, uniformity of the conflict measure under all sources of uncertainty is the natural ideal criterion for being a well-calibrated conflict measure, the fulfillment of which ensures comparable assessment of the level of conflict across models. This means that we aim for preexperimental uniformity in cases where the prior distribution is highly noninformative, and also, as discussed in the following subsection, in cases where an informative prior represents part of the randomness in the data-generating process (aleatory uncertainty) rather than subjective (epistemic) uncertainty about the location of a fixed but unknown λ. In this chapter, we limit attention to situations where exact uniformity is achieved. The pivotality condition (Eq. (15)) turns out to be a key assumption needed to obtain such exact results. Refs. [10] and [12] provide some examples where exact uniformity is achieved in other cases.

#### 5.1. Data-prior conflict

Consider the model

$$\mathbf{Y} \sim F\_{\mathbf{Y}}(\mathbf{y}|\lambda), \lambda \sim F\_{\lambda}(\lambda),$$

where F<sub>λ</sub> is an arbitrary informative prior distribution. Here, we think of this prior distribution as representing aleatory rather than epistemic uncertainty. The corresponding densities are denoted by f<sub>Y</sub> and f<sub>λ</sub>. If contrasting the prior density with the likelihood f<sub>Y</sub>(y|λ) indicates a conflict between the prior and likelihood information contributions, we consider this a data-prior conflict. The following theorem, proved in Ref. [11], deals with this kind of conflict. Note that in this situation, the Y<sub>p</sub> part of the data splitting is empty.

Theorem 4 Suppose the conditional density for the scalar variable Y given the parameter λ is of the form f<sub>Y</sub>(y|λ) = f<sub>0</sub>(y − λ) and that λ is generated from an arbitrary informative prior density f<sub>λ</sub>(λ). Then, the data-prior conflict measures about λ are preexperimentally uniformly distributed for both the c<sup>3</sup><sub>λ</sub>- and c<sup>4</sup><sub>λ</sub>-measures.

The theorem obviously applies to the location parameter of normal and t-distributions with fixed variance parameters, as well as the location parameter in the skew normal distribution [19]. If the vector Y consists of IID normal variables, the theorem also applies to the location parameter, using as scalar variable the sufficient statistic Y̅. If the n components of Y are IID exponentially distributed with failure rate λ, their sum is a sufficient statistic that is gamma distributed with shape parameter n and scale parameter λ. We may then use the fact that for a variable Y which is gamma distributed with known shape parameter and unknown scale parameter λ, the quantity log(Y) − log(λ) is a pivotal statistic, and uniformity is obtained by combining Theorem 4 with the approach of Section 3. In the standard parametrization, the appropriate prior distribution is π(λ) = 1/λ. Details are given in Ref. [11], which also deals with the gamma, inverse gamma, Weibull and lognormal distributions in a similar way.
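The pivotality of log(Y) − log(λ) is easy to verify by simulation; the sketch below (ours, with arbitrary illustrative values) checks that its quantiles do not depend on λ when Y is gamma distributed with known shape and scale λ.

```python
import numpy as np

def log_pivot_quantiles(shape=3.0, scales=(0.5, 2.0, 7.0),
                        n=100_000, seed=0):
    """Quantiles of log(Y) - log(lambda) for Y ~ Gamma(shape, scale=lambda).
    Pivotality means the rows should agree up to Monte Carlo error."""
    rng = np.random.default_rng(seed)
    rows = []
    for lam in scales:
        z = np.log(rng.gamma(shape, lam, n)) - np.log(lam)
        rows.append(np.quantile(z, [0.1, 0.5, 0.9]))
    return np.array(rows)
```

Each row corresponds to a different scale λ, and the quantiles agree across rows, as pivotality requires.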

#### 5.2. Data-data conflict

Suppose all components of Y have distributions determined by the same parameter λ. Suppose we want to contrast information contributions from separate parts of Y about λ and define the splitting (Y<sub>p</sub>, Y<sub>c</sub>) accordingly. Focusing on this kind of possible conflict, we assume complete prior ignorance about λ and accordingly assume that λ has the improper prior π(λ) = 1. Hence, recalling Eqs. (7) and (8), we contrast the information in f<sub>c</sub>(λ; Y<sub>c</sub>) with that in f<sub>p</sub>(λ; Y<sub>p</sub>). We use the term data-data conflict in this context, since there is no prior information incorporated in f<sub>p</sub>, and the two information contributions play symmetric roles. However, as a particular application, one may think of Y<sub>c</sub> as a scalar variable representing a possible outlier.

The following theorem is proved in Ref. [11].

Theorem 5 Suppose that the conditional densities for the scalar variables Y<sub>p</sub> and Y<sub>c</sub> given the parameter λ are of the form f<sub>Y<sub>p</sub></sub>(y|λ) = f<sub>p,0</sub>(y − λ), f<sub>Y<sub>c</sub></sub>(y|λ) = f<sub>c,0</sub>(y − λ).

Assume λ has the improper prior π(λ) = 1. Then, the data-data conflict measures about λ are preexperimentally uniformly distributed for both the c<sup>3</sup><sub>λ</sub>- and c<sup>4</sup><sub>λ</sub>-measures.

Theorem 5 can be applied if the components of Y<sub>c</sub> and Y<sub>p</sub> are normally or lognormally distributed with known variance parameter, exponentially distributed, or gamma, inverse gamma or Weibull distributed with known shape parameter, since pivotal quantities based on sufficient statistics exist for these distributions.
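In the normal case the whole construction collapses to a closed form, which makes the uniformity in Theorem 5 easy to demonstrate. The sketch below (ours, with illustrative parameter values) simulates the preexperimental distribution of c<sup>3</sup><sub>λ</sub> for Y<sub>p</sub> ~ N(λ, s<sub>p</sub><sup>2</sup>), Y<sub>c</sub> ~ N(λ, s<sub>c</sub><sup>2</sup>).

```python
import math
import numpy as np

def simulate_c3_data_data(n_rep=50_000, lam=4.2, sp=1.0, sc=2.0, seed=0):
    """Preexperimental draws of c3 for the data-data conflict.  Under the
    flat prior, G_p = N(y_p, sp^2) and G_c = N(y_c, sc^2), so
    G(0) = Phi((y_c - y_p)/sqrt(sp^2 + sc^2)) and c3 = 2|G(0) - 1/2|."""
    rng = np.random.default_rng(seed)
    yp = rng.normal(lam, sp, n_rep)
    yc = rng.normal(lam, sc, n_rep)
    z = (yc - yp) / math.hypot(sp, sc)   # standard normal preexperimentally
    G0 = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    return 2.0 * np.abs(G0 - 0.5)        # one c3 value per replication
```

The returned values are close to Uniform[0, 1] regardless of the true λ, e.g. their empirical mean is close to 1/2.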

#### 5.3. Normal hierarchical models with fixed covariance matrices


Allowing for each y and ν appearing in Eq. (2) to be interpreted as vectors of nodes, we now assume that each conditional distribution in the decomposition (Eq. (2)) is multinormal with fixed and known covariance matrices. The random-effects model (Eq. (1)) is a simple example of this. We also assume that the top-level parameter vector μ has the improper prior 1 and that each linear mapping Pa(ν) → E(ν|Pa(ν)) has full rank.

Now let λ be any node in the model description. It is standard to verify that, regardless of how the vector of neighboring and coparent nodes β is decomposed into β<sub>p</sub>, containing Pa(λ), and β<sub>c</sub>, the densities f<sub>p</sub>(λ; β<sub>p</sub>) and f<sub>c</sub>(λ; β<sub>c</sub>) of Eqs. (5) and (8) are multinormal with fixed covariance matrices. Furthermore, this is true also for the iic densities g<sub>p</sub> and g<sub>c</sub> of Eq. (6), regardless of the data splitting. It follows that the density g of the difference δ between independent samples from g<sub>p</sub> and g<sub>c</sub> is multinormal with expectation E<sup>G</sup>(δ) = E<sup>G<sub>p</sub></sup>(λ) − E<sup>G<sub>c</sub></sup>(λ) and covariance matrix cov<sup>G</sup>(δ) = cov<sup>G<sub>p</sub></sup>(λ) + cov<sup>G<sub>c</sub></sup>(λ). Hence,

$$(\delta - E^{G}(\delta))^{t} \, \mathrm{cov}^{G}(\delta)^{-1} (\delta - E^{G}(\delta))$$

is χ<sup>2</sup>-distributed with n = dim(λ) degrees of freedom, and the probability under G that g(δ) > g(0) is easily seen to be Ψ<sub>n</sub>(E<sup>G</sup>(δ)<sup>t</sup> cov<sup>G</sup>(δ)<sup>−1</sup> E<sup>G</sup>(δ)), where Ψ<sub>n</sub> is the cumulative distribution function for the χ<sup>2</sup><sub>n</sub>-distribution. The preexperimental uniformity of this quantity is proved in Ref. [10].
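Given the multinormal E<sup>G</sup>(δ) and cov<sup>G</sup>(δ), the c<sup>4</sup> measure is thus just a χ<sup>2</sup> cdf evaluation. The sketch below (ours) computes Ψ<sub>n</sub>(E<sup>G</sup>(δ)<sup>t</sup> cov<sup>G</sup>(δ)<sup>−1</sup> E<sup>G</sup>(δ)) with a small series for the regularized incomplete gamma function, to stay dependency-free; `scipy.stats.chi2.cdf` would do the same.

```python
import math
import numpy as np

def chi2_cdf(x, n, tol=1e-15):
    """Psi_n(x): chi^2_n cdf via the lower incomplete gamma series
    P(a, x/2) with a = n/2 (adequate for moderate x)."""
    a, x2 = n / 2.0, x / 2.0
    if x2 <= 0.0:
        return 0.0
    term = total = 1.0 / a
    k = 1
    while term > tol * total and k < 10_000:
        term *= x2 / (a + k)
        total += term
        k += 1
    return math.exp(-x2 + a * math.log(x2) - math.lgamma(a)) * total

def c4_multinormal(mean_delta, cov_delta):
    """c4 (Eq. 12) for multinormal delta:
    Psi_n(E(delta)^t cov(delta)^{-1} E(delta)), n = dim(lambda)."""
    m = np.atleast_1d(np.asarray(mean_delta, dtype=float))
    C = np.atleast_2d(np.asarray(cov_delta, dtype=float))
    q = float(m @ np.linalg.solve(C, m))
    return chi2_cdf(q, m.size)
```

For dim(λ) = 1 this reduces to the familiar identity Ψ<sub>1</sub>(z<sup>2</sup>) = 2Φ(|z|) − 1.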

Theorem 6. Consider a hierarchical normal model as described above.


If λ in (i) or $Y^c$ in (ii) is one-dimensional, then G is symmetric and unimodal, and therefore the respective $c^3$-measures are defined and coincide with the $c^4$-measures. Gåsemyr et al. [10] also show that in that case the $c^{3+}$- and $c^{3-}$-measures are uniformly distributed preexperimentally.

Example 2. Consider the following DAG model, a regression model with randomly varying regression coefficients.

$$Y_{i,j} \sim \mathcal{N}(\mathbf{X}_{i,j}^{t} \boldsymbol{\xi}_i, \sigma^2), \quad \boldsymbol{\xi}_i \sim \mathcal{N}(\boldsymbol{\xi}, \boldsymbol{\Omega}), \quad j = 1, \dots, n, \; i = 1, \dots, m, \quad \pi(\boldsymbol{\xi}) \propto 1. \tag{19}$$

The m units could be groups of individuals, with $y_{i,j}$ the measurement for a group member with individual covariate vector $\mathbf{X}_{i,j}$, or individuals with the successive $y_{i,j}$ representing repeated measurements over time. In this model, we could check for a possible exceptional behavior of the mth unit by means of the conflict measure $c^4_{\xi_m}$. With a data splitting for which $Y^c = Y_m = (Y_{m,1}, \dots, Y_{m,n})$, the conditions for Theorem 6, part (i), are satisfied if $\dim(\xi) \le n$, and the measure is preexperimentally uniformly distributed.
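Example 2 can be simulated directly. The sketch below uses a scalar ξ for readability; the values ξ = 2, Ω = 0.25, σ² = 1 and the covariate x = 3 are illustrative assumptions, not from the chapter. It draws units and repeated measurements from Eq. (19) and checks the implied marginal moments of Y, namely $E[Y] = x\,\xi$ and $\mathrm{Var}(Y) = x^2 \Omega + \sigma^2$.

```python
import random

# Minimal simulation sketch of the random-coefficients model in Eq. (19),
# with scalar xi; all numeric values are assumed for illustration.
random.seed(1)

xi_top, omega_sd, sigma = 2.0, 0.5, 1.0   # xi, sqrt(Omega), sqrt(sigma^2)
x = 3.0                                    # covariate
m, n_rep = 2000, 20                        # m units, n repeated measurements each

ys = []
for _ in range(m):
    xi_i = random.gauss(xi_top, omega_sd)         # xi_i ~ N(xi, Omega)
    for _ in range(n_rep):
        ys.append(random.gauss(x * xi_i, sigma))  # Y_ij ~ N(x^t xi_i, sigma^2)

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
# Marginally, E[Y] = x * xi = 6 and Var(Y) = x^2 * Omega + sigma^2 = 3.25.
print(round(mean_y, 2), round(var_y, 2))
```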

### 6. Concluding remarks

The assumption of fixed covariance matrices in the previous subsection is admittedly quite restrictive. In general, the presence of unknown nuisance parameters, such as parameters describing the covariance matrices in a normal model, makes the derivation of exact uniformity at least difficult and often impossible. Promising approximate results are reported in Ref. [9] for the closely related $c^2_\lambda$ measure. Further empirical studies are needed in order to examine to what extent the conflict measures are approximately uniformly distributed in other situations. As an informal tool to be used in conjunction with subject matter insight, the conflict measure approach does not require exact uniformity in order to be useful.

### Author details

Jørund I. Gåsemyr\* and Bent Natvig

\*Address all correspondence to: gaasemyr@math.uio.no

University of Oslo, Norway

### References


[1] Box GEP. Sampling and Bayes' inference in scientific modelling and robustness (with discussion and rejoinder). Journal of the Royal Statistical Society, Series A. 1980;143:383-430

[2] Bayarri MJ, Castellanos ME. Bayesian checking of the second levels of hierarchical models. Statistical Science. 2007;22:322-343

[3] Gelman A, Meng X-L, Stern H. Posterior predictive assessment of model fitness via realized discrepancies (with discussion and rejoinder). Statistica Sinica. 1996;6:733-807

[4] Bayarri MJ, Berger JO. P values in composite null models (with discussion). Journal of the American Statistical Association. 2000;95:1127-1142

[5] Hjort NL, Dahl FA, Steinbakk GH. Post-processing posterior predictive p-values. Journal of the American Statistical Association. 2006;101:1157-1174

[6] Dahl FA. On the conservativeness of posterior predictive p-values. Statistics and Probability Letters. 2006;76:1170-1174

[7] Dey D, Gelfand A, Swartz T, Vlachos P. A simulation-intensive approach for checking hierarchical models. Test. 1998;7:325-346



## **Classifying by Bayesian Method and Some Applications**

DOI: 10.5772/intechopen.70052


### Tai Vovan

Additional information is available at the end of the chapter


#### Abstract

This chapter summarizes and extends some results on the classification problem solved by the Bayesian method. We present the classification principle and the Bayes error, and establish its relationships with other measures. The practical determination of the Bayes error in one and several dimensions is also considered. Based on the training set and the object to be classified, we propose an algorithm for determining the prior probabilities that can reduce the Bayes error. This algorithm has been implemented as a MATLAB procedure that works well with real data. The proposed algorithm is applied in three domains, biology, medicine, and economics, through specific problems. For data sets with different characteristics, the proposed algorithm always gives the best results in comparison with existing ones. Furthermore, the examples show the feasibility and the practical potential of the studied problem.

Keywords: Bayesian method, classification, error, prior, application

### 1. Introduction

The classification problem is one of the main subdomains of discriminant analysis and is closely related to many fields in statistics. Classification is the assignment of an element to the appropriate population in a set of known populations, based on certain observed variables. It is an important development direction of multivariate statistics and has applications in many different fields [25, 27]. Recently, this problem has attracted the interest of many statisticians in both theoretical and applied areas [14–18, 22–25]. According to Tai [22], there are four main methods for solving the classification problem: the Fisher method [6, 12], the logistic regression method [8], the support vector machine (SVM) method [3], and the Bayesian method [17]. Because the Bayesian method does not require the data to be normal and can classify into two or more populations, it has many advantages [22–25]. Therefore, it has been used by many scientists in their research.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Given k populations $\{w_i\}$ with probability density functions (pdfs) $\{f_i\}$ and prior probabilities $\{q_i\}$, i = 1, 2, …, k, where $q_i \in (0, 1)$ and $\sum_{i=1}^{k} q_i = 1$, Pham-Gia et al. [17] used the maximum function of the pdfs as a tool to study the Bayesian method and obtained important results. The classification principle and the Bayes error were established based on $g_{\max}(\mathbf{x}) = \max\{q_1 f_1(\mathbf{x}), q_2 f_2(\mathbf{x}), \dots, q_k f_k(\mathbf{x})\}$. The relationships between the upper and lower bounds of the Bayes error and the $L^1$-distance of the pdfs and the overlap coefficient of the pdfs were established. The function $g_{\max}(\mathbf{x})$ plays a very important role in the classification problem by the Bayesian method, and Pham-Gia et al. [17] continued to do research on it. Using the MATLAB software, Pham-Gia et al. [18] succeeded in identifying $g_{\max}(\mathbf{x})$ for some cases of the bivariate normal distribution. In a similar development, Tai [22] proposed the $L^1$-distance of the $\{q_i f_i(\mathbf{x})\}$ and established its relationship with the Bayes error. This distance is also used to calculate the Bayes error as well as to classify new elements. This research has been applied to classifying the ability of bank customers to repay debt. However, we think that the survey of research relevant to the Bayesian approach is not yet complete; there are further relations between the Bayes error and other statistical measures.

The Bayesian method has many advantages. However, to our knowledge, its field of practical applications is narrower than that of the other methods. We can find many applications in banking and medicine using the Fisher, SVM, and logistic methods [1, 3, 8, 12]. Recently, all statistical software can effectively and quickly process the classification of large and multivariate data sets using any of the three methods mentioned above, whereas the Bayesian method does not have this advantage. The causes of this problem are the ambiguity in determining the prior probabilities, the difficulty of estimating the pdfs, and the complexity of calculating the Bayes error. Although all these issues have been discussed by many authors, optimal methods have yet to be found [22, 25]. In this chapter, we consider the estimation of the pdfs and the calculation of the Bayes error for practical application. We also address how to determine the prior probabilities. In the case of no information, we normally choose the prior probabilities from the uniform distribution. If we have some type of past data or a training set, the prior probabilities are estimated either by the Laplace method, $q_i = (n_i + n/k)/(N + n)$, or by the sample frequencies, $q_i = n_i/N$, where $n_i$ and N are the numbers of elements in the ith population and in the training set, respectively, n is the number of dimensions, and k is the number of groups. The above-mentioned approaches have been studied and applied by many authors [14, 15, 22, 25]. We also propose an algorithm to determine the prior probabilities based on the training set, the object to be classified, and fuzzy cluster analysis. The proposed algorithm is applied to specific problems in biology, medicine, and economics, and has advantages over existing approaches. All calculations are performed by MATLAB procedures.
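The two standard prior estimates mentioned above can be computed in a few lines. The following sketch uses Python rather than the chapter's MATLAB, and the counts are invented for illustration:

```python
# Sketch of the two standard prior estimates: the Laplace method
# q_i = (n_i + n/k)/(N + n) and the sample frequencies q_i = n_i/N.
# The counts below are made up for illustration.
counts = [30, 50, 20]   # n_i: elements of each population in the training set
N = sum(counts)         # training-set size
k = len(counts)         # number of groups
n = 4                   # number of dimensions

freq_priors = [n_i / N for n_i in counts]
laplace_priors = [(n_i + n / k) / (N + n) for n_i in counts]

print([round(q, 4) for q in freq_priors])     # [0.3, 0.5, 0.2]
print([round(q, 4) for q in laplace_priors])  # [0.3013, 0.4936, 0.2051]
```

Both estimates sum to one; the Laplace version shrinks the frequencies slightly toward the uniform prior 1/k, which matters mostly for small training sets.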

The rest of this chapter is structured as follows. Section 2 presents the classification principle and the Bayes error; some results on the Bayes error are also established there. Section 3 resolves the problems related to the real application of the Bayesian method: the estimation of the pdfs and the determination of the Bayes error in the one-dimensional and multi-dimensional cases. This section also proposes an algorithm to determine the prior probabilities. Section 4 applies the proposed algorithm to real problems and compares its results with those obtained using existing approaches. Section 5 concludes this chapter.

### 2. Classifying by Bayesian method


The classification problem solved by the Bayesian method has been presented in many documents [15, 16, 27], where the classification principle and the Bayes error are established from Bayes' theorem. In this section, we present them via the maximum function of the $q_i f_i(\mathbf{x})$, i = 1, 2, …, k, which has advantages over existing approaches in real applications [17, 18, 21–25]. This section also establishes the upper and lower bounds of the Bayes error and its relationships with other measures in statistical pattern recognition.

#### 2.1. Classification principle and Bayes error

Given k populations $w_1, w_2, \dots, w_k$, let $q_i \in (0, 1)$ and $f_i(\mathbf{x})$ be the prior probability and the pdf of the ith population, respectively, i = 1, 2, …, k. According to Pham-Gia et al. [17], an element $\mathbf{x}_0$ will be assigned to $w_i$ if

$$\mathbf{g}\_i(\mathbf{x}\_0) = \mathbf{g}\_{\text{max}}(\mathbf{x}\_0), \ i = 1, 2, \dots, k \tag{1}$$

where $g_i(\mathbf{x}) = q_i f_i(\mathbf{x})$ and $g_{\max}(\mathbf{x}) = \max\left\{ q_1 f_1(\mathbf{x}), q_2 f_2(\mathbf{x}), \dots, q_k f_k(\mathbf{x}) \right\}$.
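The principle in Eq. (1) is straightforward to implement once the $q_i$ and $f_i$ are known. The sketch below uses Python rather than the chapter's MATLAB, and the two univariate normal populations are an assumed example: $\mathbf{x}_0$ is assigned to the population with the largest $g_i(\mathbf{x}_0)$.

```python
import math

# Hedged sketch of the classification principle in Eq. (1): assign x0 to the
# population whose g_i(x0) = q_i f_i(x0) is largest. The two univariate normal
# populations below are assumed purely for illustration.
def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

priors = [0.4, 0.6]                  # q_i
params = [(0.0, 1.0), (3.0, 1.0)]    # (mean, sd) of each f_i

def classify(x0):
    g = [q * normal_pdf(x0, mu, sd) for q, (mu, sd) in zip(priors, params)]
    return max(range(len(g)), key=g.__getitem__)  # index attaining g_max(x0)

print(classify(0.5), classify(2.8))  # -> 0 1
```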

Bayes error is given by the formula:

$$Pe^{(q)}_{1,2,\dots,k} = \sum_{i=1}^{k} \int_{\mathbb{R}^n \backslash R^n_i} q_i f_i(\mathbf{x}) d\mathbf{x} = 1 - \sum_{i=1}^{k} \int_{R^n_i} q_i f_i(\mathbf{x}) d\mathbf{x}, \tag{2}$$

where $R^n_i = \left\{ \mathbf{x} \mid q_i f_i(\mathbf{x}) > q_j f_j(\mathbf{x}), \; \forall j \ne i, \; i, j = 1, 2, \dots, k \right\}$ and $(q) = (q_1, q_2, \dots, q_k)$.

From Eq. (2), we can prove the following result:

$$\begin{split} &Pe^{(q)}\_{1,2,\ldots,k} = \sum\_{j=1}^{k} \int\_{\mathbb{R}^{n}\backslash\mathbb{R}^{n}\_{j}} q\_{j} f\_{j}(\mathbf{x}) d\mathbf{x} \\ & \quad = \sum\_{j=1}^{k} \left[ \int\_{\mathbb{R}^{n}} q\_{j} f\_{j}(\mathbf{x}) d\mathbf{x} - \int\_{\mathbb{R}^{n}\_{j}} \max\_{1 \le l \le k} \{q\_{l} f\_{l}(\mathbf{x})\} d\mathbf{x} \right] \\ & \quad = \int\_{\mathbb{R}^{n}} \sum\_{j=1}^{k} q\_{j} f\_{j}(\mathbf{x}) d\mathbf{x} - \sum\_{j=1}^{k} \int\_{\mathbb{R}^{n}\_{j}} \max\_{1 \le l \le k} \{q\_{l} f\_{l}(\mathbf{x})\} d\mathbf{x} \\ & \quad = 1 - \int\_{\mathbb{R}^{n}} \max\_{1 \le l \le k} \{q\_{l} f\_{l}(\mathbf{x})\} d\mathbf{x} \end{split}$$

or

$$P e\_{1,2,\ldots,k}^{(q)} = 1 - \int\_{\mathbb{R}^n} g\_{\max}(\mathbf{x}) d\mathbf{x}.\tag{3}$$

The correct probability is determined by $Ce^{(q)}_{1,2,\dots,k} = 1 - Pe^{(q)}_{1,2,\dots,k}$.

For k = 2, we have

$$Pe^{(q,1-q)}_{1,2} = \int_{\mathbb{R}^n} \min\left\{ qf_1(\mathbf{x}), (1-q)f_2(\mathbf{x}) \right\} d\mathbf{x} = \lambda^{(q,1-q)}_{1,2} = \frac{1}{2}\left[ 1 - \| qf_1, (1-q)f_2 \|_1 \right], \tag{4}$$

where

$\lambda^{(q,1-q)}_{1,2}$ is the overlap area measure of $qf_1(\mathbf{x})$ and $(1-q)f_2(\mathbf{x})$, and $\| qf_1, (1-q)f_2 \|_1 = \int_{\mathbb{R}^n} | qf_1(\mathbf{x}) - (1-q)f_2(\mathbf{x}) | d\mathbf{x}$.
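Eq. (4) can be verified numerically. The sketch below, an assumed example with two univariate normal pdfs and in Python rather than the chapter's MATLAB, computes both sides on a fine grid and confirms that they agree:

```python
import math

# Numerical check of Eq. (4) for k = 2 (assumed normal pdfs): the integral of
# min{q f1, (1-q) f2} equals (1/2)[1 - ||q f1, (1-q) f2||_1].
def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

q = 0.35
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.5)

# crude Riemann sums on a wide, fine grid
lo, hi, steps = -12.0, 14.0, 100000
h = (hi - lo) / steps
xs = [lo + i * h for i in range(steps)]
pe = sum(min(q * f1(x), (1 - q) * f2(x)) for x in xs) * h
l1 = sum(abs(q * f1(x) - (1 - q) * f2(x)) for x in xs) * h

print(round(pe, 4), round(0.5 * (1 - l1), 4))  # the two sides agree
```

The identity holds because $\min\{a, b\} = \frac{1}{2}(a + b) - \frac{1}{2}|a - b|$ and the mixture $qf_1 + (1-q)f_2$ integrates to one.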

#### 2.2. Some results about Bayes error

Theorem 1. Let $f_i(\mathbf{x})$, i = 1, 2, …, k, k ≥ 3, be k pdfs defined on $\mathbb{R}^n$, n ≥ 1, and let $q_i \in (0, 1)$. We have the following relationships between the Bayes error and other measures:

$$\text{i.}\quad Pe^{(q)}_{1,2,\dots,k} \le 1 - \frac{1}{k-1}\left( 1 - \prod_{j=1}^{k} q_j^{\alpha_j}\, D_T\{f_1, f_2, \dots, f_k\}^{\alpha} \right), \tag{5}$$

$$\text{ii.}\quad Pe^{(q)}_{1,2,\dots,k} \le \sum_{i<j} q_i^{\beta} q_j^{1-\beta}\, D_T\{f_i, f_j\}^{(\beta, 1-\beta)}, \tag{6}$$

$$\text{iii.}\quad \left\{ (k-1) - \sum_{i<j} \|g_i, g_j\|_1 \right\} \Big/ k \;\le\; Pe^{(q)}_{1,2,\dots,k} \;\le\; 1 - \frac{1}{2}\max_{i<j}\left\{ \|g_i, g_j\|_1 \right\} - \min_{i<j}\left\{ \frac{1}{2}(q_i + q_j) \right\}, \tag{7}$$

$$\text{iv.}\quad 0 \le Pe^{(q)}_{1,2,\dots,k} \le 1 - \max_i\{q_i\}, \tag{8}$$

where

$$\alpha = (\alpha_1, \alpha_2, \dots, \alpha_k); \; \alpha_j, \beta \in (0, 1), \; \sum_{j=1}^{k} \alpha_j = 1, \; i, j = 1, 2, \dots, k, \text{ and}$$

$$D_T\{f_1, f_2, \dots, f_k\}^{\alpha} = \int_{\mathbb{R}^n} \prod_{j=1}^{k} \left[ f_j(\mathbf{x}) \right]^{\alpha_j} d\mathbf{x} \text{ is the affinity of Toussaint [26].}$$

#### Proof:


i. For each i = 1, 2, …, k, we have

$$\left(\sum\_{j=1}^k q\_j f\_j\right)^{\alpha\_i} \ge (q\_i f\_i)^{\alpha\_i}, i = 1, 2, \dots, k.$$

Therefore,

$$\left( \sum_{j=1}^{k} q_j f_j \right)^{\alpha_1 + \alpha_2 + \dots + \alpha_k} \ge \prod_{j=1}^{k} \left( q_j f_j \right)^{\alpha_j} \Leftrightarrow \sum_{j=1}^{k} q_j f_j \ge \prod_{j=1}^{k} \left( q_j f_j \right)^{\alpha_j}. \tag{9}$$

On the other hand,

$$\left(\min\_{1 \le j \le k} \left\{ q\_j f\_j \right\} \right)^{\alpha\_1} \le \left(q\_1 f\_1 \right)^{\alpha\_1}, \dots, \dots, \left(\min\_{1 \le j \le k} \left\{ q\_j f\_j \right\} \right)^{\alpha\_k} \le \left(q\_k f\_k \right)^{\alpha\_k}.$$

So

$$\left(\min\_{1 \le j \le k} \left\{ q\_j f\_j \right\} \right)^{\alpha\_1 + \dots + \alpha\_k} \le \prod\_{j=1}^k \left( q\_j f\_j \right)^{\alpha\_j}.$$

or

$$\min_{1 \le j \le k} \left\{ q_j f_j \right\} \le \prod_{j=1}^{k} \left( q_j f_j \right)^{\alpha_j}. \tag{10}$$

Combining Eqs. (9) and (10), we obtain

$$0 \le \sum\_{j=1}^k q\_j f\_j - \prod\_{j=1}^k \left( q\_j f\_j \right)^{\alpha\_j} \le \sum\_{j=1}^k q\_j f\_j - \min\_{1 \le j \le k} \left\{ q\_j f\_j \right\}.$$

Because $\sum_{j=1}^{k} q_j f_j - \min_{1 \le j \le k}\left\{ q_j f_j \right\}$ includes (k−1) terms, we have

$$\sum\_{j=1}^k q\_j f\_j - \min\_{1 \le j \le k} \left\{ q\_j f\_j \right\} \le (k - 1) \max\_{1 \le j \le k} \left\{ q\_j f\_j \right\}.$$

Thus,

$$0 \le \sum\_{j=1}^k q\_j f\_j - \prod\_{j=1}^k \left( q\_j f\_j \right)^{\alpha\_j} \le (k-1) \max\_{1 \le j \le k} \left\{ q\_j f\_j \right\}.$$

Integrating the above relation, we obtain:

$$1 - \prod_{j=1}^{k} q_j^{\alpha_j} D_T\left\{ f_1, f_2, \dots, f_k \right\}^{\alpha} \le (k-1) \int_{\mathbb{R}^n} g_{\max}(\mathbf{x}) d\mathbf{x}. \tag{11}$$

Using $\int_{\mathbb{R}^n} g_{\max}(\mathbf{x})\, d\mathbf{x} = 1 - Pe^{(q)}_{1,2,\dots,k}$ in Eq. (11), we obtain Eq. (5).

ii. From Eq. (2), we have

$$\begin{split} Pe^{(q)}_{1,2,\dots,k} &= \sum_{j=1}^{k} \int_{\mathbb{R}^n \backslash R^n_j} q_j f_j(\mathbf{x}) d\mathbf{x} = \sum_{i=1}^{k} \sum_{j \ne i} \int_{R^n_i} \min\left\{ q_i f_i(\mathbf{x}), q_j f_j(\mathbf{x}) \right\} d\mathbf{x} \\ &\le \sum_{i<j} \int_{\mathbb{R}^n} \min\left\{ q_i f_i(\mathbf{x}), q_j f_j(\mathbf{x}) \right\} d\mathbf{x}. \end{split}$$

Since

$$\left[ \min\left\{ q_i f_i(\mathbf{x}), q_j f_j(\mathbf{x}) \right\} \right]^{\beta} \le \left( q_i f_i \right)^{\beta} \quad \text{and} \quad \left[ \min\left\{ q_i f_i(\mathbf{x}), q_j f_j(\mathbf{x}) \right\} \right]^{1-\beta} \le \left( q_j f_j \right)^{1-\beta},$$

then

$$\min \left\{ q\_{i} f\_{i}(\mathbf{x}), q\_{j} f\_{j}(\mathbf{x}) \right\} \le \left( q\_{i} f\_{i} \right)^{\beta} \left( q\_{j} f\_{j} \right)^{1-\beta}.$$

Integrating the above inequality, we obtain:

$$Pe^{(q)}_{1,2,\dots,k} \le \sum_{i<j} \int_{\mathbb{R}^n} \left( q_i f_i \right)^{\beta} \left( q_j f_j \right)^{1-\beta} d\mathbf{x} = \sum_{i<j} q_i^{\beta} q_j^{1-\beta} D_T\left\{ f_i, f_j \right\}^{(\beta, 1-\beta)},$$

which is Eq. (6).

iii. We have

$$\int_{\mathbb{R}^n} \max\left\{ g_1(\mathbf{x}), g_2(\mathbf{x}), \dots, g_k(\mathbf{x}) \right\} d\mathbf{x} \ge \max_{i<j} \int_{\mathbb{R}^n} \max\left\{ g_i(\mathbf{x}), g_j(\mathbf{x}) \right\} d\mathbf{x}.$$

On the other hand,

$$\max_{i<j} \int_{\mathbb{R}^n} \max\left\{ g_i(\mathbf{x}), g_j(\mathbf{x}) \right\} d\mathbf{x} = \max_{i<j}\left\{ \frac{1}{2}\|g_i, g_j\|_1 + \frac{1}{2}(q_i + q_j) \right\},$$

since $\max\{a, b\} = \frac{1}{2}(a + b) + \frac{1}{2}|a - b|$ and $\int_{\mathbb{R}^n} g_i(\mathbf{x}) d\mathbf{x} = q_i$.

Hence,


$$\int_{\mathbb{R}^n} g_{\max}(\mathbf{x}) d\mathbf{x} \ge \frac{1}{2}\max_{i<j}\left\{ \|g_i, g_j\|_1 + (q_i + q_j) \right\}. \tag{12}$$

We also have $\sum_{i<j} |g_i(\mathbf{x}) - g_j(\mathbf{x})| \ge \sum_{j \ne i^*} \left( g_{i^*}(\mathbf{x}) - g_j(\mathbf{x}) \right) = k\, g_{\max}(\mathbf{x}) - \sum_{i=1}^{k} g_i(\mathbf{x})$, where $i^*$ is an index with $g_{i^*}(\mathbf{x}) = g_{\max}(\mathbf{x})$.

Therefore,

$$\max\left\{ g_1, g_2, \dots, g_k \right\} \le \frac{1}{k}\left[ \sum_{i=1}^{k} g_i + \sum_{i<j} |g_i - g_j| \right]. \tag{13}$$


Since

$$\int\_{\mathbb{R}^n} g\_i(\mathbf{x})d\mathbf{x} = q\_i \text{ and } \sum\_{i=1}^k q\_i = 1 \text{, the inequality Eq. (13) becomes:}$$

$$\int_{\mathbb{R}^n} g_{\max}(\mathbf{x}) d\mathbf{x} \le \frac{1}{k}\left[ 1 + \sum_{i<j} \|g_i, g_j\|_1 \right]. \tag{14}$$

Replacing $\int_{\mathbb{R}^n} g_{\max}(\mathbf{x})\, d\mathbf{x} = 1 - Pe^{(q)}_{1,2,\dots,k}$ in Eqs. (12) and (14), and noting that $\max_{i<j}\left\{ \|g_i, g_j\|_1 + (q_i + q_j) \right\} \ge \max_{i<j}\left\{ \|g_i, g_j\|_1 \right\} + \min_{i<j}\left\{ q_i + q_j \right\}$, we have Eq. (7).

iv. We have

$$q_i f_i(\mathbf{x}) \le \max\left\{ q_1 f_1(\mathbf{x}), q_2 f_2(\mathbf{x}), \dots, q_k f_k(\mathbf{x}) \right\} \le \sum_{i=1}^{k} q_i f_i(\mathbf{x}) \text{ for all } i = 1, \dots, k.$$

Integrating the above relation, we obtain:

$$q_i \le \int_{\mathbb{R}^n} g_{\max}(\mathbf{x}) d\mathbf{x} \le 1.$$

The above inequality is true for all i = 1, …, k, so

$$\max\{q\_i\} \le \int\_{\mathbb{R}^n} \mathbf{g}\_{\max}(\mathbf{x}) d\mathbf{x} \le 1.$$

Replacing $\int_{\mathbb{R}^n} g_{\max}(\mathbf{x})\, d\mathbf{x} = 1 - Pe^{(q)}_{1,2,\dots,k}$ in the above relation, we have Eq. (8).

From the results of Eqs. (5) and (6), with $\alpha_1 = \alpha_2 = \dots = \alpha_k = 1/k$, we have the relationship between the Bayes error and the affinity of Matusita [11]. In particular, when k = 2, we have the relationship between $Pe^{(q,1-q)}_{1,2}$ and Hellinger's distance.

In addition, we also have relations between the Bayes error and the overlap coefficients, as well as the $L^1$-distance of $\{g_1(\mathbf{x}), g_2(\mathbf{x}), \dots, g_k(\mathbf{x})\}$ (see Ref. [22]). For the special case $q_1 = q_2 = \dots = q_k = 1/k$, expressions have been established relating the Bayes error to the $L^1$-distance of $\{f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_k(\mathbf{x})\}$, and relating $Pe^{(1/k)}_{1,2,\dots,k}$ to $Pe^{(1/(k+1))}_{1,2,\dots,k+1}$ (see Ref. [17]).

### 3. Related problems in applying of Bayesian method

To apply the Bayesian method in practice, we have to resolve three main problems: (i) determine the prior probabilities, (ii) compute the Bayes error, and (iii) estimate the pdfs. In this section, we propose an algorithm to solve (i) based on fuzzy cluster analysis and the object to be classified, which can reduce the Bayes error in comparison with traditional approaches. For (ii), the Bayes error is given by a closed expression in the general case and is determined by an algorithm that finds the maximum of the functions $g_i(\mathbf{x})$, i = 1, 2, …, k, in the one-dimensional case; the quasi-Monte Carlo method is proposed to compute the Bayes error in higher dimensions. For (iii), we review the estimation of the pdfs by the kernel function method, where the bandwidth parameter and the kernel function are specified.
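As an illustration of computing the Bayes error from Eq. (3), $Pe = 1 - \int g_{\max}(\mathbf{x})\, d\mathbf{x}$, by quasi-Monte Carlo integration, the sketch below uses a 2-D Halton sequence. This is not the chapter's MATLAB procedure; the two bivariate normal populations with unit covariance, means (0, 0) and (2, 0), and equal priors are an assumed example for which the exact error is $\Phi(-1) \approx 0.1587$.

```python
import math

# Quasi-Monte Carlo sketch for Pe = 1 - integral of g_max over R^n, with the
# integral approximated over a bounding box using a 2-D Halton sequence.
def halton(i, base):
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def normal2(x, y, mx, my):
    """Bivariate normal pdf with identity covariance, mean (mx, my)."""
    return math.exp(-0.5 * ((x - mx) ** 2 + (y - my) ** 2)) / (2 * math.pi)

priors = [0.5, 0.5]                     # assumed equal priors
means = [(0.0, 0.0), (2.0, 0.0)]        # assumed population means
lo_x, hi_x, lo_y, hi_y = -5.0, 7.0, -5.0, 5.0
vol = (hi_x - lo_x) * (hi_y - lo_y)

n_pts = 100000
acc = 0.0
for i in range(1, n_pts + 1):
    x = lo_x + (hi_x - lo_x) * halton(i, 2)
    y = lo_y + (hi_y - lo_y) * halton(i, 3)
    acc += max(q * normal2(x, y, mx, my) for q, (mx, my) in zip(priors, means))

pe = 1 - vol * acc / n_pts
print(round(pe, 3))   # close to the exact value 0.1587
```

Low-discrepancy points such as the Halton sequence typically converge faster than plain Monte Carlo for smooth integrands like $g_{\max}$.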

#### 3.1. Prior probability

In the n-dimensional space, given N populations $N^{(0)} = \left\{ W^{(0)}_1, W^{(0)}_2, \dots, W^{(0)}_N \right\}$ with data set $Z = [z_{ij}]_{n \times N}$, let $U = [\mu_{ik}]_{c \times N}$ be the matrix whose element $\mu_{ik}$ is the probability that the kth element belongs to $w_i$. We have $\mu_{ik} \in [0, 1]$, satisfying the following conditions:

$$\sum\_{i=1}^{c} \mu\_{ik} = 1,\ 0 < \sum\_{k=1}^{N} \mu\_{ik} < N,\ 1 \le i \le c,\ 1 \le k \le N.$$

We call

$$M_{fc} = \left\{ U = [\mu_{ik}]_{c \times N} \,\middle|\, \mu_{ik} \in [0, 1], \forall i, k; \; \sum_{i=1}^{c} \mu_{ik} = 1, \forall k; \; 0 < \sum_{k=1}^{N} \mu_{ik} < N, \forall i \right\} \tag{15}$$

be the fuzzy partitioning space of the c populations, and let

$D^2_{ikA} = \| \mathbf{z}_k - \mathbf{v}_i \|^2_A = (\mathbf{z}_k - \mathbf{v}_i)^T A (\mathbf{z}_k - \mathbf{v}_i)$ be the squared distance from the object $\mathbf{z}_k$ to the representative of the ith population. This representative is computed by the following formula:

$$\mathbf{v}\_{i} = \frac{\sum\_{k=1}^{N} (\mu\_{ik})^{m} \mathbf{z}\_{k}}{\sum\_{k=1}^{N} (\mu\_{ik})^{m}}, \ 1 \le i \le c,\tag{16}$$

where m ∈ [1,∞) is the fuzziness parameter.
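Eq. (16) is a membership-weighted mean of the data. A minimal sketch, with invented one-dimensional data and memberships and in Python rather than MATLAB:

```python
# Sketch of Eq. (16): the representative v_i is the weighted mean of the data
# with weights mu_{ik}^m, where m is the fuzziness parameter. The data and
# membership values below are invented for illustration.
z = [1.0, 2.0, 8.0, 9.0]              # data objects z_k
u = [[0.9, 0.8, 0.2, 0.1],            # mu_{1k}: memberships in population 1
     [0.1, 0.2, 0.8, 0.9]]            # mu_{2k}: memberships in population 2
m = 2.0                               # fuzziness parameter, m in [1, inf)

def representative(mu_row):
    w = [mu ** m for mu in mu_row]
    return sum(wk * zk for wk, zk in zip(w, z)) / sum(w)

v = [representative(row) for row in u]
print([round(vi, 3) for vi in v])     # v_1 near the small z_k, v_2 near the large
```

Note that each column of the membership matrix sums to one, as required by the conditions above Eq. (15).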

Given a data set Z containing c known populations $w_1, w_2, \dots, w_c$, assume $\mathbf{x}_0$ is an object that we need to classify. To identify the prior probabilities when classifying $\mathbf{x}_0$, we propose the following prior probability by fuzzy clustering (PPC) algorithm:

#### Algorithm 1. Determining prior probability by fuzzy clustering (PPC)

Input: The data set $Z = [z_{ij}]_{n \times N}$ of c populations $\{w_1, w_2, \dots, w_c\}$, $\mathbf{x}_0$, ε, m, and the initial partition matrix $U = U^{(0)} = [\mu_{ij}]_{c \times (N+1)}$,

where μij = 1 if the jth object belongs to the wi and μij = 0 for the opposite, i ¼ 1,c;j ¼ 1,N, μij ¼ 1=c for j = N + 1. Output: The prior probability <sup>μ</sup><sup>i</sup>ðNþ1Þ, i <sup>¼</sup> <sup>1</sup>, <sup>2</sup>,…c:

Repeat:

In addition, we also have the relation between Bayes error and overlap coefficients as well as

To apply Bayesian method in reality, we have to resolve three main problems: (i) Determine prior probability, (ii) compute Bayes error, and (iii) estimate pdfs. In this section, we propose an algorithm to solve for (i) based on fuzzy cluster analysis and classified objective that can reduces Bayes error in comparing with traditional approaches. For (ii), Bayes error is established by closed expression for general case and determine it by an algorithm to find maximum function of gi(x), i = 1, 2, …, k for one dimension case. The quasi-Monte Carlo method is proposed to compute Bayes error in this section. For (iii), we review the problem to estimate pdfs by kernel function method where the bandwidth parameter and kernel function

<sup>1</sup> , W<sup>ð</sup>0<sup>Þ</sup>

<sup>μ</sup>ik <sup>¼</sup> <sup>1</sup>, <sup>∀</sup>k; <sup>0</sup> <sup>&</sup>lt; <sup>X</sup>

<sup>c</sup>�<sup>n</sup>, where <sup>μ</sup>ik is probability of the <sup>k</sup>th element belonging to <sup>w</sup>i. We

μik < N, 1 ≤ i ≤ c, 1 ≤ k ≤ N:

i¼1

( )

TAðzk � <sup>v</sup>i<sup>Þ</sup> is the matrix whose element <sup>d</sup><sup>2</sup>

tance from the object zk to the ith representative population. This representative is computed

n o

<sup>2</sup> , …, W<sup>ð</sup>0<sup>Þ</sup> N

N

k¼1

, 1 ≤ i ≤ c, (16)

μik, ∀i

ik is the square of dis-

had established expressions about relations between Bayes error and L<sup>1</sup>

<sup>1</sup>, <sup>2</sup>,…, <sup>k</sup>þ<sup>1</sup> (see Ref. [17]).

3. Related problems in applying of Bayesian method

In the <sup>n</sup>-dimensions space, given <sup>N</sup> populations <sup>N</sup>ð0<sup>Þ</sup> <sup>¼</sup> <sup>W</sup><sup>ð</sup>0<sup>Þ</sup>

� �

<sup>μ</sup>ik <sup>¼</sup> <sup>1</sup>, <sup>0</sup> <sup>&</sup>lt; <sup>X</sup>

v<sup>i</sup> ¼

X N

k¼1 ðμikÞ <sup>m</sup>z<sup>k</sup>

X N

k¼1 ðμikÞ m

N

k¼1

cxNjμik <sup>∈</sup>½0, <sup>1</sup>�, <sup>∀</sup>i, k;X<sup>c</sup>

have μik ∈ ½0, 1� and satisfies the following conditions:

Xc i¼1

Mzc ¼ U ¼ μik

be fuzzy partitioning space of k populations,

<sup>A</sup> ¼ ðzk � viÞ

–distance of {g1(x), g2(x), …, gk(x)} (see Ref. [22]). For special case: q<sup>1</sup> = q<sup>2</sup> = … = q<sup>k</sup> = 1/k, we

–distance of {f1(x), f2(x),

with data set Z =

(15)

L1

…, fk(x)}, Pe<sup>ð</sup>1=k<sup>Þ</sup>

46 Bayesian Inference

are specified.

We call

D2

ikA ¼ kzk � <sup>v</sup>ik<sup>2</sup>

by the following formula:

3.1. Prior probability

[zij]nxN. Let matrix U ¼ ½μik�

<sup>1</sup>,2,…, <sup>k</sup> and Pe<sup>ð</sup>1=ðkþ1ÞÞ

Find the representative object of wi: v<sup>i</sup> ¼ XN k¼1 ðμikÞ <sup>m</sup>z<sup>k</sup> XN ðμikÞ m , 1 ≤ i ≤ c

k¼1 Compute the matrix ½Dik� <sup>c</sup>�Nþ<sup>1</sup> (the pairwise distance between objects and representative objects). Update the new partition matrix U(new) by the following principle:

If Dik > 0 for all i ¼ 1, 2,…, c; k ¼ 1, 2,…, N þ 1 then

$$\begin{aligned} \mu\_{\mathbb{k}}(^{\text{new}}) &= \frac{1}{\sum\_{j=1}^{c} (D\_{\mathbb{k}}/D\_{\mathbb{k}})^{2/(m-1)}}, i \neq j = 1, 2, \dots, c \\ \text{Else}, \mu\_{\mathbb{k}}^{(\text{new})} &= 0 \\ \text{End}; \\ \text{Compute } S &= \|\boldsymbol{U}^{(\text{new})} - \boldsymbol{U}\| = \max\_{\mathbb{k}} \left( \left| \mu\_{\mathbb{k}}^{(\text{new})} - \mu\_{\mathbb{k}} \right| \right) \\ \boldsymbol{U} &= \boldsymbol{U}^{(\text{new})} \\ 1. \boldsymbol{\xi} &< \boldsymbol{0}. \end{aligned}$$

Until S < ε; The prior probability <sup>μ</sup><sup>i</sup>ðNþ1Þ, i <sup>¼</sup> <sup>1</sup>, <sup>2</sup>, …<sup>c</sup> (the final column of the matrix <sup>U</sup>);



At the end of the PPC algorithm, we obtain the prior probabilities of $\mathbf{x}_0$ from the last column of the partition matrix $U$ ($\mu_{i(N+1)}$, $i = 1, 2, \ldots, c$). The PPC algorithm helps us determine the prior probabilities via the degree of closeness between the classified object and the populations. Each object thus receives its own suitable prior probabilities.
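The chapter's implementation of Algorithm 1 is in MATLAB and is not reproduced here. The following is a minimal Python sketch of the same iteration for one-dimensional data; the function and variable names (`ppc_priors`, `groups`) are our own illustration, and the Euclidean distance is used for the matrix $A$.

```python
def ppc_priors(groups, x0, m=2.0, eps=1e-6, max_iter=200):
    """Sketch of the PPC algorithm (Algorithm 1) for one-dimensional data.

    groups : list of lists, the known objects of each population w_i.
    x0     : the object whose prior probabilities we want.
    Returns the last column of the final partition matrix U.
    """
    data = [z for g in groups for z in g] + [x0]
    c, K = len(groups), len(data)                 # K = N + 1 columns
    # Initial partition: crisp memberships for known objects, 1/c for x0.
    U = [[0.0] * K for _ in range(c)]
    col = 0
    for i, g in enumerate(groups):
        for _ in g:
            U[i][col] = 1.0
            col += 1
    for i in range(c):
        U[i][K - 1] = 1.0 / c
    for _ in range(max_iter):
        # Representatives v_i of each population (Eq. 16).
        v = []
        for i in range(c):
            den = sum(U[i][k] ** m for k in range(K))
            v.append(sum(U[i][k] ** m * data[k] for k in range(K)) / den)
        # Distance-based membership update.
        Unew = [[0.0] * K for _ in range(c)]
        for k in range(K):
            d = [abs(data[k] - v[i]) for i in range(c)]
            zero = [i for i in range(c) if d[i] == 0.0]
            for i in range(c):
                if zero:                          # object coincides with a center
                    Unew[i][k] = 1.0 if i == zero[0] else 0.0
                else:
                    s = sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                    Unew[i][k] = 1.0 / s
        S = max(abs(Unew[i][k] - U[i][k]) for i in range(c) for k in range(K))
        U = Unew
        if S < eps:
            break
    return [U[i][K - 1] for i in range(c)]
```

With two well-separated toy groups, a query point lying near the first group receives most of the prior mass; the exact values in Example 1 below depend on the authors' settings of $m$ and $\varepsilon$, which we do not reproduce.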

In this chapter, the Bayesian methods with prior probabilities calculated by the uniform distribution approach, the ratio-of-samples approach, the Laplace approach, and the proposed PPC algorithm are called BayesU, BayesR, BayesL, and BayesC, respectively.

Example 1. Given the exam marks (on a 10-point grading scale) of 20 students. Among them, nine students have marks lower than 5 ($w_1$: fail the exam) and 11 students have marks higher than 5 ($w_2$: pass the exam). The data are given in Table 1.

Assume that we need to classify the ninth object, $x_0 = 4.3$, into one of the two populations. Using the PPC algorithm, we obtain the following final partition matrix:

$$U = \begin{pmatrix} 0.957 & 0.973 & 0.981 & 0.993 & 1 & 0.997 & 0.997 & 0.830 & 0.321 & 0.290 & 0.158 & 0.1 & 0.1 & 0.01 & 0.009 & 0.037 & 0.045 & 0.054 & 0.062 & 0.724 \\ 0.043 & 0.027 & 0.019 & 0.007 & 0 & 0.003 & 0.003 & 0.170 & 0.679 & 0.710 & 0.842 & 0.9 & 0.9 & 0.99 & 0.991 & 0.963 & 0.955 & 0.946 & 0.938 & 0.276 \end{pmatrix}$$

This matrix shows that the prior probabilities when assigning the ninth object to $w_1$ and $w_2$ are 0.724 and 0.276, respectively. Meanwhile, the prior probabilities determined by BayesU, BayesR, and BayesL are (0.5; 0.5), (0.421; 0.579), and (0.429; 0.571), respectively.
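For comparison, the three traditional priors can be reproduced directly from the training group sizes (here 8 objects in $w_1$ and 11 in $w_2$ once the ninth object is held out). A small sketch of our own, assuming the usual Laplace smoothing $(n_i + 1)/(N + c)$, which reproduces the numbers above:

```python
def traditional_priors(counts):
    """Uniform, ratio-of-samples, and Laplace priors from group sizes."""
    c, N = len(counts), sum(counts)
    uniform = [1.0 / c] * c
    ratio = [n / N for n in counts]
    laplace = [(n + 1) / (N + c) for n in counts]
    return uniform, ratio, laplace

# Example 1 with the ninth object held out: 8 objects in w1, 11 in w2.
u, r, l = traditional_priors([8, 11])
```

This gives (0.5, 0.5), approximately (0.421, 0.579), and approximately (0.429, 0.571), matching BayesU, BayesR, and BayesL.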

From the data in Table 1, we estimate the pdfs $f_1(x)$ and $f_2(x)$ and compute the values $q_1 f_1(x)$ and $q_2 f_2(x)$, where $q_1$ and $q_2$ are the calculated prior probabilities. The results of classifying $x_0$ by the four approaches BayesU, BayesR, BayesL, and BayesC are given in Table 2.

Because the actual population of $x_0$ is $w_1$, only BayesC gives the correct result. The Bayes error of BayesC is also the smallest. Thus, in this example, the proposed method improves on the drawback of the traditional methods in determining the prior probabilities.


| Object | Mark | Group | Object | Mark | Group |
|--------|------|-------|--------|------|-------|
| 1 | 0.6 | $w_1$ | 11 | 5.6 | $w_2$ |
| 2 | 1.0 | $w_1$ | 12 | 6.1 | $w_2$ |
| 3 | 1.2 | $w_1$ | 13 | 6.4 | $w_2$ |
| 4 | 1.6 | $w_1$ | 14 | 6.4 | $w_2$ |
| 5 | 2.2 | $w_1$ | 15 | 7.3 | $w_2$ |
| 6 | 2.4 | $w_1$ | 16 | 8.4 | $w_2$ |
| 7 | 2.4 | $w_1$ | 17 | 9.2 | $w_2$ |
| 8 | 3.9 | $w_1$ | 18 | 9.4 | $w_2$ |
| 9 | 4.3 | $w_1$ | 19 | 9.6 | $w_2$ |
| 10 | 5.5 | $w_2$ | 20 | 9.8 | $w_2$ |

Table 1. The studied marks of 20 students and the actual classifications.


Table 2. The results when classifying the ninth object.

#### 3.2. Determining Bayes error

Theorem 2. Let $f_i(\mathbf{x})$, $i = 1, 2, \ldots, k$, $k \ge 3$, be $k$ pdfs defined on $\mathbb{R}^n$, $n \ge 1$, and let $q_i \in (0, 1)$ be the prior probabilities. Set

$$\begin{cases} R_1^n = \left\{ \mathbf{x} \in \mathbb{R}^n : q_1 f_1(\mathbf{x}) > q_j f_j(\mathbf{x}),\ 2 \le j \le k \right\}, \\ R_k^n = \left\{ \mathbf{x} \in \mathbb{R}^n : q_k f_k(\mathbf{x}) > q_j f_j(\mathbf{x}),\ 1 \le j \le k - 1 \right\}, \\ R_l^n = \left\{ \mathbf{x} \in \mathbb{R}^n : q_l f_l(\mathbf{x}) > q_i f_i(\mathbf{x}),\ 1 \le i \le k,\ i \ne l \right\},\ 2 \le l \le k - 1. \end{cases} \tag{17}$$

The Bayes error is determined by

$$Pe_{1,2,\ldots,k}^{(q)} = 1 - \int_{R_1^n} q_1 f_1(\mathbf{x}) d\mathbf{x} - \sum_{l=2}^{k-1} \int_{R_l^n} q_l f_l(\mathbf{x}) d\mathbf{x} - \int_{R_k^n} q_k f_k(\mathbf{x}) d\mathbf{x}. \tag{18}$$

#### Proof:


To obtain Eq. (18), we need to prove the following two results:

$$R\_i^n \cap R\_j^n = \phi, \ (1 \le i \ne j \le k)$$

and

$$\bigcup_{i=1}^{k} R_i^n = R_1^n \cup \left( \bigcup_{i=2}^{k-1} R_i^n \right) \cup R_k^n = \mathbb{R}^n,\qquad f_{\max}(\mathbf{x}) = f_i(\mathbf{x}),\ \forall \mathbf{x} \in R_i^n.$$

Let $\overline{A} = \mathbb{R}^n \backslash A$; we have

$$\overline{R}_{ij} = \left\{ \mathbf{x} \in \mathbb{R}^n : q_i f_i(\mathbf{x}) \le q_j f_j(\mathbf{x}) \right\},\quad R_{ij} = \left\{ \mathbf{x} \in \mathbb{R}^n : q_i f_i(\mathbf{x}) > q_j f_j(\mathbf{x}) \right\},\ (1 \le i, j \le k).$$

From Eq. (17), we obtain

$$R_1^n = \bigcap_{j=2}^{k} R_{1j},\quad R_l^n = \bigcap_{i \ne l} \overline{R}_{il},\ (2 \le l < k).$$

Therefore,

$$R_1^n \cap R_l^n = \left( \bigcap_{j=2}^{k} R_{1j} \right) \cap \left( \bigcap_{i \ne l} \overline{R}_{il} \right) \subset R_{1l} \cap \overline{R}_{1l} = \phi \Rightarrow R_1^n \cap R_l^n = \phi,\ (2 \le l < k).$$

On the other hand, by De Morgan's laws, we have

$$\overline{R_1^n \cup R_l^n} = \left( \bigcup_{j=2}^{k} \overline{R}_{1j} \right) \cap \left( \bigcup_{i \ne l} R_{il} \right) \subset \overline{R}_{1l} \cap R_{1l} = \phi \Rightarrow R_1^n \cup R_l^n = \mathbb{R}^n,\ (2 \le l < k).$$

Similarly,

$$R_k^n \cap R_l^n = \phi,\ (2 \le l < k),\qquad R_1^n \cap R_k^n = \phi,$$

so

$$\bigcup_{i=1}^{k} R_i^n = R_1^n \cup \left( \bigcup_{l=2}^{k-1} R_l^n \right) \cup R_k^n = \left( \bigcup_{l=2}^{k-1} \left( R_1^n \cup R_l^n \right) \right) \cup \left( \bigcup_{l=2}^{k-1} \left( R_k^n \cup R_l^n \right) \right) = \mathbb{R}^n \cup \mathbb{R}^n = \mathbb{R}^n \Rightarrow \bigcup_{i=1}^{k} R_i^n = \mathbb{R}^n.$$

In addition, from Eq. (17), we directly obtain

$$g_{\max}(\mathbf{x}) = g_i(\mathbf{x}),\ \forall \mathbf{x} \in R_i^n,\ (1 \le i \le k).$$

Hence, the probability of correct classification is $\int_{\mathbb{R}^n} g_{\max}(\mathbf{x}) d\mathbf{x} = \sum_{i=1}^{k} \int_{R_i^n} q_i f_i(\mathbf{x}) d\mathbf{x}$, which gives Eq. (18). ∎
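The decomposition just proved can be checked numerically. The sketch below is our own illustration (not the chapter's MATLAB code): each grid point is assigned to the region of Eq. (17) where $g_i(\mathbf{x}) = q_i f_i(\mathbf{x})$ dominates, and Eq. (18) is evaluated by the midpoint rule. For two equal-variance normals $N(0,1)$ and $N(2,1)$ with $q_1 = q_2 = 1/2$, the result should approach the known value $\Phi(-1) \approx 0.1587$.

```python
import math

def npdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_error(params, q, lo=-20.0, hi=30.0, steps=100000):
    """Pe of Eq. (18): each grid point x lies in the region R_i of Eq. (17)
    whose g_i(x) = q_i f_i(x) dominates, so integrating q_i f_i over R_i and
    summing amounts to integrating the pointwise maximum (midpoint rule)."""
    h = (hi - lo) / steps
    acc = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * h
        g = [qi * npdf(x, mu, sg) for qi, (mu, sg) in zip(q, params)]
        acc += max(g) * h          # contribution of the dominating region
    return 1.0 - acc
```

For instance, `bayes_error([(0.0, 1.0), (2.0, 1.0)], [0.5, 0.5])` is close to 0.1587, and the same routine handles $k \ge 3$ populations directly.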

For k = 2, q<sup>1</sup> = q<sup>2</sup> = 1/2, we consider the following two special cases:

i. If $f_1(x)$ and $f_2(x)$ are two one-dimensional normal pdfs ($N(\mu_i, \sigma_i)$, $i = 1, 2$); without loss of generality, we suppose that $\mu_1 < \mu_2$ (for $\mu_1 \ne \mu_2$) and $\sigma_1 < \sigma_2$ (for $\sigma_1 \ne \sigma_2$), then

$$Pe\_{1,2}^{(1/2,1/2)} = \begin{cases} \frac{1}{2} \left[ \int\_{-\infty}^{x\_1} f\_2(\mathbf{x}) d\mathbf{x} + \int\_{x\_1}^{+\infty} f\_1(\mathbf{x}) d\mathbf{x} \right], & \text{if } \sigma\_1 = \sigma\_2, \\\\ \frac{1}{2} \left[ \int\_{-\infty}^{x\_2} f\_1(\mathbf{x}) d\mathbf{x} + \int\_{x\_2}^{x\_3} f\_2(\mathbf{x}) d\mathbf{x} + \int\_{x\_3}^{+\infty} f\_1(\mathbf{x}) d\mathbf{x} \right], & \text{if } \sigma\_1 < \sigma\_2. \end{cases}$$

where

$$\begin{aligned} x_1 &= \frac{\mu_1 + \mu_2}{2},\quad x_2 = \frac{(\mu_1 \sigma_2^2 - \mu_2 \sigma_1^2) - \sigma_1 \sigma_2 \sqrt{\left(\mu_1 - \mu_2\right)^2 + K}}{\sigma_2^2 - \sigma_1^2}, \\ x_3 &= \frac{(\mu_1 \sigma_2^2 - \mu_2 \sigma_1^2) + \sigma_1 \sigma_2 \sqrt{\left(\mu_1 - \mu_2\right)^2 + K}}{\sigma_2^2 - \sigma_1^2},\quad K = 2(\sigma_2^2 - \sigma_1^2) \ln\left(\frac{\sigma_2}{\sigma_1}\right) \ge 0. \end{aligned}$$

For $\mu_1 = \mu_2 = \mu$, the above result becomes:

$$Pe_{1,2}^{(1/2,1/2)} = \begin{cases} \dfrac{1}{2}, & \text{if } \sigma_1 = \sigma_2, \\[2mm] \dfrac{1}{2} \left[ \displaystyle\int_{-\infty}^{x_4} f_1(x) dx + \int_{x_4}^{x_5} f_2(x) dx + \int_{x_5}^{+\infty} f_1(x) dx \right], & \text{if } \sigma_1 < \sigma_2, \end{cases}$$

$$\text{where } \mathbf{x}\_4 = \mu - \sigma\_1 \sigma\_2 \sqrt{E} \text{ and } \mathbf{x}\_5 = \mu + \sigma\_1 \sigma\_2 \sqrt{E} \text{ with } E = \frac{2}{\sigma\_2^2 - \sigma\_1^2} \ln\left(\frac{\sigma\_2}{\sigma\_1}\right) \ge 0.$$
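The cut points $x_2$ and $x_3$ above can be verified directly: at each of them the two densities (and hence, with equal priors, $g_1$ and $g_2$) coincide. A short check with illustrative parameters of our own choosing:

```python
import math

def npdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def cut_points(mu1, mu2, s1, s2):
    """x2 and x3 from the closed-form expressions above (assumes s1 < s2)."""
    K = 2.0 * (s2 ** 2 - s1 ** 2) * math.log(s2 / s1)
    root = s1 * s2 * math.sqrt((mu1 - mu2) ** 2 + K)
    a = mu1 * s2 ** 2 - mu2 * s1 ** 2
    d = s2 ** 2 - s1 ** 2
    return (a - root) / d, (a + root) / d

# At both cut points f1 and f2 agree, so the integrand of the Bayes error
# switches between the two densities exactly there.
x2, x3 = cut_points(0.0, 1.0, 1.0, 2.0)
```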

ii. If $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ are two $n$-dimensional normal pdfs ($N(\mu_i, \Sigma_i)$, $n \ge 2$, $i = 1, 2$), then

$$P e\_{1,2}^{(1/2,1/2)} = \frac{1}{2} \left[ \int\_{\mathbb{R}\_1} f\_2(\mathbf{x}) d\mathbf{x} + \int\_{\mathbb{R}\_2} f\_1(\mathbf{x}) d\mathbf{x} \right],$$

where


$$\begin{split} R_1^n &= \{ \mathbf{x} : d(\mathbf{x}) \le 0 \},\quad R_2^n = \{ \mathbf{x} : d(\mathbf{x}) > 0 \}, \\ d(\mathbf{x}) &= \left[ \mu_1^T (\Sigma_1)^{-1} - \mu_2^T (\Sigma_2)^{-1} \right] \mathbf{x} - \frac{1}{2} \mathbf{x}^T \left[ (\Sigma_1)^{-1} - (\Sigma_2)^{-1} \right] \mathbf{x} - m, \\ m &= \frac{1}{2} \left[ \ln \frac{|\Sigma_1|}{|\Sigma_2|} + \mu_1^T (\Sigma_1)^{-1} \mu_1 - \mu_2^T (\Sigma_2)^{-1} \mu_2 \right]. \end{split}$$

In the case of n = 2, the decision boundary d(x) = 0 can be a straight line, a parabola, an ellipse, or a hyperbola.
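The quadratic form $d(\mathbf{x})$ can be checked against the log-density ratio $\ln f_1(\mathbf{x}) - \ln f_2(\mathbf{x})$, with which it coincides term by term. A Python sketch for the bivariate case, with helper names of our own (the chapter's MATLAB code is not reproduced):

```python
import math

def inv2(M):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def mvnpdf2(x, mu, S):
    """Bivariate normal density N(mu, S)."""
    Si, det = inv2(S)
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = (dx[0] * (Si[0][0] * dx[0] + Si[0][1] * dx[1])
         + dx[1] * (Si[1][0] * dx[0] + Si[1][1] * dx[1]))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def d_quadratic(x, mu1, S1, mu2, S2):
    """The discriminant d(x) written out above."""
    S1i, det1 = inv2(S1)
    S2i, det2 = inv2(S2)
    def vecmat(v, M):          # v^T M (M symmetric here)
        return [v[0] * M[0][0] + v[1] * M[1][0], v[0] * M[0][1] + v[1] * M[1][1]]
    def qform(v, M):           # v^T M v
        w = vecmat(v, M)
        return w[0] * v[0] + w[1] * v[1]
    a, b = vecmat(mu1, S1i), vecmat(mu2, S2i)
    lin = (a[0] - b[0]) * x[0] + (a[1] - b[1]) * x[1]
    D = [[S1i[r][c] - S2i[r][c] for c in range(2)] for r in range(2)]
    m = 0.5 * (math.log(det1 / det2) + qform(mu1, S1i) - qform(mu2, S2i))
    return lin - 0.5 * qform(x, D) - m
```

The sign of $d(\mathbf{x})$ thus decides between the two populations under the region convention stated above.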

#### 3.3. Maximum function in the classification problem

To classify a new element by principle (1) and to determine the Bayes error by formula (3), we must find gmax(x). Some authors, such as Pham–Gia et al. [15, 17] and Tai [21, 22], have surveyed relationships between gmax(x) and related quantities of the classification problem. A specific expression for gmax(x) has been found in some special cases [18]. However, a general expression covering all cases is a complex problem that has not been solved yet.

Given k pdfs fi(x) and qi, i = 1, 2, …, k, with q<sup>1</sup> + q<sup>2</sup> + … + q<sup>k</sup> = 1, let gi(x) = qifi(x) and gmax(x) = max {gi(x)}. We are now interested in determining gmax(x).

#### (a) For one dimension

In this case, we can find gmax(x) by the following algorithm:

#### Algorithm 2. Find the gmax(x) function

Input: $g_i(x) = q_i f_i(x)$, where $f_i(x)$ and $q_i$ are the probability density function and the prior probability of $w_i$, $i = 1, 2, \ldots, k$, respectively.

Output: The $g_{\max}(x)$ function.

Find all roots of the equations $g_i(x) - g_j(x) = 0$, $i = 1, \ldots, k - 1$, $j = i + 1, \ldots, k$. Let $B$ be the set of all roots.

For $x_{lm} \in B$ (the root of the equation $g_l(x) - g_m(x) = 0$) do
  For $p \in \{1, 2, \ldots, k\} \backslash \{l, m\}$ do
    If $g_l(x_{lm}) < g_p(x_{lm})$ then $B = B \backslash \{x_{lm}\}$
  End
End

Arrange the elements of $B$ in order from smallest to largest: $B = \{x_1, x_2, \ldots, x_h\}$, $x_1 < x_2 < \cdots < x_h$.

```
(Determine the function gmax(x) on the interval (−∞, x1])
For i = 1 to k do
    If gi(x1 − ε1) = max{g1(x1 − ε1), g2(x1 − ε1), …, gk(x1 − ε1)} then
        gmax(x) = gi(x), for all x ∈ (−∞, x1]
    End
End
(Determine the function gmax(x) on the intervals (xj, xj+1], j = 1, …, h−1)
For i = 1 to k do
    For j = 1 to h−1 do
        If gi(xj + ε2) = max{g1(xj + ε2), g2(xj + ε2), …, gk(xj + ε2)} then
            gmax(x) = gi(x), for all x ∈ (xj, xj+1]
        End
    End
End
(Determine the function gmax(x) on the interval (xh, +∞))
For i = 1 to k do
    If gi(xh + ε3) = max{g1(xh + ε3), g2(xh + ε3), …, gk(xh + ε3)} then
        gmax(x) = gi(x), for all x ∈ (xh, +∞)
    End
End
```
In the above algorithm, ε1, ε2, ε<sup>3</sup> are the positive constants such that:

$$x_1 + \varepsilon_1 < x_2,\quad x_h - \varepsilon_3 > x_{h-1},\quad x_i - \varepsilon_2 > x_{i-1} \text{ and } x_i + \varepsilon_2 < x_{i+1}.$$

From this algorithm, we have written MATLAB code to find gmax(x). Once gmax(x) is determined, we can easily calculate the Bayes error using formula (3), as well as classify a new element by principle (1).

Example 2. Given seven populations having univariate normal pdfs {f1, f2,…, f7} with specific parameters as follows (Figure 1):

$$\begin{aligned} \mu\_1 &= 0.3, \mu\_2 = 4.0, \mu\_3 = 9.1, \mu\_4 = 1.9, \mu\_5 = 5.3, \mu\_6 = 8, \mu\_7 = 4.8, \\ \sigma\_1 &= 1.0, \sigma\_2 = 1.3, \sigma\_3 = 1.4, \sigma\_4 = 1.6, \sigma\_5 = 2, \sigma\_6 = 1.9, \sigma\_7 = 2.3. \end{aligned}$$

Using the codes written with $q_i = 1/7$, $g_i(x) = q_i f_i(x)$, $i = 1, 2, \ldots, 7$, we have the results:

$$g\_{\max}(\mathbf{x}) = \begin{cases} g\_1 & \text{if } \quad -1.28 < \mathbf{x} \le 0.99, \\ g\_2 & \text{if } \quad 2.58 < \mathbf{x} \le 4.89, \\ g\_3 & \text{if } \quad 8.30 < \mathbf{x} \le 12.52, \\ g\_4 & \text{if } \quad \{-7.86 < \mathbf{x} \le -1.28\} \cup \{0.99 < \mathbf{x} \le 2.58\}, \\ g\_5 & \text{if } \quad 4.89 < \mathbf{x} \le 6.65, \\ g\_6 & \text{if } \quad \{6.65 < \mathbf{x} \le 8.30\} \cup \{12.52 < \mathbf{x} \le 23.33\}, \\ g\_7 & \text{if } \quad \{\mathbf{x} \le -7.86\} \cup \{\mathbf{x} > 23.33\}. \end{cases}$$
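The piecewise expression above can be spot-checked by evaluating all seven $g_i$ at a point inside each stated interval; the short check below is our own addition.

```python
import math

def npdf(x, mu, s):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# Parameters of the seven populations in Example 2.
MU = [0.3, 4.0, 9.1, 1.9, 5.3, 8.0, 4.8]
SG = [1.0, 1.3, 1.4, 1.6, 2.0, 1.9, 2.3]

def dominating_g(x):
    """1-based index of the largest g_i(x) = f_i(x)/7 from Example 2."""
    vals = [npdf(x, m, s) / 7.0 for m, s in zip(MU, SG)]
    return 1 + max(range(7), key=lambda i: vals[i])
```

For instance, `dominating_g(2.0)` returns 4, in agreement with the interval $0.99 < x \le 2.58$ above.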

#### (b) For multidimension

In multidimensional cases, it is very complicated to obtain a closed expression for gmax(x). The difficulty comes from the various forms of the intersection curves between the pdf surfaces. This problem has attracted the interest of many authors [17, 18, 21–25].

Figure 1. The graph of seven one-dimension normal pdfs, fmax(x) and gmax(x).


Pham–Gia et al. [18] have attempted to find the function gmax(x); however, it has only been established for some cases of the bivariate normal distribution.

Example 3. Given the four bivariate normal pdfs N(μi, Σi) with the following specific parameters [16]:

$$\begin{aligned} \mu_1 &= \begin{bmatrix} 40 \\ 20 \end{bmatrix}, \mu_2 = \begin{bmatrix} 48 \\ 24 \end{bmatrix}, \mu_3 = \begin{bmatrix} 43 \\ 32 \end{bmatrix}, \mu_4 = \begin{bmatrix} 38 \\ 28 \end{bmatrix}, \\ \Sigma_1 &= \begin{pmatrix} 35 & 18 \\ 18 & 20 \end{pmatrix}, \Sigma_2 = \begin{pmatrix} 28 & -20 \\ -20 & 25 \end{pmatrix}, \Sigma_3 = \begin{pmatrix} 15 & 25 \\ 25 & 65 \end{pmatrix}, \Sigma_4 = \begin{pmatrix} 5 & -10 \\ -10 & 7 \end{pmatrix}. \end{aligned}$$

With q<sup>1</sup> = 0.25, q<sup>2</sup> = 0.2, q<sup>3</sup> = 0.4, and q<sup>4</sup> = 0.15, we have the graphs of gi(x) = qifi(x) and their intersection curves as shown in Figure 2.

Here, we do not find the explicit expression of gmax(x). Instead, we compute the Bayes error by integrating gmax(x) with the quasi-Monte Carlo method [17]. An algorithm for this computation has been constructed, and the corresponding MATLAB procedure is used in Section 4.
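A minimal quasi-Monte Carlo sketch using a 2-D Halton sequence (bases 2 and 3); the function and variable names are ours, the bivariate density is restricted to a diagonal covariance for brevity, and the chapter's MATLAB procedure is not reproduced. For two symmetric normals the estimate can be compared with the known Bayes error.

```python
import math

def halton(i, base):
    """i-th element (i >= 1) of the van der Corput sequence in `base`."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def diag_normal2(x, y, mu, s):
    """Bivariate normal density with independent components (illustration only)."""
    z = ((x - mu[0]) / s[0]) ** 2 + ((y - mu[1]) / s[1]) ** 2
    return math.exp(-0.5 * z) / (2.0 * math.pi * s[0] * s[1])

def bayes_error_qmc(comps, q, box, n=60000):
    """Pe = 1 - integral of gmax over the box, estimated with Halton points."""
    (x0, x1), (y0, y1) = box
    vol = (x1 - x0) * (y1 - y0)
    acc = 0.0
    for i in range(1, n + 1):
        x = x0 + (x1 - x0) * halton(i, 2)
        y = y0 + (y1 - y0) * halton(i, 3)
        acc += max(qi * diag_normal2(x, y, mu, s) for qi, (mu, s) in zip(q, comps))
    return 1.0 - vol * acc / n
```

The box must cover essentially all of the probability mass; the truncated tail then adds only a negligible bias to the estimate.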

#### 3.4. Estimating the probability density function

There are many parametric and nonparametric methods to estimate pdfs. In the examples and applications of Section 4, we use the kernel function method, which is popular in practice nowadays. It has the following formula:

$$\widehat{f}\left(\mathbf{x}\right) = \frac{1}{Nh\_1h\_2\dots h\_n} \sum\_{i=1}^{N} \prod\_{j=1}^{n} f\_j\left(\frac{\mathbf{x}\_j - \mathbf{x}\_{ij}}{h\_j}\right). \tag{19}$$

where $x_j$, $j = 1, 2, \ldots, n$, are the variables; $x_{ij}$, $i = 1, 2, \ldots, N$, is the $i$th observation of the $j$th variable; $h_j$ is the bandwidth parameter for the $j$th variable; and $f_j(\cdot)$ is the kernel function of the $j$th variable, which is usually normal, Epanechnikov, biweight, or triweight. According to this method, the choice

Figure 2. The graph of three bivariate normal pdfs and their gmax(x).

of the smoothing parameter and the type of kernel function play an important role and affect the result. Although Silverman [20], Martinez and Martinez [10], and some other authors [7, 13, 27] have discussed this problem, the optimal choice has still not been found. In this chapter, the smoothing parameter follows the idea of Scott [19] and the kernel function is the Gaussian one. We have also written MATLAB code to estimate the pdfs in n-dimensional space using this method.
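A sketch of the product-kernel estimator of Eq. (19) with Gaussian kernels; the bandwidth below follows Scott's rule $h_j = \hat{\sigma}_j N^{-1/(n+4)}$, which is one common reading of Scott's idea (the chapter's exact choice may differ).

```python
import math

def kde_product(data, x):
    """Product Gaussian-kernel estimate (Eq. 19) at the point x.

    data : list of N observations, each a list of n coordinates.
    Bandwidths: Scott-style h_j = sigma_j * N**(-1/(n+4))."""
    N, n = len(data), len(data[0])
    h = []
    for j in range(n):
        col = [row[j] for row in data]
        mean = sum(col) / N
        var = sum((v - mean) ** 2 for v in col) / (N - 1)
        h.append(math.sqrt(var) * N ** (-1.0 / (n + 4)))
    total = 0.0
    for row in data:
        p = 1.0
        for j in range(n):
            u = (x[j] - row[j]) / h[j]
            p *= math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += p
    den = float(N)
    for hj in h:
        den *= hj
    return total / den
```

Because every product Gaussian kernel integrates to one, the resulting estimate is itself a valid density.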

We have written the complete code for the proposed algorithm in MATLAB. It is applied effectively to the examples in Section 4.

### 4. Some applications

In this section, we consider three applications in three domains (biology, medicine, and economics) to illustrate the presented theory and to test the established algorithms. They also show that the proposed algorithm has more advantages than the existing ones.

Application 1. We consider classification for the well-known Iris flower data, which have been presented in many studies, e.g., Ref. [17]. These data are often used to compare new methods with existing ones in classification. The three varieties of Iris, namely, Setosa (Se), Versicolor (Ve), and Virginica (Vi), have data on four attributes: X1 = sepal length, X2 = sepal width, X3 = petal length, and X4 = petal width.

In this application, the cases of one, two, three, and four variables are considered in turn to classify the three groups (Se), (Ve), and (Vi) by the Bayesian method with different prior probabilities. The purpose is to compare the results of BayesC with those of BayesU, BayesR, and BayesL. Because the sizes of the three groups are equal, the results of BayesU, BayesR, and BayesL are the same. The correct probability of the methods is summarized in Table 3.

Table 3 shows that in almost all cases, the results of the proposed algorithm are better than those of the other algorithms, and the case using the three variables X1, X2, and X3 gives the best results.

Application 2. This application considers thyroid gland disease (TGD). The thyroid is an important gland, the largest in our body; it is responsible for the metabolism and the working of all cells. Some common thyroid diseases are hypothyroidism, hyperthyroidism, thyroid nodules, and thyroid cancer; they are dangerous diseases. Recently, the rate of thyroid gland disease has been increasing in some poor countries. The data include 3772 persons: 3541 in the ill group (I) and 231 in the non-ill group (NI). Details of these data are given at http://www.cs.sfu.ca/wangk/ucidata/dataset/thyroid–disease, in which the surveyed variables are Age (X1), Query on thyroxin (X2), Anti-thyroid medication (X3), Sick (X4), Pregnant (X5), Thyroid surgery (X6), Thyroid Stimulating Hormone (X7),


Table 3. The correct probability (%) in classifying Iris flower.


Triiodothyronine (X8), Total thyroxin (X9), T4U measured (X10), and Referral source (X11). In this application, we use a random 70% of the data (2479 elements of group I and 162 elements of group NI) as the training set to determine the significant variables, to estimate the pdfs, and to find a suitable model. The remaining 30% of the data (1062 elements of group I and 69 elements of group NI) are used as the test set. The result of the Bayesian method is also compared with the others.

To assess the effect of the independent variables on TGD, we build the logistic regression model for log(p/(1−p)) with the variables Xi, i = 1, 2, …, 11 (p is the probability of TGD). The analytical results are summarized in Table 4.

In Table 4, the three variables X1, X8, and X11 (in bold face) have statistical significance in classifying the two groups (I) and (NI) at the 5% level, so we use them to classify TGD.

Applying the PPC algorithm for the cases of one variable, two variables, and three variables, with all prior probabilities, we obtain the results given in Table 5.

Table 5 shows that the correct probability is high, and BayesC always gives the best result in all three cases of variables. With three variables, BayesC gives an almost exact result. We also compare BayesC with existing methods (Fisher, SVM, and logistic) for all three cases; in every case, BayesC is more advantageous than the others in reducing the Bayes error.
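The Bayesian classification rule underlying these comparisons assigns an observation to the group with the largest prior-weighted density (the gmax function of Figure 2). A minimal Python sketch, with hypothetical normal class-conditional densities and priors, is:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, priors, pdfs):
    """Assign x to the group g maximizing q_g * f_g(x) (the g_max rule)."""
    scores = {g: priors[g] * pdfs[g](x) for g in priors}
    return max(scores, key=scores.get)

# Two hypothetical groups with normal class-conditional densities
pdfs = {"I": lambda x: gaussian_pdf(x, 0.0, 1.0),
        "NI": lambda x: gaussian_pdf(x, 3.0, 1.0)}
priors = {"I": 0.7, "NI": 0.3}
print(bayes_classify(0.2, priors, pdfs))   # prints "I"
print(bayes_classify(2.9, priors, pdfs))   # prints "NI"
```

In the chapter's applications, the densities are the kernel estimates from the training set and the priors come from the PPC algorithm.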


| Variable | Sig. | Variable | Sig. |
|----------|------|----------|------|
| X1 | **0.000** | X7 | 0.304 |
| X2 | 0.279 | X8 | **0.000** |
| X3 | 0.998 | X9 | 0.995 |
| X4 | 0.057 | X10 | 0.999 |
| X5 | 0.997 | X11 | **0.000** |
| X6 | 0.997 | Const | 0.992 |

Table 4. Sig. values of the logistic regression model.

| Cases | Variables | BayesU | BayesR | BayesL | BayesC |
|-------|-----------|--------|--------|--------|--------|
| One variable | X1 | 91.13 | 97.47 | 97.46 | 97.97 |
| | X8 | 90.72 | 98.51 | 98.50 | 98.65 |
| | X11 | 90.53 | 97.48 | 97.47 | 98.19 |
| Two variables | X1, X8 | 98.73 | 98.77 | 98.77 | 99.78 |
| | X1, X11 | 98.11 | 98.65 | 97.65 | 99.44 |
| | X8, X11 | 98.71 | 98.77 | 98.77 | 99.82 |
| Three variables | X1, X8, X11 | 98.35 | 98.89 | 98.89 | 99.96 |

Table 5. The correct probability (%) in classifying TGD by the Bayesian method from the training set.

Using the best results for each method from Table 6 to classify the test set (1131 elements), we obtain the results given in Table 7.

From Table 7, we see that with the test set, BayesC also gives the best result.

Application 3. This application considers the problem of repaying bank debt (RBD) by customers. In bank credit operations, determining the repayment ability of customers is very important: if lending is too easy, the bank may face bad-debt problems, whereas if it is too strict, the bank will miss good business. Therefore, in recent years, the classification of credit applications by assessing the ability to repay bank debt has been specially studied and remains a difficult problem in Vietnam. In this section, we appraise this ability for companies in Can Tho city (CTC), Vietnam, using the proposed approach. We collected data on 214 enterprises operating in key sectors such as agriculture, industry, and commerce, including 143 cases of good debt (G) and 71 cases of bad debt (B). The data are provided by responsible organizations of CTC. Each company is evaluated by 13 independent variables based on expert opinion. The specific variables are given in Table 8.

Because of the sensitivity of the data, the author must conceal the real data and use a training data set. The steps performed in this application are similar to those in Application 2. The training set has 100 elements belonging to group G and 50 elements belonging to group B, and the test set has 43 elements belonging to group G and 21 elements belonging to group B. With the training set, the logistic regression model shows that only three variables, X1, X4, and X7, are statistically significant at the 5% level, so we use these three variables to perform BayesU, BayesR, BayesL, and BayesC. Their results are given in Table 9.

From Table 9, we see that BayesC gives the highest probability in all cases. We also use the logistic method, Fisher, and SVM with the training set to find their best results. The correct probabilities are given in Table 10.


Table 6. The correct probability (%) for optimal models of methods in classifying TGD.

Table 7. Compare the correct probability (%) in classifying TGD from the test set.

| Xi | Independent variable | Detail |
|----|----------------------|--------|
| X1 | Financial leverage | Total debt/total equity |
| X2 | Reinvestment | Total debt/total equity |
| X3 | Roe | Net profit/equity |
| X4 | Interest | (Net income + depreciation)/total assets |
| X5 | Floating capital | (Current assets − current liabilities)/total assets |
| X6 | Liquidity | (Cash + short-term investments)/current liabilities |
| X7 | Profits | Net profit/total assets |
| X8 | Ability | Net sales/total assets |
| X9 | Size | Logarithm of total assets |
| X10 | Experience | Years in business activity |
| X11 | Agriculture | Agricultural and forestry sector |
| X12 | Industry | Industry and construction |
| X13 | Commerce | Trade and services |

Table 8. The surveyed independent variables.

| Cases | Variables | BayesU | BayesR | BayesL | BayesC |
|-------|-----------|--------|--------|--------|--------|
| One variable | X1 | 86.21 | 86.14 | 84.13 | 87.13 |
| | X4 | 81.12 | 82.91 | 86.16 | 88.19 |
| | X7 | 83.21 | 84.63 | 83.14 | 84.52 |
| Two variables | X1, X4 | 87.25 | 88.72 | 87.19 | 89.06 |
| | X1, X7 | 88.16 | 88.34 | 83.26 | 89.56 |
| | X4, X7 | 89.25 | 89.04 | 89.02 | 91.34 |
| Three variables | X1, X4, X7 | 91.15 | 91.53 | 90.17 | 93.18 |

Table 9. The correct probability (%) in classifying RBD by the Bayesian method from the training set.

| Methods | One variable | Two variables | Three variables |
|---------|--------------|---------------|-----------------|
| Logistic | 84.04 | 88.29 | 88.69 |
| Fisher | 84.73 | 80.73 | 79.32 |
| SVM | 82.34 | 82.03 | 83.07 |
| BayesC | 88.19 | 91.34 | 93.18 |

Table 10. The correct probability (%) for optimal models of methods in classifying RBD.

Using the best model for each method from Table 10 to classify the test set (67 elements), we obtain the results given in Table 11.

Table 11. Compare the correct probability (%) in classifying RBD from the test set.

Once again from Table 11, we see that with the test data, BayesC also gives the best result.

### 5. Conclusion

This chapter has presented the classification algorithm by the Bayesian method in both theoretical and applied aspects. We establish the relations of the Bayes error with other measures and consider the problem of computing it in real applications in one and several dimensions. An algorithm to determine prior probabilities that may decrease the Bayes error is proposed. The researched problems are applied in three different domains: biology, medicine, and economics. They show that the proposed approach has more advantages than existing ones. In addition, a complete MATLAB procedure has been implemented and used effectively in several real applications. These examples show that our work has potential for research on real problems.

### Author details

Tai Vovan

Address all correspondence to: vvtai@ctu.edu.vn

College of Natural Sciences, Can Tho University, Can Tho City, Vietnam

### References


[5] Fadili MJ, et al. On the number of clusters and the fuzziness index for unsupervised FCA application to BOLD fMRI time series. Medical Image Analysis. 2001;5(1):55-67. DOI: 10.1016/S1361-8415(00)00035-9

[6] Fisher RA. The statistical utilization of multiple measurements. Annals of Eugenics. 1938;8:376-386. DOI: 10.1111/j.1469-1809.1938.tb02189

[7] Ghosh AK. Classification using kernel density estimates. Technometrics. 2006;48:120-132. DOI: 10.1198/004017005000000391

[8] Jan YK, Cheng CW, Shih YH. Application of logistic regression analysis of home mortgage loan prepayment and default. ICIC Express Letters. 2010;2:325-331. DOI: 10.12783/ijss.2015.03.014

[9] Hall LO, et al. A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Transactions on Neural Networks. 1992;3(5):672-682. DOI: 10.1109/72.159057

[10] Martinez WL, Martinez AR. Computational Statistics Handbook with MATLAB. 1st ed. Boca Raton: CRC Press; 2007. DOI: 10.1198/tech.2002.s89

[11] Matusita K. On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics. 1967;19:181-192. DOI: 10.1007/BF02481018

[12] Marta E. Application of Fisher's method to materials that only release water at high temperatures. Portugaliae Electrochimica Acta. 2001;15:301-311. DOI: 10.1016/S0167-7152(02)00310-3

[13] McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. 1st ed. New York: Marcel Dekker; 1988. DOI: 10.2307/2348072

[14] Miller G, Inkret WC, Little TT. Bayesian prior probability distributions for internal dosimetry. Radiation Protection Dosimetry. 2001;94:347-352. DOI: 10.1093/oxfordjournals.rpd.a006509

[15] Pham-Gia T, Turkkan N. Bounds for the Bayes error in classification: A Bayesian approach using discriminant analysis. Statistical Methods and Applications. 2006;16:7-26. DOI: 10.1007/s10260-006-0012-x

[16] Pham-Gia T, Turkkan N, Bekker A. Bayesian analysis in the L1-norm of the mixing proportion using discriminant analysis. Metrika. 2008;64:1-22. DOI: 10.1007/s00184-006-0027-1

[17] Pham-Gia T, Turkkan N, Tai VV. Statistical discrimination analysis using the maximum function. Communications in Statistics - Simulation and Computation. 2008;37:320-336. DOI: 10.1080/03610910701790475

[18] Pham-Gia T, Nhat ND, Phong NV. Statistical classification using the maximum function. Open Journal of Statistics. 2015;15:665-679. DOI: 10.4236/ojs.2015.57068

[19] Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. 1st ed. New York: Wiley; 1992. DOI: 10.1002/9780470316849


### **Hypothesis Testing for High-Dimensional Problems**

DOI: 10.5772/intechopen.70210

### Naveen K. Bansal

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70210

### Abstract

For high-dimensional hypothesis testing problems, new approaches have emerged in recent years, the most promising of which use a Bayesian approach. In this chapter, we review some of the past approaches, applicable only to low-dimensional hypothesis testing, and contrast them with modern approaches to high-dimensional hypothesis testing. We review some new results based on Bayesian decision theory and show how a Bayesian approach can accommodate directional hypothesis testing and skewness in the alternatives. A real example of gene expression data is used to demonstrate a Bayesian decision-theoretic approach to directional hypothesis testing with skewed alternatives.

Keywords: multiple directional hypotheses, false discovery rate, familywise error rate, gene expression, skew-normal distribution

### 1. Introduction

In today's world, most statistical inference problems involve high-dimensional multiple hypothesis testing. Whenever we collect data, we collect it on multiple features, in some cases involving very high-dimensional variables. For example, gene expression data consist of expression levels on thousands of genes; image data consist of intensities on multiple voxels. The statistical analysis of these types of data involves multiple hypotheses testing (MHT). It is well known that univariate methods cannot be applied to simultaneously test hypotheses on the multiple features: the error rates of the univariate analyses get multiplied under MHT, and as a result the actual error rate can be very high. To understand the main issue of multiplicity, consider the following example. Suppose there are, say, 100 misspelled words in a book, and each of these words occurs on 5% of the pages. You pick a page at random. For each misspelled word, the probability of finding that word on the page is only 0.05. However, the probability that you will find at least one of the 100 misspelled words is much higher. If the words were independently distributed, the probability of finding at least one misspelled word is 1 − (0.95)<sup>100</sup> ≈ 0.994. If the placements of the misspelled words were positively dependent, the probability would be lower. For example, in the extreme case of dependence where they all occur together, the probability is 0.05. The same phenomenon occurs in MHT: statistical inference based on the error rate of individual hypothesis tests can lead to a very high error rate for the combined hypotheses. Thus, for MHT, an adjustment in the error rate needs to be made. Note that the adjustment may depend on the dependence structure, but due to the complexity of the dependence structure in high dimensions, dependency is usually ignored in the literature [1].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
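The arithmetic of the misspelled-word example can be checked directly:

```python
# Probability of finding at least one of 100 independently placed
# misspelled words, each present on 5% of pages.
p_word = 0.05
m = 100
p_at_least_one = 1 - (1 - p_word) ** m
print(round(p_at_least_one, 3))  # 0.994, versus 0.05 for a single word
```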

The statistical inference depends on how we define the error rate for the combined hypotheses testing. Suppose there are m hypothesis tests $H\_0^i$ vs. $H\_a^i$, $i = 1, 2, \dots, m$. If we do not want to make even one false discovery, then we should control the familywise error rate (FWER), which is defined as

$$FWER = \Pr\left(\text{Falsely reject } H\_0^i \text{ for at least one } i, \, i = 1, 2, \dots, m\right) \tag{1}$$

There are many methods for controlling FWER ≤ $\alpha\_F$ (e.g., = 0.05). The simplest is Bonferroni's procedure. Let $T\_i$ be the test statistic for testing $H\_0^i$ vs. $H\_a^i$, with corresponding p-value $p\_i$. Then Bonferroni's procedure rejects $H\_0^i$ if $p\_i < \alpha\_F/m$. To see why this controls the FWER, let $I\_0$ be the set of all i for which $H\_0^i$ is true, and suppose $p\_j < \alpha\_F/m$ for at least one $j \in I\_0$. Then, using Boole's inequality, we have, from Eq. (1),

$$FWER = \Pr\left\{\bigcup\_{i \in I\_0} \left(p\_i < \alpha\_F/m\right)\right\} \le \sum\_{i \in I\_0} \Pr\left\{p\_i < \alpha\_F/m\right\} \tag{2}$$

Now, since $p\_i \sim U(0, 1)$ under $H\_0^i$, we have $\Pr\{p\_i < \alpha\_F/m\} = \alpha\_F/m$. Then, assuming that there are $m\_0$ elements in $I\_0$, we have, from Eq. (2),

$$FWER \leq \frac{m\_0 \alpha\_F}{m} \leq \alpha\_F$$

Holm [2] gave a modified version of Bonferroni's procedure which also controls the familywise error rate. Holm's procedure is the following: first, rank all the p-values, $p\_{(1)} \le p\_{(2)} \le \dots \le p\_{(m)}$, and let $H\_0^{(1)}, H\_0^{(2)}, \dots, H\_0^{(m)}$ be the associated null hypotheses. Let l be the smallest index such that $p\_{(l)} > \alpha\_F/(m - l + 1)$. Then reject only the null hypotheses $H\_0^{(1)}, H\_0^{(2)}, \dots, H\_0^{(l-1)}$. Note that the selected hypotheses have p-values with $p\_{(1)} < \alpha\_F/m$, $p\_{(2)} < \alpha\_F/(m - 1)$, …, $p\_{(l-1)} < \alpha\_F/(m - l + 2)$; Holm's procedure is thus more powerful than Bonferroni's, since any hypothesis rejected under Bonferroni's procedure is also rejected under Holm's.
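Both procedures can be sketched in a few lines of Python (the chapter itself contains no code; the p-values below are hypothetical):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i < alpha/m; controls FWER <= alpha."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    with alpha/(m - k + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):          # k = 0, 1, ..., m-1
        if pvals[i] > alpha / (m - k):     # threshold alpha/(m - k)
            break
        reject[i] = True
    return reject

pvals = [0.001, 0.011, 0.02, 0.04, 0.2]    # hypothetical p-values
print(bonferroni(pvals))  # [True, False, False, False, False]
print(holm(pvals))        # [True, True, False, False, False]
```

On this example, Holm's procedure rejects the second hypothesis (0.011 < 0.05/4) that Bonferroni's misses, illustrating its greater power.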

The above Bonferroni-type procedures are not very satisfactory when m is very high. Suppose m = 10,000 (actually not very high for most high-dimensional problems), and suppose we want to control FWER at $\alpha\_F = 0.05$. Then, for Holm's procedure, the smallest p-value has to be below 0.000005 in order to reject even one hypothesis, which may be very hard to achieve. The problem is not really with Holm's procedure; the problem is with the use of FWER as an error rate. For a high-dimensional problem, it is unrealistic to seek a procedure that will never make even one false discovery. Benjamini and Hochberg [1] proposed a new error rate called the false discovery rate (FDR) and a procedure that works much better for high-dimensional MHT.

In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives. An extension of directional hypotheses is presented in Section 3. In Section 3, we also discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional hypotheses is considered by converting p-values to normally distributed test statistics. We also discuss, in Section 4, a Bayes procedure under skew-normal alternatives. An application using real data of gene expressions is also discussed in Section 4. Some concluding remarks are made in Section 5.

### 2. False discovery rate (FDR), Benjamini and Hochberg's (BH) procedure, and Bayesian procedures

For each hypothesis test $H\_0^i$ vs. $H\_a^i$, suppose a statistical procedure either rejects the null hypothesis $H\_0^i$ or fails to reject it. For the sake of simplicity, we equate failing to reject $H\_0^i$ with accepting it; however, for small sample sizes, it would be unwise to conclude that $H\_0^i$ is accepted. From now on, rejections of the null will be called discoveries. Table 1 shows the possible outcomes of a procedure, where, for example, $V$ is the total number of discoveries, among which $V\_0$ is the number of false discoveries.

Thus, the proportion of false discoveries is $V\_0/\max(V, 1)$. The FDR is defined as the expected proportion of false discoveries, that is,

$$FDR = E\left[\frac{V\_0}{\max(V, 1)}\right].\tag{3}$$

If, for example, FDR = 0.05, then we can expect on average 5% of all discoveries to be false. In other words, under repeated experiments we make, on average, 5% false discoveries (in a frequentist's sense). Note that FDR ≤ FWER = $P(V\_0 \ge 1)$, as the following inequality shows:


Table 1. Total number of decisions made.


$$FDR = E\left[\frac{V\_0}{\max(V, 1)}\right] = E\left[\frac{V\_0}{\max(V, 1)} I(V\_0 \ge 1)\right] \le E[I(V\_0 \ge 1)] = P(V\_0 \ge 1).$$

Thus, we are likely to make a larger number of discoveries under the FDR approach than under FWER: if a procedure controls FWER (≤ α), then it also controls FDR (≤ α), but not vice versa.

#### 2.1. Benjamini and Hochberg's procedure

Benjamini and Hochberg [1] proposed the following BH procedure which controls the FDR.

Let $p\_i$ be the p-value for the ith hypothesis under a test statistic $T\_i$. Suppose $T\_1, T\_2, \dots, T\_m$ are independently distributed. Let $p\_{[1]} < p\_{[2]} < \dots < p\_{[m]}$ be the ordered p-values, with the corresponding null hypotheses denoted by $H\_0^{(1)}, H\_0^{(2)}, \dots, H\_0^{(m)}$. Let

$$i\_0 = \max\left\{ i : p\_{[i]} \le \frac{i}{m} \alpha \right\},$$

Then reject $H\_0^{(i)}$ for all $i \le i\_0$.

This procedure controls FDR ≤ $(m\_0/m)\,\alpha \le \alpha$. Since $m\_0$ is unknown, having the upper bound $(m\_0/m)\,\alpha$ is not very useful; if $m\_0$ can be estimated reliably, a better bound is possible.

The above result was proven in [1] under independence of the test statistics. Benjamini and Yekutieli [3] extended the result to positively correlated test statistics, and they also sharpened the BH procedure with a new $i\_0$ defined as

$$i\_0 = \max\left\{ i : p\_{[i]} \le \frac{i}{m\,c(m)}\,\alpha \right\},$$

where $c(m) = \sum\_{i=1}^{m} \frac{1}{i}$.
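A Python sketch of the BH step-up procedure, with the $c(m)$ correction for positively correlated test statistics as an option (the p-values below are illustrative):

```python
def benjamini_hochberg(pvals, alpha=0.05, dependent=False):
    """BH step-up procedure: reject the hypotheses with the i0 smallest
    p-values, where i0 = max{ i : p_[i] <= (i/m) * alpha }.  With
    dependent=True, alpha is further divided by c(m) = sum_{i=1}^m 1/i."""
    m = len(pvals)
    c = sum(1.0 / i for i in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    i0 = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * c):
            i0 = rank                      # keep the largest such rank
    reject = [False] * m
    for i in order[:i0]:
        reject[i] = True
    return reject

# Hypothetical p-values for m = 10 hypotheses
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(benjamini_hochberg(pvals)))                  # 2 discoveries
print(sum(benjamini_hochberg(pvals, dependent=True)))  # 1 discovery
```

Note the step-up character: every p-value is checked and the largest qualifying rank is kept, even if some intermediate ranks fail.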

#### 2.2. Bayesian procedures

Under the Bayesian setting, we assume that $H\_0^i$ and $H\_a^i$, $i = 1, 2, \dots, m$, are generated probabilistically with

$$P(H\_0^i) = p \text{ and } P(H\_a^i) = 1 - p$$

Under this setting, [4] developed the concept of the local false discovery rate (fdr). Suppose $T\_i$, $i = 1, 2, \dots, m$, are test statistics with pdfs $T\_i \mid H\_0 \sim f\_0(t)$ and $T\_i \mid H\_a \sim f\_a(t)$. Then, marginally, $T\_i \sim f(t) = p f\_0(t) + (1 - p) f\_a(t)$, and

$$fdr(t) = P\{H\_0^i | T\_i = t\} = \frac{pf\_0(t)}{f(t)}\tag{4}$$

The idea is that if $T\_i \in [t, t + \delta t]$, where $\delta t \to 0$, then fdr(t) represents the proportion of times $H\_0^i$ is true. If t is very large, then fdr(t) is very small, indicating that the probability of $H\_0^i$ is very small (i.e., the false discovery rate is very small). In Eq. (4), p and f(t) are unknown, but they can be estimated (see [4]).
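As a numerical illustration of Eq. (4), take a hypothetical mixture with p = 0.9 nulls distributed N(0, 1) and alternatives distributed N(3, 1):

```python
import math

def normal_pdf(t, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def local_fdr(t, p, f0, fa):
    """fdr(t) = p f0(t) / f(t), with f(t) = p f0(t) + (1 - p) fa(t)."""
    num = p * f0(t)
    return num / (num + (1 - p) * fa(t))

# Hypothetical mixture: 90% nulls ~ N(0, 1), 10% alternatives ~ N(3, 1)
f0 = lambda t: normal_pdf(t, 0.0, 1.0)
fa = lambda t: normal_pdf(t, 3.0, 1.0)
for t in (0.0, 2.0, 4.0):
    print(round(local_fdr(t, 0.9, f0, fa), 3))  # fdr falls as t grows
```

In practice, p and f(t) must be estimated from the observed test statistics, as noted above.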

Storey [5] proposed a positive false discovery rate


$$pFDR = E\left[\frac{V\_0}{V} | V > 0\right],\tag{5}$$

where expectation is with respect to the distribution of (Ti,θi),i = 1, 2, …, m. Under the assumption that T1,T2, … Tm are identically and independently distributed, [6] proved that

$$pFDR(\Gamma) = P(H\_0 | T \in \Gamma),$$

for a procedure that rejects $H\_0^i$ when $T\_i \in \Gamma$. Based on this, the q-value for multiple hypotheses (analogous to the p-value for a single hypothesis) is defined as the smallest value of pFDR(Γ) such that the observed $T\_i = t\_i \in \Gamma$; see [6]. In most cases, q-value$(t\_i) = P(H\_0 \mid T\_i > t\_i)$. This gives a procedure under multiple hypotheses that rejects $H\_0^i$ if q-value$(t\_i) < \alpha$.

### 3. Directional hypotheses testing

As described earlier, the null hypothesis $H\_0^i$ is either accepted or rejected. In most cases, however, rejection of the null hypothesis is not sufficient: after rejecting $H\_0^i$, finding the direction of the alternative may also be important. A detailed discussion of directional hypotheses can be found in [7].

Directional hypotheses testing involves testing $H\_0^i$ against the directional hypotheses $H\_-^i$ and $H\_+^i$, and the objective is to obtain a selection region $\{T\_i \in \Gamma\_-\}$ for selecting $H\_-^i$ and a selection region $\{T\_i \in \Gamma\_+\}$ for selecting $H\_+^i$. In other words, $H\_0^i$ is rejected if $T\_i \in \Gamma\_-$ or $T\_i \in \Gamma\_+$, and the direction $H\_-^i$ or $H\_+^i$ is determined according to whether $T\_i \in \Gamma\_-$ or $T\_i \in \Gamma\_+$, respectively. Analogous to Table 1, we now have Table 2.

Table 2. Number of decisions under directional hypotheses.

Table 2 illustrates the possible outcomes when accepting $H\_0$, selecting $H\_-$, or selecting $H\_+$. For example, out of the $V$ times $H\_-$ is selected, $V\_0$ errors are made when in fact $H\_0$ is true, and $V\_+$ errors are made when in fact $H\_+$ is true. In other words, when selecting $H\_-$, not only is $H\_0$ falsely rejected $V\_0$ times, but the direction is also falsely selected $V\_+$ times. This leads to the concept of the directional false discovery rate (DFDR), defined as

$$DFDR = E\left[\frac{V\_0 + V\_+ + W\_0 + W\_-}{\max(V + W, 1)}\right].\tag{6}$$

This is analogous to the FDR for two-sided alternatives. For most cases, [8] showed that DFDR-controlling procedures for directional hypotheses can be treated as FDR-controlling procedures for two-sided multiple hypotheses, with the direction determined by the sign of the test statistic.

Bansal and Miescke [9] considered a decision-theoretic formulation of multiple hypotheses problems. The approach assumes parametric modeling. Suppose the model for the observed data x is represented by P(x; θ, η), where $\theta = (\theta\_1, \theta\_2, \dots, \theta\_m)'$ is a parameter vector of interest and η is a nuisance parameter. The problem of interest is to test

$$H\_0^i: \theta\_i = 0 \ \text{vs.} \ H\_-^i: \theta\_i < 0 \ \text{or} \ H\_+^i: \theta\_i > 0 \tag{7}$$

Let the loss function of a decision rule $d(x) = (d\_1(x), d\_2(x), \dots, d\_m(x))$ be given by

$$L(\boldsymbol{\theta}, \boldsymbol{d}(\boldsymbol{x})) = \sum\_{i=1}^{m} l\_i(\boldsymbol{\theta}, d\_i(\boldsymbol{x})), \tag{8}$$

where $l\_i(\theta, d\_i(x))$ is the individual loss of $d\_i$. Here, $d\_i \in \{-1, 0, 1\}$, with $d\_i = 0$, $d\_i = -1$, and $d\_i = 1$ meaning accepting $H\_0^i$, selecting $H\_-^i$, and selecting $H\_+^i$, respectively. Note that for the "0-1" loss, that is, when $l\_i = 0$ for a correct decision and $l\_i = 1$ for an incorrect decision, L is the total number of incorrect decisions. Thus, minimizing $E[L(\theta, d(X))]$ under the "0-1" loss amounts to minimizing the expected number of incorrect decisions.

Now, suppose under the Bayesian setting, θi,i = 1, 2, …, m are generated from

$$\pi(\theta) = p_- \pi_-(\theta) + p_0 I(\theta = 0) + p_+ \pi_+(\theta), \tag{9}$$

where $\pi_-$ is the prior density over $(-\infty, 0)$ and $\pi_+$ is the prior density over $(0, \infty)$. A special case of prior (9) is $\pi_-(\theta) = \pi_+(-\theta)$; in this case, $p_-$ and $p_+$ reflect the skewness in the alternative hypotheses. For example, if $p_- = p_+$, we have a symmetric case, and selecting $H_-$ or $H_+$ after rejecting $H_0$ based on the sign of the test statistic makes sense. On the other hand, if $p_- < p_+$, more of the $\theta_i$s are positive than negative. For many gene expression data analyses, this is a useful case: over-expressed genes may occur more frequently than under-expressed genes as a result of gene mutation (occurring naturally or induced by external factors). For specific examples, see [9, 10].

From now on, we focus on the "0-1" loss. The results can be easily extended to other loss functions. The "0-1" loss can be written as

Hypothesis Testing for High-Dimensional Problems http://dx.doi.org/10.5772/intechopen.70210 69

$$L(\theta, \mathbf{d}) = \sum_{i=1}^{m} \left[ 1 - \sum_{j=-1}^{1} I(d_i = j) I(\nu_i^{\theta} = j) \right],$$

where $\nu_i^{\theta} \in \{-1, 0, 1\}$ is an indicator variable with $\nu_i^{\theta} = -1$ when $\theta_i < 0$, $\nu_i^{\theta} = 0$ when $\theta_i = 0$, and $\nu_i^{\theta} = 1$ when $\theta_i > 0$. It is easy to see that minimizing the posterior expected loss yields the selection rule that selects $H_-^i$, $H_0^i$, or $H_+^i$ according to $\max\{v_i^{(-)}, v_i^{(0)}, v_i^{(+)}\}$, where $v_i^{(-)} = P(H_-^i \mid x)$, $v_i^{(0)} = P(H_0^i \mid x)$, and $v_i^{(+)} = P(H_+^i \mid x)$.
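In code, this unconstrained Bayes rule is simply an argmax over the three posterior probabilities per hypothesis. A minimal Python sketch (the posterior probabilities below are illustrative numbers, not derived from a fitted model):

```python
# Unconstrained Bayes rule: for each i, pick d_i in {-1, 0, +1} by
# maximizing the posterior probability of H^i_-, H^i_0, H^i_+.

def bayes_rule(v_minus, v_zero, v_plus):
    """Return d_i = argmax over the three posterior probabilities."""
    decisions = []
    for vm, v0, vp in zip(v_minus, v_zero, v_plus):
        _, d = max((vm, -1), (v0, 0), (vp, 1))  # max on probability first
        decisions.append(d)
    return decisions

print(bayes_rule([0.70, 0.05, 0.20],    # v_i^{(-)}
                 [0.20, 0.15, 0.60],    # v_i^{(0)}
                 [0.10, 0.80, 0.20]))   # v_i^{(+)}  ->  [-1, 1, 0]
```

For the first hypothesis the negative direction wins, for the second the positive direction, and the third is accepted as null.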

#### 3.1. The constrained Bayes rule


The Bayes procedure described earlier accommodates skewness in the prior, but no false discovery rate is controlled. In order to control a false discovery rate, we need a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.

The directional false discovery rate (6) is defined in a frequentist manner, in which the expectation is with respect to X|θ. Let us define the BDFDR as Eq. (6) when the expectation is taken with respect to X|θ and a further expectation is then taken with respect to θ. We define the posterior version of Eq. (6), the PDFDR, as the case when the expectation is taken with respect to the posterior distribution of θ|X = x. It can be shown that

$$PDFDR = 1 - \frac{\sum_{i=1}^{m} \left\{ I(d_i = -1)\, v_i^{(-)} + I(d_i = +1)\, v_i^{(+)} \right\}}{(|D_-| + |D_+|) \vee 1} \tag{10}$$

Here, $|D_-| = \sum_{i=1}^{m} I(d_i = -1)$ and $|D_+| = \sum_{i=1}^{m} I(d_i = 1)$.

A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint that PDFDR ≤ α. There can be many approaches to this constrained minimization. We present here the approach given in [9], which is as follows:

Consider the sets $D_-^B$ and $D_+^B$ of indices that select $H_-^i$ and $H_+^i$, respectively, under the unconstrained Bayes rule, that is, when $v_i^{(-)} = \max\{v_i^{(-)}, v_i^{(0)}, v_i^{(+)}\}$ and when $v_i^{(+)} = \max\{v_i^{(-)}, v_i^{(0)}, v_i^{(+)}\}$, respectively. Define $\xi_i = \nu_i^{(-)}$ for $i \in D_-^B$ and $\xi_i = \nu_i^{(+)}$ for $i \in D_+^B$, and then rank all $\xi_i$, $i \in D_-^B \cup D_+^B$, from the lowest to the highest. Let the ranked values be denoted by $\xi_{[1]} \le \xi_{[2]} \le \dots \le \xi_{[\hat{k}]}$, where $\hat{k} = |D_-^B \cup D_+^B|$. Denote

$$\hat{i}_0 = \max \left\{ j \le \hat{k} : \frac{1}{j} \sum_{i=1}^{j} \xi_{[\hat{k}-i+1]} \ge 1 - \alpha \right\}.$$

Let $D_\xi$ denote the set of indices corresponding to $\xi_{[\hat{k}]} \ge \xi_{[\hat{k}-1]} \ge \dots \ge \xi_{[\hat{k}-\hat{i}_0+1]}$. Now, select $H_-^i$ for $i \in D_-^B \cap D_\xi$, and $H_+^i$ for $i \in D_+^B \cap D_\xi$.
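The ranking step amounts to keeping the hypotheses with the largest posterior probabilities as long as their running mean stays at least $1-\alpha$. A Python sketch (the helper name `constrained_selection` and the probability values are illustrative):

```python
def constrained_selection(xi, alpha):
    """Constrained Bayes selection sketch: xi holds the posterior
    probabilities of the hypotheses chosen by the unconstrained rule
    (one per index in D_-^B union D_+^B).  Keep the largest values while
    the running mean of the kept values stays >= 1 - alpha, which bounds
    the posterior directional FDR by alpha.  Returns retained positions."""
    order = sorted(range(len(xi)), key=lambda i: xi[i], reverse=True)
    i0, total = 0, 0.0
    for j, idx in enumerate(order, start=1):
        total += xi[idx]
        if total / j >= 1 - alpha:   # mean of the top-j values
            i0 = j                   # i0 = largest j satisfying the bound
    return set(order[:i0])

xi = [0.99, 0.95, 0.80, 0.60]        # illustrative posterior probabilities
print(constrained_selection(xi, alpha=0.10))  # {0, 1, 2}
```

Since the values are sorted in decreasing order, the running mean is non-increasing, so the scan can stop at the first failure; the loop above simply records the last passing rank.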

#### 3.2. Estimating mixture parameters

The above procedure requires estimation of the mixing proportions $(p_-, p_0, p_+)$ and of the nuisance parameter η. Note that, marginally,

$$X_i \sim p_- f_-(\mathbf{x}_i|\eta) + p_0 f_0(\mathbf{x}_i|\eta) + p_+ f_+(\mathbf{x}_i|\eta),$$

where f0(xi| η) = f(xi| 0,η), and

$$f_-(\mathbf{x}_i|\eta) = \int_{-\infty}^{0} f(\mathbf{x}_i|\theta, \eta)\, \pi_-(\theta)\, d\theta, \qquad f_+(\mathbf{x}_i|\eta) = \int_{0}^{\infty} f(\mathbf{x}_i|\theta, \eta)\, \pi_+(\theta)\, d\theta,$$

and X1, X2, …, Xm are independently distributed. Estimates of the parameters of the mixture density can be obtained by using the EM algorithm. It is easy to see that the EM estimators of $(p_-, p_0, p_+)$ satisfy the following iterative scheme:

$$p_-^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_-^{(j)} f_-(\mathbf{x}_i|\eta)}{p_-^{(j)} f_-(\mathbf{x}_i|\eta) + p_0^{(j)} f_0(\mathbf{x}_i|\eta) + p_+^{(j)} f_+(\mathbf{x}_i|\eta)},$$

$$p_0^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_0^{(j)} f_0(\mathbf{x}_i|\eta)}{p_-^{(j)} f_-(\mathbf{x}_i|\eta) + p_0^{(j)} f_0(\mathbf{x}_i|\eta) + p_+^{(j)} f_+(\mathbf{x}_i|\eta)},$$

$$p_+^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_+^{(j)} f_+(\mathbf{x}_i|\eta)}{p_-^{(j)} f_-(\mathbf{x}_i|\eta) + p_0^{(j)} f_0(\mathbf{x}_i|\eta) + p_+^{(j)} f_+(\mathbf{x}_i|\eta)}.$$

The parameter η can also be estimated iteratively using the EM algorithm or by other means. See [9] for more details.
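One EM step for the mixing proportions can be sketched as below. Each update averages the posterior "responsibilities" of the three components; the density values are illustrative numbers standing in for $f_-$, $f_0$, $f_+$ evaluated at each observation with η fixed at its current estimate:

```python
def em_update(p, dens):
    """One EM step for (p_-, p_0, p_+).

    p    : current proportions (p_minus, p_0, p_plus)
    dens : per-observation tuples (f_-(x_i|eta), f_0(x_i|eta), f_+(x_i|eta))
    Returns the updated proportions (they sum to 1 by construction)."""
    m = len(dens)
    new = [0.0, 0.0, 0.0]
    for f in dens:
        denom = sum(pk * fk for pk, fk in zip(p, f))  # mixture density at x_i
        for k in range(3):
            new[k] += p[k] * f[k] / denom             # posterior responsibility
    return [nk / m for nk in new]                     # average responsibilities

p = (0.2, 0.6, 0.2)                    # current (p_-, p_0, p_+)
dens = [(0.10, 0.30, 0.05),            # toy density values, observation 1
        (0.02, 0.20, 0.40)]            # toy density values, observation 2
updated = em_update(p, dens)
print(updated)                          # still a probability vector
```

Iterating `em_update` until the proportions stop changing gives the EM estimates of the mixing weights.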

### 4. Bayes rules by converting p-values to normally distributed test statistics

Let $T_i$, $i = 1, 2, \dots, m$, be independently and identically distributed test statistics, and let $P_i = P(T_i \le t_i \mid H_0^i)$ be the corresponding $p$-values. Note that under $H_0^i$, $P_i \sim U(0, 1)$. Let $X_i = \Phi^{-1}(P_i)$ be the corresponding $z$-score. Then, under $H_0^i$, $X_i \sim N(0, 1)$. Efron [11] suggested using $X_i \sim N(0, \sigma^2)$ under $H_0^i$ with $\sigma^2$ appropriately estimated; he pointed out that, in practice, $\sigma^2$ may not equal 1 due to possible correlation among the multiple components. Under the alternative, we assume that $X_i \sim N(\theta_i, \sigma^2)$, where the $\theta_i$s are generated from the distribution described in Eq. (9). Admittedly, this assumption is a big leap; in practice, however, it can be tested, and if it holds, it can lead to very powerful results. [9] assumed that $\pi_+(\theta)$ is a truncated normal distribution $N(0, \sigma^2/\omega)$ and $\pi_-(\theta) = \pi_+(-\theta)$, where $\omega$ is a positive constant reflecting how inflated we believe the alternative $\theta_i$s to be. It can be seen that


$$v_i^{(-)} \propto p_-\, T_-(\mathbf{x}_i), \quad v_i^{(+)} \propto p_+\, T_+(\mathbf{x}_i), \quad v_i^{(0)} \propto p_0 \tag{11}$$

with the proportionality constant $[p_- T_-(\mathbf{x}_i) + p_+ T_+(\mathbf{x}_i) + p_0]^{-1}$. Also, $T_-(\mathbf{x}_i) = T_+(-\mathbf{x}_i)$, and

$$T_+(\mathbf{x}_i) = \exp\left\{\frac{\mathbf{x}_i^2}{2(1+\omega)\sigma^2}\right\} \Phi\left(\frac{\mathbf{x}_i}{\sigma\sqrt{1+\omega}}\right) \tag{12}$$

In order to apply the Bayes procedure as discussed in Section 3, all we need are Eqs. (11) and (12). For computation details, see [9].
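The two ingredients, the $p$-value-to-$z$-score transform and $T_+$ from Eq. (12), can be sketched with the Python standard library alone (`statistics.NormalDist` supplies $\Phi$ and $\Phi^{-1}$). The default values of $\sigma$ and $\omega$ below are placeholders; in practice $\sigma$ is estimated from the data and $\omega$ is chosen by the analyst:

```python
from statistics import NormalDist
from math import exp, sqrt

def z_from_p(p_value):
    """z-score X = Phi^{-1}(P) of a p-value."""
    return NormalDist().inv_cdf(p_value)

def t_plus(x, sigma=1.0, omega=1.0):
    """T_+(x) from Eq. (12): exp{x^2 / (2(1+omega)sigma^2)} * Phi(x / (sigma*sqrt(1+omega)))."""
    return exp(x * x / (2 * (1 + omega) * sigma ** 2)) * \
        NormalDist().cdf(x / (sigma * sqrt(1 + omega)))

print(z_from_p(0.975))   # upper 2.5% z-score, about 1.96
print(t_plus(0.0))       # equals Phi(0) = 0.5 at x = 0
```

By the symmetry noted above, $T_-(x)$ is obtained as `t_plus(-x)`, so these two functions are enough to evaluate the posterior probabilities in Eq. (11).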

#### 4.1. Skew-normal alternatives


In the above discussion, we assumed that the $\theta_i$s are generated from the distribution with pdf (9). [12] considered the case when the $\theta_i$s are generated from a skew-normal distribution under the alternative hypotheses. The skew-normal distribution was first introduced in [13]. It has an important property: if $(\xi_1, \xi_2)$ is bivariate normal with mean 0, then the distribution of $\xi_1 \mid \xi_2 > 0$ is skew-normal. Its pdf is given by

$$g_+(\xi_1) = \frac{2}{\sigma_1}\, \phi\left(\frac{\xi_1}{\sigma_1}\right) \Phi\left(\lambda \frac{\xi_1}{\sigma_1}\right),$$

and is denoted by $SN(0, \sigma_1, \lambda)$. Here, λ is a skewness parameter; if λ = 0, this distribution is $N(0, \sigma_1)$. The implication of this result is the following: suppose that within a normal system an outcome follows a normal distribution, but a correlated factor starts exerting a positive effect; then the outcome variable will start following a skew-normal distribution. For example, consider RNA expression experiments and assume that genes are in a normal state. Suppose a gene mutation occurs at a later stage and starts exerting a positive effect on the affected genes. In this case, based on the above property of the skew-normal distribution, we can assume that the expressions of the affected genes will follow a skew-normal distribution.
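This conditioning property is easy to check by simulation: conditioning $\xi_1$ on $\xi_2 > 0$ shifts its distribution to the right when the correlation is positive. The correlation value below is an arbitrary illustrative choice:

```python
import random
from math import sqrt

# If (xi1, xi2) is standard bivariate normal with correlation rho > 0,
# then xi1 | xi2 > 0 is right-skewed (skew-normal).  rho = 0.8 is an
# arbitrary illustrative choice.
random.seed(0)
rho = 0.8
kept = []
for _ in range(20000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    xi1 = z1
    xi2 = rho * z1 + sqrt(1 - rho ** 2) * z2   # correlated component
    if xi2 > 0:                                # condition on xi2 > 0
        kept.append(xi1)

mean_kept = sum(kept) / len(kept)
print(round(mean_kept, 2))   # clearly positive, reflecting the right skew
```

For a standard bivariate normal the conditional mean is $\rho\sqrt{2/\pi} \approx 0.64$ here, and the simulated average lands close to that value.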

Under this formulation, we assume that $\theta_i$, $i = 1, 2, \dots, m$, are generated from

$$\pi_{\lambda}(\theta_i) = pI(\theta_i = 0) + (1 - p)\frac{2}{\sigma_1}\phi\left(\frac{\theta_i}{\sigma_1}\right)\Phi\left(\lambda\frac{\theta_i}{\sigma_1}\right) \tag{13}$$

Now, similar to Eq. (11), it can be seen that

$$v_i^{(-)} \propto (1-p)\, T_-(\mathbf{x}_i), \quad v_i^{(+)} \propto (1-p)\, T_+(\mathbf{x}_i), \quad v_i^{(0)} \propto p$$

with proportionality constant $[(1-p)(T_+(\mathbf{x}_i) + T_-(\mathbf{x}_i)) + p]^{-1}$, where

$$T_+(\mathbf{x}_i) = \frac{2}{\sigma_1} \int_0^{\infty} \exp\left(\frac{\mathbf{x}_i\theta}{\sigma^2}\right) \phi\left(\sqrt{\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}}\,\theta\right) \Phi\left(\frac{\lambda\theta}{\sigma_1}\right) d\theta,$$

and

$$T\_{-}(\mathbf{x}\_{i}) = \frac{2}{\sigma\_{1}} \int\_{-\infty}^{0} \exp\left(\frac{\mathbf{x}\_{i}\theta}{\sigma^{2}}\right) \phi\left(\sqrt{\frac{1}{\sigma\_{1}^{2}} + \frac{1}{\sigma^{2}}}\theta\right) \Phi\left(\frac{\lambda\theta}{\sigma\_{1}}\right) d\theta.$$

The sets $D_-^B$ and $D_+^B$ can be written as

$$D_-^{B} = \{ i : \mathbf{x}_i < -c_1 \} \text{ and } D_+^{B} = \{ i : \mathbf{x}_i > c_2 \},$$

where $c_1 > 0$ and $c_2 > 0$ are determined, as shown in Figure 1, by the points of intersection of $y = p/(1-p)$ with $y = T_-(x)$ and with $y = T_+(x)$, respectively. Note that when λ > 0, the intersection point Q (as shown in the figure) lies to the left of x = 0, and when λ < 0, Q lies to the right of x = 0. Thus, when λ > 0, $c_1 > c_2$, and the opposite holds when λ < 0. When λ = 0, $T_-(x) = T_+(-x)$ and thus $c_1 = c_2$. If $\lambda \to \infty$, then $T_-(x) \to 0$ and thus $D_-^B$ is empty, which is equivalent to a one-tailed test. As discussed in Section 3, the procedure based on Eq. (13) by itself does not control the BDFDR. However, $c_1$ and $c_2$ can be shrunk further so that the resulting procedure achieves BDFDR ≤ α; see [12] for details.

To illustrate the above procedure, and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistic, we consider the HIV data described in [14]. For a detailed analysis, see [12]; here, we describe the analysis very briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample t-statistic comparing the infected versus the uninfected subjects, which was then transformed to a z-value $z_i = \Phi^{-1}\{F_6(t_i)\}$. Here, $F_6(\cdot)$ denotes the cumulative distribution

Figure 1. Graph of $T_+(x)$ and $T_-(x)$ with cutoff values $-c_1$ and $c_2$ such that $T_+(x) \ge \frac{p}{1-p}$ and $T_-(x) \ge \frac{p}{1-p}$.

Figure 2. Histogram of the HIV data with cutoff points by BY and the Bayes method under skew-normal prior.

function (cdf) of the t-distribution with six degrees of freedom. Figure 2 shows the histogram of the z-values with a skew-normal fit. Although the null distribution of $Z_i$ should be $N(0, 1)$, following the suggestion in [11] we use $N(-0.11, 0.75^2)$ as the null distribution. Thus, we formulate our problem as testing hypotheses (7) with test statistics $Z_i \sim N(-0.11 + \theta_i, 0.75^2)$.

The BY procedure resulted in cutoffs (−3.94, 3.94), yielding 18 total discoveries, with two genes declared under-expressed and 16 over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates $\hat{p} = 0.9$, $\hat{\sigma} = 0.79$, $\hat{\sigma}_1 = 1.54$, and $\hat{\lambda} = 0.22$. The Bayes procedure ended up with cutoff points (−2.82, 2.70) and a total of 86 discoveries (23 under-expressed genes and 63 over-expressed genes). Note that the number of discoveries by the Bayes rule is much higher than by the BY procedure.

### 5. Concluding remarks


There are many different methods for testing multiple hypotheses; the methodology, however, depends on the criterion we choose. When the dimension of the multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion, which safeguards against making even one false discovery. However, when the dimension is very high, the FWER is not very useful; instead, a false discovery rate (FDR) criterion is a good approach. Although the FDR was originally defined as a frequentist concept, it can be re-interpreted in a Bayesian framework. The Bayesian framework brings many advantages: for example, a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle, and skewness in the alternatives is easy to incorporate. A drawback is that we need to make an assumption about the prior distributions under the alternatives. Some work has been done based on nonparametric priors; however, much more work is needed.

### Author details

Naveen K. Bansal

Address all correspondence to: naveen.bansal@mu.edu

Department of Mathematics, Statistics, and Computer Science, Marquette University, Milwaukee, WI, USA

### References


[1] Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57(1):289-300

[2] Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979;6(2):65-70

[3] Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001;29(4):1165-1188

[4] Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96(456):1151-1160

[5] Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society B. 2002;64(3):479-498

[6] Storey JD. The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics. 2003;31(6):2013-2035

[7] Shaffer JP. Multiplicity, directional (Type III) errors, and the null hypothesis. Psychological Methods. 2002;7(3):356-369

[8] Benjamini Y, Yekutieli D. False discovery rate controlling confidence intervals for selected parameters. Journal of the American Statistical Association. 2005:71-80

[9] Bansal NK, Miescke KJ. A Bayesian decision theoretic approach to directional multiple hypotheses problems. Journal of Multivariate Analysis. 2013:205-215

[10] Bansal NK, Jiang H, Pradeep P. A Bayesian methodology for detecting targeted genes under two related experiments. Statistics in Medicine. 2015;34(25):3362-3375

[11] Efron B. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association. 2007:93-103

[12] Bansal NK, Hamedani GG, Maadooliat M. Testing multiple hypotheses with skewed alternatives. Biometrics. 2016;72(2):494-502



### **Bayesian vs Frequentist Power Functions to Determine the Optimal Sample Size: Testing One Sample Binomial Proportion Using Exact Methods**

Valeria Sambucini

DOI: 10.5772/intechopen.70168

Additional information is available at the end of the chapter

### Abstract

In order to avoid the drawbacks of sample size determination procedures based on classical power analysis, it is possible to define analogous criteria based on 'hybrid classical-Bayesian' or 'fully Bayesian' approaches. We review these conditional and predictive procedures and provide an application, when the focus is on a binomial model and the analysis is performed through exact methods. The distinction between analysis and design prior distributions is essential for the practical implementation of the criteria: some guidelines for choosing these priors are discussed, and their impact on the required sample size is examined.

Keywords: analysis and design prior distributions, binomial proportion, Bayesian power functions, conditional and predictive approach, sample size determination, saw-toothed behaviour of power

### 1. Introduction

The calculation of an adequate sample size is a crucial aspect of experimental design. Researchers need to select the number of participants required to ensure ethically and scientifically valid results. If samples are too large, time and resources are wasted, often for minimal gain; on the other hand, samples that are too small may lead to inaccurate results. Therefore, sample size determination (SSD) plays a very important role in the design of studies in many fields, especially in the context of clinical trials, where, in addition to economic concerns, investigators have to deal with important ethical implications.

Sample size determination (SSD) methods, when the focus is on hypothesis testing, are typically related to the concept of power function. Let us denote the parameter of interest by θ and

© The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

let us assume that we are interested in testing $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$, where $\Theta_0$ and $\Theta_1$ form a partition of the parameter space $\Theta$. The most widely used frequentist SSD criterion consists in choosing the minimal sample size that guarantees a given power, for a fixed type I error rate, under the assumption that $\theta$ is equal to a suitable design value $\theta_D \in \Theta_1$. In practice, the idea is to ensure a sufficiently large probability of obtaining a statistically significant result (i.e. of rejecting the null hypothesis) when the true value of $\theta$ belongs to the alternative hypothesis and is equal to $\theta_D$. Many textbooks (see [1–3], among others) provide sample size formulas derived using this procedure for many commonly occurring situations, under different hypothesis tests and based on both categorical and quantitative data.

In the frequentist criterion described above, a crucial role is played by the design value that the trial is designed to detect with high probability, whose uncertainty is not accounted for. In fact, this local optimality is one of the most criticized aspects of the method. Moreover, the frequentist procedure does not allow the researcher to take into account pre-experimental information about $\theta$, for instance information available from previous studies. By adopting a 'hybrid classical-Bayesian' or a 'fully Bayesian' approach, it is possible to define analogous sample size selection criteria that avoid the problem of local optimality and/or introduce possible prior information into the SSD process.

In this chapter, we illustrate how to construct frequentist and Bayesian power functions, based on both conditional and predictive approaches, and how to use them to determine the optimal sample size. An essential element of the method is the use of two different prior distributions for the parameter of interest, which play two distinct roles in the criteria. The importance of this distinction in sample size determination problems has been stressed by several authors (see, for instance, [4–9] among others). The rest of the chapter is organized as follows: in Section 2, we review both the frequentist conditional and predictive procedures based on power analysis to determine the optimal sample size. Section 3 provides a description of analogous methods based on Bayesian power functions. Then, in Section 4, we formalize different SSD criteria that depend on the shape of the power curves as a function of the sample size and, as a consequence, on the nature of the data distributions. Furthermore, in Section 5, we illustrate an application of the frequentist and Bayesian SSD procedures, when the parameter of interest is a single binomial proportion. Finally, Section 6 contains a brief final discussion.

### 2. Frequentist power functions and SSD methods

Let us consider a parameter of interest θ and assume that we are interested in testing H<sup>0</sup> : θ ∈ Θ<sup>0</sup> versus H<sup>1</sup> : θ ∈ Θ1, where Θ<sup>0</sup> and Θ<sup>1</sup> form a partition of the parameter space Θ. Moreover, let Yn be the random result of the experiment that is typically a suitable statistic used to summarize the data relevant to the parameter θ. In the notation, we have highlighted that Yn depends on the sample size n. Finally, we denote by fn(yn|θ) the sampling distribution of Yn.

The power function is defined as the probability of obtaining a statistically significant result, that is, one that leads to rejecting the null hypothesis H<sub>0</sub>, when the actual value of the parameter is θ. In a frequentist approach, the investigator is first required to specify a fixed level α for the type I error probability that one is willing to tolerate. This significance level is typically set equal to 0.05 and is used to obtain the rejection region of H<sub>0</sub>, denoted by R<sub>H<sub>0</sub></sub>, which is the subset of outcomes that, if observed, lead to the rejection of H<sub>0</sub>. Therefore, given a frequentist test of size α, Y<sub>n</sub> is considered a statistically significant result if it belongs to R<sub>H<sub>0</sub></sub>. Consequently, in general terms, the power function is defined as

$$\eta(n,\theta) = \mathbb{P}\_{\theta}(Y\_n \in R\_{H\_0}),\tag{1}$$

where P<sub>θ</sub> is the probability measure associated with a suitable distribution of Y<sub>n</sub>.

78 Bayesian Inference

In order to exploit the frequentist power function in Eq. (1) for sample size determination purposes, investigators can adopt two different approaches: the conditional and the predictive one. The conditional approach is certainly the most widely known and used, when performing sample size calculations based on pre-study power analysis. It requires the specification of a suitable design value for θ, denoted by θD, that belongs to the alternative hypothesis and is considered a relevant value important to detect. By assuming that the true value of the parameter is equal to θD, we obtain the frequentist conditional power given by

$$\eta\_F^C(n, \theta^D) = \mathbb{P}\_{f\_n(\cdot|\theta^D)}(Y\_n \in R\_{H\_0}),\tag{2}$$

where P<sub>f<sub>n</sub>(·|θ<sup>D</sup>)</sub> is the probability measure associated with the sampling distribution of Y<sub>n</sub> when θ = θ<sup>D</sup>. Since θ<sup>D</sup> has to be selected within the subspace Θ<sub>1</sub>, the conditional frequentist power can be interpreted as the probability of correctly rejecting H<sub>0</sub> when the true value of the parameter belongs to the alternative hypothesis and is exactly equal to θ<sup>D</sup>. Then, the sample size determination criterion consists of choosing the minimal sample size that guarantees a desired level for η<sub>F</sub><sup>C</sup>(n, θ<sup>D</sup>). In practice, the idea is to ensure a sufficiently large probability of rejecting H<sub>0</sub> when the true θ belongs to the alternative hypothesis and, more specifically, is equal to θ<sup>D</sup> ∈ Θ<sub>1</sub>.
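To make the conditional criterion concrete, consider the standard textbook case (not one of the chapter's own examples, which use binomial data in Section 5) in which Y<sub>n</sub> is the mean of n normal observations with known standard deviation σ and we test H<sub>0</sub> : θ ≤ θ<sub>0</sub> against H<sub>1</sub> : θ > θ<sub>0</sub>. The following is a minimal Python sketch, with all numerical values purely illustrative:

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(n, theta0, thetaD, sigma, alpha=0.05):
    """Frequentist conditional power for a one-sided z-test on a normal
    mean with known sigma: H0 is rejected when the sample mean exceeds
    theta0 + z_{1-alpha} * sigma / sqrt(n)."""
    z_alpha = norm.ppf(1 - alpha)
    # P(Y_n in R_H0) when theta = thetaD
    return norm.sf(z_alpha - (thetaD - theta0) * sqrt(n) / sigma)

def minimal_n(theta0, thetaD, sigma, alpha=0.05, gamma=0.80):
    """Smallest n such that the conditional power exceeds gamma."""
    n = 1
    while conditional_power(n, theta0, thetaD, sigma, alpha) <= gamma:
        n += 1
    return n
```

With θ<sub>0</sub> = 0, θ<sup>D</sup> = 0.5, σ = 1, α = 0.05 and γ = 0.8 this returns n = 25, matching the familiar closed form n = ⌈((z<sub>1−α</sub> + z<sub>γ</sub>)σ/(θ<sup>D</sup> − θ<sub>0</sub>))²⌉.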

The SSD procedure based on the power function in Eq. (2) is strongly affected by the choice of θD. In order to account for uncertainty in the specification of the design value and to avoid local optimality, it is natural to incorporate Bayesian concepts into the sample size determination process. By adopting a 'hybrid classical-Bayesian approach', it is possible to model uncertainty on the appropriate design value for θ through the elicitation of a prior distribution, denoted by πD(θ) and called design prior. This prior is used to compute the marginal or prior predictive distribution of the data by averaging the sampling distribution as follows:

$$m\_n^D(y\_n) = \int\_{\Theta} f\_n(y\_n|\theta) \pi^D(\theta) d\theta. \tag{3}$$

Therefore, the design prior cannot be a non-informative improper distribution, since m<sub>n</sub><sup>D</sup>(y<sub>n</sub>) must be well defined. In any case, eliciting a non-informative π<sup>D</sup>(θ) would not be a reasonable choice. In fact, the design prior is used to introduce uncertainty about the suitable design value for θ that we need to specify when using the SSD procedure previously described, and the possible guessed values have to belong to the subspace Θ<sub>1</sub>. Thus, π<sup>D</sup>(θ) serves to describe a design scenario of interest that supports values of θ under the alternative hypothesis: it has to be an informative distribution that assigns a negligible probability to values of θ under the null hypothesis.
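As an illustration (a sketch under assumed conjugacy, not an example from the chapter): when f<sub>n</sub>(y<sub>n</sub>|θ) is binomial and the design prior is a Beta(a, b) density supported mainly on Θ<sub>1</sub>, the integral in Eq. (3) is available in closed form as a beta-binomial probability mass function:

```python
from scipy.stats import betabinom

def marginal_pmf(y, n, a, b):
    """Prior predictive m_n^D(y_n) of Eq. (3) for a Binomial(n, theta)
    likelihood averaged over a Beta(a, b) design prior; the mixture is
    exactly the beta-binomial distribution."""
    return betabinom.pmf(y, n, a, b)
```

An informative choice such as Beta(8, 12) (prior mean 0.4, relatively little mass below 0.2) plays the role of the design scenario described above.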

Once the design prior has been elicited, the idea is to average the conditional frequentist power with respect to it by computing

$$\begin{split} \int\_{\Theta} \eta\_F^C(n, \theta) \pi^D(\theta) d\theta &= \int\_{\Theta} \left[ \int\_{R\_{H\_0}} f\_n(y\_n|\theta) dy\_n \right] \pi^D(\theta) d\theta \\ &= \int\_{R\_{H\_0}} m\_n^D(y\_n) dy\_n. \end{split} \tag{4}$$

This leads to the frequentist predictive power that is given by

$$\eta\_F^P(n, \pi^D) = \mathbb{P}\_{m\_n^D(\cdot)}(Y\_n \in R\_{H\_0}),\tag{5}$$

where P<sub>m<sub>n</sub><sup>D</sup>(·)</sub> is the probability measure associated with the marginal distribution of Y<sub>n</sub> obtained using π<sup>D</sup>(θ). The power function in Eq. (5) expresses the probability of making a correct decision by rejecting H<sub>0</sub> when θ actually belongs to the subspace defined under the alternative hypothesis, where we can assume that it is distributed according to the design prior. Therefore, the corresponding SSD criterion requires selecting the minimum n that achieves a desired level for η<sub>F</sub><sup>P</sup>(n, π<sup>D</sup>).

Note that if πD(θ) is chosen as a point mass distribution centred on θD, no uncertainty on the relevant design values is taken into account and the marginal distribution coincides with the sampling one. In this case, there is no difference between the frequentist power functions obtained under the conditional and the predictive approach.

### 3. Bayesian power functions and SSD methods

In the previous section, we described how to select the sample size through power functions by assuming that a frequentist analysis will be performed at the end of the study. In both the frequentist conditional and predictive powers, the decision about the two hypotheses is based on the construction of the rejection region of H<sub>0</sub> for a classical test of fixed size α. A major limitation of the fully classical and the hybrid classical-Bayesian approaches previously introduced is their inability to incorporate past experience and information about the unknown parameter, as well as expert prior opinions. The use of a 'fully Bayesian approach' allows one to take important knowledge and beliefs about θ into account when planning the study.

It is well known that the information available before starting the study can be expressed by introducing a prior distribution for θ, π<sup>A</sup>(θ), which in this context is typically called the analysis prior to distinguish it from the design prior. It is worth pointing out that π<sup>A</sup>(θ) is the usual prior distribution employed in a Bayesian analysis: it formalizes pre-experimental knowledge, often represented by historical data, together with the subjective opinions of experts, and is used to compute the posterior distribution of the parameter, π<sub>n</sub><sup>A</sup>(θ|y<sub>n</sub>) ∝ f<sub>n</sub>(y<sub>n</sub>|θ)π<sup>A</sup>(θ). Moreover, it is often chosen as a non-informative distribution to avoid the inclusion of external evidence in the posterior inference.

Let us recall that, in general terms, a power function is defined as the probability of obtaining a significant result, i.e. a result that leads to the rejection of the null hypothesis. Then, to exploit this function as a useful tool to determine the optimal sample size, we need to compute it under the assumption that the alternative hypothesis is true. In practice, we have to consider a design scenario where the true θ belongs to Θ<sub>1</sub>, so that the power function represents the probability of making a correct decision. Therefore, to define power functions from a Bayesian point of view, we first need to decide when the null hypothesis is rejected in a Bayesian setting, that is, we have to establish the condition for 'Bayesian significance'. Following Spiegelhalter et al. [10], we define the result Y<sub>n</sub> as 'significant from a Bayesian perspective' if the corresponding posterior probability that θ belongs to the alternative hypothesis is sufficiently large, that is, if

$$\mathbb{P}\_{\pi\_n^A(\cdot|Y\_n)}(\theta \in \Theta\_1) > \lambda,\tag{6}$$

where P<sub>π<sub>n</sub><sup>A</sup>(·|Y<sub>n</sub>)</sub> denotes the probability measure associated with the posterior distribution of θ computed using the analysis prior, and λ ∈ (0, 1) represents a suitably specified threshold. Let us stress that, since we are dealing with a pre-experimental problem, the posterior probability in Eq. (6) is a random variable, depending on a random result that has not yet been observed. In order to construct Bayesian power functions, we need to compute the probability of obtaining a Bayesian significant result. Similar to the frequentist case, we can use two alternative distributions of the data, according to the approach we decide to adopt.
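For a binomial experiment with a conjugate Beta(a, b) analysis prior, the posterior is Beta(a + y<sub>n</sub>, b + n − y<sub>n</sub>) and the condition in Eq. (6), with Θ<sub>1</sub> = (θ<sub>0</sub>, 1) as in Section 5, can be checked in one line. A Python sketch, where the uniform prior a = b = 1 and the threshold λ = 0.9 are illustrative assumptions:

```python
from scipy.stats import beta

def bayes_significant(y, n, theta0, a=1.0, b=1.0, lam=0.9):
    """Condition (6): the observed y is 'significant from a Bayesian
    perspective' if the posterior probability that theta exceeds theta0
    is larger than lam. beta.sf gives P(theta > theta0 | y)."""
    return beta.sf(theta0, a + y, b + n - y) > lam
```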

The conditional approach realizes the pre-experimental assumption that the alternative hypothesis is true by fixing a design value θ<sup>D</sup> ∈ Θ<sub>1</sub>, which is considered relevant and important to detect. Then the sampling distribution of Y<sub>n</sub> conditional on θ<sup>D</sup>, f<sub>n</sub>(·|θ<sup>D</sup>), is used to compute the probability of attaining Bayesian significance. In this way, we obtain the Bayesian conditional power

$$\eta\_B^C(n, \theta^D) = \mathbb{P}\_{f\_n(\cdot|\theta^D)} \left( \mathbb{P}\_{\pi\_n^A(\cdot|Y\_n)} (\theta \in \Theta\_1) > \lambda \right). \tag{7}$$

The predictive approach, instead, aims at avoiding the problem of local optimality in the SSD procedure by introducing a design prior for θ, π<sup>D</sup>(θ), that accounts for the additional uncertainty involved in the choice of the design value θ<sup>D</sup>. Then, the prior predictive distribution of Y<sub>n</sub>, m<sub>n</sub><sup>D</sup>(·), is computed and used in place of the sampling distribution conditional on θ<sup>D</sup>. This leads to the Bayesian predictive power

$$\eta\_B^P(n, \pi^D) = \mathbb{P}\_{m\_n^D(\cdot)} \left( \mathbb{P}\_{\pi\_n^A(\cdot|Y\_n)} (\theta \in \Theta\_1) > \lambda \right). \tag{8}$$

Both the power functions in Eqs. (7) and (8) express the probability of rejecting H<sub>0</sub> under a Bayesian framework, assuming that the true θ actually belongs to H<sub>1</sub>. In fact, we assume either that θ is equal to a specific value under the alternative hypothesis (conditional approach) or that θ lies in the subspace defined under the alternative hypothesis, where we can assume that it is distributed according to the design prior (predictive approach). The sample size determination criteria, therefore, require selecting the minimal sample size that ensures a sufficiently large level for η<sub>B</sub><sup>C</sup>(n, θ<sup>D</sup>) or η<sub>B</sub><sup>P</sup>(n, π<sup>D</sup>). Moreover, note that, when the specified design prior distribution assigns the whole probability mass to θ<sup>D</sup>, the two Bayesian power functions coincide, leading to the same optimal sample size.
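Continuing the conjugate sketch, in the binomial case Eqs. (7) and (8) become finite sums over the outcomes that satisfy the significance condition (6). The Beta hyperparameters below are hypothetical illustrations, not values from the chapter:

```python
from scipy.stats import beta, binom, betabinom

def bayesian_powers(n, theta0, thetaD, a=1.0, b=1.0,
                    aD=8.0, bD=12.0, lam=0.9):
    """Bayesian conditional (Eq. 7) and predictive (Eq. 8) power for a
    binomial experiment with a Beta(a, b) analysis prior and a
    Beta(aD, bD) design prior."""
    # outcomes that are 'Bayesian significant' in the sense of Eq. (6)
    sig = [y for y in range(n + 1)
           if beta.sf(theta0, a + y, b + n - y) > lam]
    cond = sum(binom.pmf(y, n, thetaD) for y in sig)      # Eq. (7)
    pred = sum(betabinom.pmf(y, n, aD, bD) for y in sig)  # Eq. (8)
    return cond, pred
```

As noted above, a design prior that degenerates to a point mass at θ<sup>D</sup> makes the two powers coincide; numerically, a very concentrated Beta(cθ<sup>D</sup>, c(1 − θ<sup>D</sup>)) with large c brings them arbitrarily close.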

### 4. SSD criteria according to the nature of the distribution of Yn

In this section, we explicitly formalize the SSD criteria based on frequentist and Bayesian power functions, according to the nature of the random result Yn. When Yn has a continuous distribution, each of the power functions previously introduced shows a monotonically increasing behaviour as a function of n. In this case, the SSD criteria sensibly select the minimum sample size to guarantee the desired level of power, that is

$$n\_F^C = \min \{ n \in \mathbb{N} \colon \eta\_F^C(n, \theta^D) > \gamma \},\tag{9}$$

$$n\_F^P = \min\{n \in \mathbb{N} \colon \eta\_F^P(n, \pi^D) > \gamma\},\tag{10}$$

$$n\_B^C = \min\{n \in \mathbb{N} \colon \eta\_B^C(n, \theta^D) > \gamma\},\tag{11}$$

$$n\_B^P = \min\{n \in \mathbb{N} \colon \eta\_B^P(n, \pi^D) > \gamma\},\tag{12}$$

for a conveniently chosen threshold γ ∈ (0, 1]. Let us remark that, in the notation for the optimal sample sizes, as well as in the notation for the power functions, the subscripts specify the approach (frequentist or Bayesian) adopted at the analysis stage. The superscripts, instead, indicate the approach (conditional or predictive) used to represent the design expectations. An application of the criteria formalized above is provided by Gubbiotti and De Santis [11], where it is assumed that the statistic Y<sub>n</sub> follows a normal distribution with mean equal to θ and known variance.

However, it may happen that η<sub>F</sub><sup>C</sup>(n, θ<sup>D</sup>), η<sub>F</sub><sup>P</sup>(n, π<sup>D</sup>), η<sub>B</sub><sup>C</sup>(n, θ<sup>D</sup>) and η<sub>B</sub><sup>P</sup>(n, π<sup>D</sup>) are not monotonically increasing functions of the sample size: this occurs when dealing with discrete distributions of Y<sub>n</sub>. In these cases, the power functions show an essentially increasing behaviour as a function of n, but with some small fluctuations. A suitable SSD criterion has to take this kind of behaviour into account. For instance, instead of selecting the smallest sample size that attains the condition of interest, it can be more appropriate to select the smallest sample size such that the condition is also fulfilled for all sample size values greater than it. Given a threshold γ ∈ (0, 1), the corresponding SSD criteria are

$$n\_F^C = \min \{ n^\* \in \mathbb{N} \colon \eta\_F^C(n, \theta^D) > \gamma, \ \forall n \ge n^\* \}, \tag{13}$$

$$n\_F^P = \min \{ n^\* \in \mathbb{N} \colon \eta\_F^P(n, \pi^D) > \gamma, \ \forall n \ge n^\* \}, \tag{14}$$

$$n\_B^C = \min \{ n^\* \in \mathbb{N} \colon \eta\_B^C(n, \theta^D) > \gamma, \,\,\forall n \ge n^\* \},\tag{15}$$

$$n\_B^P = \min\{n^\* \in \mathbb{N} \colon \eta\_B^P(n, \pi^D) > \gamma, \ \forall n \ge n^\*\}.\tag{16}$$

In this way, it is possible to avoid the paradox of having the condition of interest fulfilled for the selected sample size but no longer satisfied for some larger values of n.
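A direct implementation of the conservative criteria (13)–(16) scans the power curve up to a finite horizon and keeps the start of the last run on which the condition never fails. The horizon n_max is a practical assumption, since the ∀n condition cannot be checked to infinity numerically; this sketch works for any of the four power functions:

```python
def conservative_ssd(power, gamma, n_max=500):
    """Smallest n* such that power(n) > gamma for every n >= n*
    (verified up to the horizon n_max)."""
    n_star = None
    for n in range(1, n_max + 1):
        if power(n) > gamma:
            if n_star is None:
                n_star = n      # candidate start of an unbroken run
        else:
            n_star = None       # the run is broken: discard candidate
    return n_star
```

On a saw-toothed curve that dips back below γ after first crossing it, this returns a larger (safer) sample size than the naive criteria (9)–(12).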

### 5. Single binomial proportion using exact methods

In this section, we focus on exact procedures for one-sample testing problems with a binary response. For instance, in a clinical context, we could be interested in evaluating the efficacy of a new experimental treatment or drug that is received at the same dose by all the n patients enrolled in the trial. No comparisons with other therapies are involved. A binary response variable is considered, which takes value 1 if clinicians classify the patient as a responder to the therapy and 0 otherwise; the parameter of interest θ is therefore the true response rate (i.e. an unknown proportion). In these one-arm studies, θ is compared with a fixed target value, say θ<sub>0</sub>, which ideally represents the response rate of the current 'gold standard' therapy and is typically obtained from historical data. Values of θ greater than θ<sub>0</sub> suggest that the experimental drug can be considered sufficiently effective; therefore, the following hypotheses are considered

$$H\_0: \theta = \theta\_0 \quad \text{and} \quad H\_1: \theta > \theta\_0. \tag{17}$$

Single-arm studies of this kind are typically conducted in phase II of clinical trials, whose primary goal is not to definitively assess the efficacy of new drugs, but to screen out those that are ineffective. In practice, in the clinical development process of a new drug, phase II aims at preventing insufficiently promising treatments from reaching phase III, where randomized controlled trials, based on large groups of patients, are generally conducted.

It is important to point out that the power functions based on exact procedures usually do not have explicit forms. Hence, exact formulas for sample size calculations cannot be obtained. However, it is possible to proceed numerically by evaluating the conditions of interest for different increasing or decreasing values of the sample size, until reaching the optimal one. In the following sections, we provide the expressions of the frequentist and Bayesian power functions for non-comparative studies with binary responses. The saw-toothed shape of the power curves as a function of n is shown and, hence, the conservative criteria illustrated in the previous section are adopted. All the graphical and numerical results have been obtained by using the R programming language [12].

#### 5.1. Frequentist conditional power

In the statistical context described above, the number of responders out of the n patients treated with the new drug (i.e. the number of successes in n trials) is the natural statistic Yn we have to consider and its sampling distribution is

$$f\_n(y\_n|\theta) = \text{bin}(y\_n; n, \theta), \quad \text{for } y\_n = 0, \dots, n,\tag{18}$$

where bin(·; n, θ) denotes the probability mass function of a binomial distribution with parameters n and θ.

Let us consider the two hypotheses in Eq. (17). For a fixed significance level α and assuming that H<sup>0</sup> is true, there exists a non-negative integer r between 0 and n such that

$$\sum\_{i=r}^{n} \text{bin}(i; n, \theta\_0) \le \alpha \quad \text{and} \quad \sum\_{i=r-1}^{n} \text{bin}(i; n, \theta\_0) > \alpha. \tag{19}$$

Then, the rejection region at level α is R<sub>H<sub>0</sub></sub> = {y<sub>n</sub> ∈ {0, 1, …, n} : y<sub>n</sub> ≥ r}, where the critical value r can be expressed in symbols by

$$r = \min\left\{ k \in \{0, 1, \ldots, n\} : \sum\_{i=k}^{n} \text{bin}(i; n, \theta\_0) \le \alpha \right\}.\tag{20}$$

For a given design value θD, that has to be specified under the alternative hypothesis, the frequentist conditional power is provided by

$$\begin{split} \eta\_F^C(n, \theta^D) &= \mathbb{P}\_{f\_n(\cdot|\theta^D)}(Y\_n \in R\_{H\_0}) \\ &= \sum\_{y\_n=r}^n \text{bin}(y\_n; n, \theta^D). \end{split} \tag{21}$$

In practice, η<sub>F</sub><sup>C</sup>(n, θ<sup>D</sup>) is obtained by summing the probabilities of all the outcomes that belong to R<sub>H<sub>0</sub></sub>, when we assume that the true θ is equal to the design value.
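Eqs. (20) and (21) translate directly into code. The chapter's numerical results were produced in R [12]; the sketch below is an equivalent Python version using scipy, where `binom.sf(k - 1, n, theta)` returns the tail sum from k to n:

```python
from scipy.stats import binom

def critical_value(n, theta0, alpha=0.05):
    """Critical value r of Eq. (20): the smallest k such that
    P(Y_n >= k | theta0) <= alpha."""
    for k in range(n + 1):
        if binom.sf(k - 1, n, theta0) <= alpha:
            return k
    return n + 1  # rejection region is empty for this n

def freq_conditional_power(n, theta0, thetaD, alpha=0.05):
    """Eq. (21): probability of the rejection region under thetaD."""
    r = critical_value(n, theta0, alpha)
    return binom.sf(r - 1, n, thetaD)
```

For example, with n = 10, θ<sub>0</sub> = 0.2 and α = 0.05 the critical value is r = 5, and the conditional power at θ<sup>D</sup> = 0.4 is about 0.367; evaluating this for a range of n produces the saw-toothed curve of Figure 1.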

Figure 1 shows the behaviour of the frequentist conditional power as a function of n, when θ<sub>0</sub> = 0.2, θ<sup>D</sup> = 0.4 and α = 0.05. It is evident that η<sub>F</sub><sup>C</sup>(n, θ<sup>D</sup>) is not a monotonically increasing function of the sample size, because of the discrete nature of the sampling distribution of Y<sub>n</sub>.

Figure 1. Behaviour of η<sub>F</sub><sup>C</sup>(n, θ<sup>D</sup>) as a function of n, when θ<sub>0</sub> = 0.20, θ<sup>D</sup> = 0.4 and α = 0.05.

The reasons for this saw-toothed behaviour can be clarified by the numerical results presented in Table 1. Here, for all the possible values of the sample size between 3 and 50, we provide not only the level of the frequentist conditional power used to obtain Figure 1, but also the corresponding critical value r and the actual value of the type I error probability. Obviously, this latter value is always below the fixed threshold 0.05. Note that whenever the sample size is increased by one unit, the corresponding critical value r may either increase or remain constant. In the second case, both the actual type I error rate and the conditional frequentist power increase; in the first case, when the critical value also grows by one unit, they both decrease. To help in reading the table, the colours white and grey are used alternately to highlight blocks


Table 1. Numerical calculations related to Figure 1: sample sizes, corresponding critical values, frequentist conditional power and actual values for the type I error rate, when θ<sup>0</sup> = 0.20, θ<sup>D</sup> = 0.4 and α = 0.05.

of sample sizes with the same critical value: within each block both the power and the actual type I rate monotonically raise as n increases. But, in correspondence with the first sample size of the subsequent block, they both decrease. This determines the basically increasing behaviour of the power as a function of n, with some small fluctuations, which is represented in Figure 1. For additional discussion about the saw-toothed shape of the frequentist power function, the reader is referred to Chernick and Liu [13].

Now, the problem of which sample size to select arises because of the non-monotonic behaviour of $\eta_F^C(n, \theta^D)$. If we set the desired threshold γ for the power equal to 0.8, the smallest sample size that meets the power requirement is n = 35. At that sample size, the critical value is 12 and the power level is 0.8048. For n = 36, the critical value is still 12 and the power increases to 0.8380. However, the power drops below 0.8 to 0.7783 when n = 37, at which r = 13, and rises above 0.8 again when n = 38. Then $\eta_F^C(n, \theta^D)$ never decreases below 0.8 for sample sizes greater than 38. Therefore, instead of selecting the smallest n that attains the power condition, it can be more appropriate to consider the more conservative sample size criterion formalized in Section 4, according to which the optimal sample size is selected as

$$n_F^C = \min\{n^* \in \mathbb{N} \colon \eta_F^C(n, \theta^D) > \gamma, \ \forall n \ge n^*\}. \tag{22}$$

The criterion ensures that the power will not fall below the desired threshold for any larger sample size: in our specific case, it leads to selecting n = 38 instead of n = 35.
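The computations behind Eqs. (20)–(22) are easy to reproduce numerically. The sketch below (Python with SciPy; the function names and the finite search horizon are ours, not the chapter's) computes the critical value, the frequentist conditional power and the conservative optimal sample size for the working example θ0 = 0.2, θD = 0.4, α = 0.05 and γ = 0.8.

```python
from scipy.stats import binom

def critical_value(n, theta0=0.2, alpha=0.05):
    # Eq. (20): smallest k such that P(Y_n >= k | theta0) <= alpha
    for k in range(n + 1):
        if binom.sf(k - 1, n, theta0) <= alpha:
            return k
    return n + 1  # no rejection region at this sample size

def power_cond(n, thetaD=0.4, theta0=0.2, alpha=0.05):
    # Eq. (21): tail probability of the rejection region under thetaD
    r = critical_value(n, theta0, alpha)
    return binom.sf(r - 1, n, thetaD)

def optimal_n(gamma=0.8, horizon=200):
    # Eq. (22): smallest n* such that the power exceeds gamma for
    # every n >= n* (checked up to a finite horizon)
    last_fail = max(n for n in range(3, horizon)
                    if power_cond(n) <= gamma)
    return last_fail + 1
```

With these settings the sketch reproduces the values quoted in the text: r = 12 and power 0.8048 at n = 35, a dip to 0.7783 at n = 37 (where r = 13), and the conservative choice n = 38.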

#### 5.2. Frequentist predictive power

In order to model uncertainty in the specification of the design value, we need to adopt the hybrid classical-Bayesian approach described previously. We introduce a beta design prior density for θ, πD(θ) = beta(θ; αD, βD), which is used to obtain the prior predictive distribution of the data. It is well known that by averaging the binomial sampling distribution fn(yn|θ) with respect to the beta design prior, we obtain the following marginal distribution

$$m_n^D(y_n) = \text{beta-bin}(y_n; \alpha^D, \beta^D, n), \quad \text{for } y_n = 0, \dots, n,\tag{23}$$

where beta-bin(·; αD, βD, n) denotes the probability mass function of a beta-binomial distribution with parameters (αD, βD, n).

The design prior πD(θ) can be elicited in many different ways. One useful possibility consists in (i) setting the prior mode equal to the fixed design value θD, which investigators would choose within the subset under H<sup>1</sup> when using the conditional approach, and (ii) regulating the concentration of the distribution around its mode according to the degree of uncertainty one wishes to express. This can be done by using for the hyperparameters of πD(θ) the following expressions:

$$
\alpha^D = n^D \theta^D + 1 \quad \text{and} \quad \beta^D = n^D (1 - \theta^D) + 1,\tag{24}
$$

where θD is the prior mode and nD is a design parameter that can be interpreted as a prior sample size. The larger nD, the smaller the variance of the beta design prior. Therefore, we need to increase nD if we want to reduce uncertainty on the guessed values of θ. More specifically, if we set nD = ∞, the design prior of θ assigns all its probability mass to θD: in this case, no uncertainty is involved and the marginal distribution of the data coincides with the sampling distribution conditional on θD. We thus must set nD < ∞ to distinguish between the conditional and predictive approaches. In particular, once a prior mode θD has been selected, the researcher can choose nD by assuring a large level (say, very close to 1) for $\mathbb{P}_{\pi^D(\cdot)}(\theta > \theta_0)$, that is, the probability assigned by πD(θ) to the event θ > θ0. Let us assume, for instance, that θ0 = 0.2 and consider three possible choices for θD (i.e. 0.3, 0.4 and 0.5). For each of them, we compute the smallest nD such that $\mathbb{P}_{\pi^D(\cdot)}(\theta > \theta_0)$ is about equal to 0.999; the behaviour of the corresponding design priors is shown in Figure 2(a). Clearly, if the prior mode approaches θ0, we need to increase nD to guarantee that $\mathbb{P}_{\pi^D(\cdot)}(\theta > \theta_0) \simeq 0.999$. Moreover, for a fixed prior mode θD, if we decreased the value of nD with respect to the one used in the graph, $\mathbb{P}_{\pi^D(\cdot)}(\theta > \theta_0)$ would decrease. In fact, nD has been specified in order to express the minimum degree of prior enthusiasm about the efficacy of the treatment necessary to have the prior probability that θ exceeds the target θ0 at least equal to the chosen level 0.999. An alternative way of proceeding consists in choosing nD by ensuring a fixed level for the prior probability assigned to a symmetrical interval around the prior mode.

For instance, if we set θD = 0.4, we find that 255, 111 and 60 are the values of nD such that the probability that πD(θ) assigns to the intervals (0.3, 0.5), (0.25, 0.55) and (0.2, 0.6), respectively, is about equal to 0.999. The corresponding design prior distributions are shown in Figure 2(b). It is important to point out that all the design densities represented in both graphs of Figure 2 express the uncertainty in the suitable design value that is worthwhile to take into account when applying the SSD criteria based on power analysis. Thus, all the distributions assign a negligible probability to values of θ smaller than θ0, which are the values specified under H0.
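The elicitation rule in Eq. (24) and the search for nD described above can be sketched as follows (Python with SciPy; the function names and the search bound are ours). Note that, by construction, the prior mode is exactly θD.

```python
from scipy.stats import beta

def design_prior(thetaD, nD):
    # Eq. (24): hyperparameters giving prior mode thetaD and
    # prior sample size nD
    return nD * thetaD + 1, nD * (1 - thetaD) + 1

def prob_above(theta0, thetaD, nD):
    # P_{pi^D}(theta > theta0), the prior "enthusiasm" level
    a, b = design_prior(thetaD, nD)
    return beta.sf(theta0, a, b)

def smallest_nD(theta0, thetaD, level=0.999, nD_max=2000):
    # smallest prior sample size reaching the requested level
    for nD in range(1, nD_max):
        if prob_above(theta0, thetaD, nD) >= level:
            return nD
```

Scanning for the smallest nD in this way is how values such as those used later in the chapter (e.g. nD = 163, 43 and 20 for prior modes 0.3, 0.4 and 0.5, with θ0 = 0.2) can be obtained.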

Figure 2. Possible choices of the design prior distribution, when θ<sup>0</sup> = 0.2.

Once πD(θ) has been specified, the frequentist predictive power can be obtained by computing the probability of rejecting the null hypothesis at α level with respect to $m_n^D(y_n)$. Hence, we have

$$\begin{split} \eta_F^P(n, \pi^D) &= \mathbb{P}_{m_n^D(\cdot)}(Y_n \in \mathcal{R}_{H_0}) \\ &= \sum_{y_n=r}^{n} \text{beta-bin}(y_n; \alpha^D, \beta^D, n), \end{split} \tag{25}$$

where r is the critical value provided in Eq. (20). In practice, $\eta_F^P(n, \pi^D)$ is given by the sum of the probabilities of all the outcomes inside $\mathcal{R}_{H_0}$, computed under a design scenario according to which the true θ belongs to the interval (θ0, 1), where it is distributed according to the design prior density. Let us remark again that if the design prior is a point-mass distribution on θD (i.e. nD = ∞), the conditional and predictive frequentist power functions coincide.

Similarly to the frequentist conditional power, the predictive one also presents a saw-toothed shape as a function of n, since $m_n^D(y_n)$ is a discrete distribution. Therefore, we suggest adopting the conservative approach previously described and selecting

$$n\_F^P = \min \{ n^\* \in \mathbb{N} \, : \, \eta\_F^P(n, \pi^D) > \gamma, \,\, \forall n \ge n^\* \},\tag{26}$$

for a fixed desired threshold γ. Figure 3 shows the behaviour of the frequentist predictive power as a function of n for different choices of the design prior, when θ0 = 0.2 and α = 0.05. More specifically, we consider the three πD(θ) plotted in Figure 2(b), which are all centred on θD = 0.4 but have different degrees of concentration regulated by the value of nD. In each graph, we highlight the optimal sample size obtained according to the criterion in Eq. (26) when γ = 0.8. Note that the larger the nD, the smaller the degree of uncertainty we introduce through the design prior and, as a consequence, the smaller the optimal sample size. In fact, we obtain the optimal values 46, 42 and 39 for nD equal to 60, 111 and 255, respectively. If we set nD = ∞, we would retrieve the conditional criterion in Eq. (22), where no uncertainty is considered in specifying the design value, and the optimal n would be equal to 38 (see Figure 1). Moreover, let us fix again θ0 = 0.2, α = 0.05 and γ = 0.8 and consider the three design prior distributions in Figure 2(a), which are characterized by different prior modes. The evident difference between the prior scenarios represented by these design priors clearly affects the optimal sample size: we obtain the optimal values 157, 46 and 23 for (θD, nD) = (0.3, 163), (θD, nD) = (0.4, 43) and (θD, nD) = (0.5, 20), respectively.

Figure 3. Behaviour of $\eta_F^P(n, \pi^D)$ as a function of n for different choices of the design prior distribution, when θ0 = 0.2 and α = 0.05.
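Eqs. (25)–(26) only change the distribution used for the tail sum: the binomial of Eq. (21) is replaced by the beta-binomial marginal of Eq. (23). A sketch of the computation (Python with SciPy; the function names and the finite horizon are ours):

```python
from scipy.stats import binom, betabinom

def critical_value(n, theta0=0.2, alpha=0.05):
    # Eq. (20), unchanged: the rejection region stays frequentist
    for k in range(n + 1):
        if binom.sf(k - 1, n, theta0) <= alpha:
            return k
    return n + 1

def power_pred(n, nD, thetaD=0.4, theta0=0.2, alpha=0.05):
    # Eq. (25): average Eq. (21) over the beta design prior,
    # i.e. a beta-binomial tail probability
    aD, bD = nD * thetaD + 1, nD * (1 - thetaD) + 1
    r = critical_value(n, theta0, alpha)
    return betabinom.sf(r - 1, n, aD, bD)

def optimal_n(nD, gamma=0.8, horizon=300):
    # Eq. (26): conservative criterion on a finite horizon
    last_fail = max(n for n in range(3, horizon)
                    if power_pred(n, nD) <= gamma)
    return last_fail + 1
```

With γ = 0.8 this reproduces the ordering discussed above: the optimal sample size shrinks as nD grows (the chapter reports 46, 42 and 39 for nD = 60, 111 and 255).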

#### 5.3. Bayesian conditional power


When we decide to adopt a Bayesian approach to establish the statistical significance of the result, we need to introduce an analysis prior distribution for θ. In our specific case, it is computationally convenient to specify a beta analysis prior, πA(θ) = beta(θ; αA, βA): in this way, from conjugate analysis we obtain that the corresponding posterior distribution is still a beta density with updated parameters,

$$
\pi_n^A(\theta|y_n) = \text{beta}(\theta; \alpha^A + y_n, \beta^A + n - y_n). \tag{27}
$$

Through πA(θ), the researcher can incorporate in the SSD procedure pre-experimental knowledge, as well as sceptical or enthusiastic expert prior opinions about the efficacy of the experimental treatment. However, one of the most common ways of proceeding is to choose a non-informative density, or one based on very weak information, to let the posterior distribution be based almost entirely on the evidence in the data. We could, therefore, specify πA(θ) = beta(θ; 1, 1) or consider the non-informative Jeffreys prior. Alternatively, if we want to use informative analysis prior distributions, we can express the hyperparameters in terms of the prior mode θA and the prior sample size nA, that is

$$
\alpha^A = n^A \theta^A + 1 \quad \text{and} \quad \beta^A = n^A (1 - \theta^A) + 1. \tag{28}
$$

In this way, for instance, it is possible to express scepticism or optimism about large treatment effects by setting θA smaller or larger than the target θ0, respectively. Obviously, when θA < θ0, the larger the nA, the larger the degree of scepticism we wish to express; while, when θA > θ0, larger values of nA are used to increase the degree of enthusiasm we desire to take into account. However, the value nA = 1 is often used to obtain a weakly informative prior distribution. The upper panel of Figure 4 shows three possible choices for the analysis prior when θ0 = 0.2. These distributions are obtained by fixing the prior mode θA and then selecting nA so that $\mathbb{P}_{\pi^A(\cdot)}(\theta > \theta_0)$ (i.e. the probability assigned by πA(θ) to the event θ > θ0) is about equal to a desired level. More specifically, we have considered (i) a sceptical prior mode θA = 0.1 with $\mathbb{P}_{\pi^A(\cdot)}(\theta > \theta_0) \simeq 0.4$, (ii) a neutral prior mode θA = 0.2 with $\mathbb{P}_{\pi^A(\cdot)}(\theta > \theta_0) \simeq 0.6$ and, finally, (iii) an enthusiastic prior mode θA = 0.3 with $\mathbb{P}_{\pi^A(\cdot)}(\theta > \theta_0) \simeq 0.8$. The corresponding values of nA are 7, 14 and 4, respectively. These densities will be used to illustrate how the optimal sample sizes based on Bayesian powers are affected by the information formalized through the analysis priors.
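The conjugate update in Eq. (27) and the elicitation rule in Eq. (28) can be sketched as follows (Python with SciPy; the function names are ours):

```python
from scipy.stats import beta

def analysis_prior(thetaA, nA):
    # Eq. (28): hyperparameters giving prior mode thetaA and
    # prior sample size nA
    return nA * thetaA + 1, nA * (1 - thetaA) + 1

def posterior(yn, n, aA, bA):
    # Eq. (27): conjugate beta update after yn successes in n trials
    return aA + yn, bA + n - yn

def prob_above(theta0, a, b):
    # posterior probability that theta exceeds theta0
    return beta.sf(theta0, a, b)
```

For the neutral prior of Figure 4 (θA = 0.2, nA = 14), the hyperparameters are (αA, βA) = (3.8, 12.2), and the posterior probability that θ > θ0 grows with the number of observed responders.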

The random result Yn is defined as 'significant' from a Bayesian perspective if the corresponding posterior probability that θ > θ0 is sufficiently large. In symbols, we decide to reject the null hypothesis, on the basis of the result Yn, if the following condition is satisfied:

$$\mathbb{P}_{\pi_n^A(\cdot|Y_n)}(\theta > \theta_0) > \lambda,\tag{29}$$

where $\mathbb{P}_{\pi_n^A(\cdot|y_n)}$ is the probability measure associated with the posterior distribution in Eq. (27) and λ ∈ (0, 1) is a pre-specified threshold. It is worth noting that, for a given value of n, the posterior quantity $\mathbb{P}_{\pi_n^A(\cdot|Y_n)}(\theta > \theta_0)$ is an increasing function of Yn. As a consequence, we can find a non-negative integer $\tilde{r}$ between 0 and n, such that

$$\mathbb{P}_{\pi_n^A(\cdot|\tilde{r})}(\theta > \theta_0) > \lambda \quad \text{and} \quad \mathbb{P}_{\pi_n^A(\cdot|\tilde{r}-1)}(\theta > \theta_0) \le \lambda,\tag{30}$$

and we can claim that H0 is rejected if the observed number of responders yn is equal to or greater than $\tilde{r}$. In practice, $\tilde{r}$ represents the smallest number of successes such that the condition for Bayesian significance is satisfied; in symbols, it can be expressed as

$$\tilde{r} = \min\left\{ k \in \{0, 1, \ldots, n\} : \mathbb{P}_{\pi_n^A(\cdot|k)}(\theta > \theta_0) > \lambda \right\}.\tag{31}$$

Figure 4. Upper panel: possible choices of the analysis prior distribution, when θ0 = 0.2. Lower panel: behaviour of $\eta_B^C(n, \theta^D)$ as a function of n for each of the analysis prior distributions represented in the upper panel, when θ0 = 0.2, θD = 0.4 and λ = 0.9.

By considering a fixed design value θ<sup>D</sup> greater than θ0, the Bayesian conditional power is therefore obtained as

Bayesian vs Frequentist Power Functions to Determine the Optimal Sample Size: Testing One Sample Binomial… http://dx.doi.org/10.5772/intechopen.70168 91

$$\begin{split} \eta_B^C(n, \theta^D) &= \mathbb{P}_{f_n(\cdot|\theta^D)} \left( \mathbb{P}_{\pi_n^A(\cdot|Y_n)} (\theta > \theta_0) > \lambda \right) \\ &= \sum_{y_n=\tilde{r}}^{n} \text{bin}(y_n; n, \theta^D). \end{split} \tag{32}$$

Essentially, it is given by the sum of the probabilities of all the Bayesian significant results, computed assuming that the true θ is equal to θD.

Since we are dealing with discrete data, this power function too is not monotonically increasing as a function of n. Let us assume that θ0 = 0.20, θD = 0.4 and λ = 0.9. The detailed calculations shown in Table 2 can help to understand why $\eta_B^C(n, \theta^D)$ has the typical saw-toothed behaviour. For each sample size between 3 and 50, the table provides the corresponding value of $\tilde{r}$, the level of the Bayesian conditional power and the posterior probability that θ exceeds θ0 conditional on the result $\tilde{r}$. Clearly, these latter values are always larger than the threshold λ = 0.9. White and grey are used alternately to highlight blocks of sample sizes that share the same value of $\tilde{r}$. When the sample size grows but $\tilde{r}$ remains constant, $\mathbb{P}_{\pi_n^A(\cdot|\tilde{r})}(\theta > \theta_0)$ decreases, while $\eta_B^C(n, \theta^D)$ increases. However, when both n and $\tilde{r}$ are simultaneously increased by one unit, $\mathbb{P}_{\pi_n^A(\cdot|\tilde{r})}(\theta > \theta_0)$ jumps up, while the Bayesian power drops.

Because of the saw-toothed nature of the power curve, for a fixed threshold γ, the optimal sample size is selected using the conservative criterion, that is

$$n_B^C = \min\{n^* \in \mathbb{N} \colon \eta_B^C(n, \theta^D) > \gamma, \ \forall n \ge n^*\}.\tag{33}$$

The lower panel of Figure 4 shows the behaviour of the Bayesian conditional power as a function of n for each of the three analysis prior densities plotted in the upper panel, when θ0 = 0.2, θD = 0.4 and λ = 0.9. Each graph indicates the optimal sample size according to the criterion in Eq. (33) for γ = 0.8. As expected, as we move from sceptical prior opinions towards more enthusiastic beliefs about the efficacy of the experimental treatment, the required sample size decreases.
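The Bayesian counterpart of the critical value, $\tilde{r}$ in Eq. (31), and the conditional power of Eq. (32) can be computed along the same lines (Python with SciPy; the function names are ours, and the uniform beta(1, 1) analysis prior in the usage example is chosen purely for illustration, not taken from the chapter):

```python
from scipy.stats import beta, binom

def r_tilde(n, aA, bA, theta0=0.2, lam=0.9):
    # Eq. (31): smallest number of successes that is
    # 'Bayesian significant' at threshold lam
    for k in range(n + 1):
        if beta.sf(theta0, aA + k, bA + n - k) > lam:
            return k
    return n + 1

def power_bayes_cond(n, aA, bA, thetaD=0.4, theta0=0.2, lam=0.9):
    # Eq. (32): binomial tail beyond r_tilde under the design value
    rt = r_tilde(n, aA, bA, theta0, lam)
    return binom.sf(rt - 1, n, thetaD)
```

By construction, the returned $\tilde{r}$ satisfies both inequalities in Eq. (30), since the posterior probability is increasing in the number of successes.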

#### 5.4. Bayesian predictive power


Besides introducing pre-experimental information, if we also wish to model uncertainty on the design value, we have to consider the Bayesian predictive power. Therefore, as described in Section 5.3, we elicit an analysis prior distribution to obtain the beta posterior density $\pi_n^A(\theta|y_n)$. Moreover, following the indications provided in Section 5.2, we introduce a design prior distribution to construct the marginal distribution $m_n^D(y_n)$.

The Bayesian predictive power is computed by adding the probabilities of all the Bayesian significant results, evaluated under the design scenario expressed through the design prior. Thus, we have

$$\begin{split} \eta_B^P(n, \pi^D) &= \mathbb{P}_{m_n^D(\cdot)} \Big( \mathbb{P}_{\pi_n^A(\cdot|Y_n)}(\theta > \theta_0) > \lambda \Big) \\ &= \sum_{y_n=\tilde{r}}^n \text{beta-bin}(y_n; \alpha^D, \beta^D, n), \end{split} \tag{34}$$


| $n$ | $\tilde{r}$ | $\eta_B^C(n, \theta^D)$ | $\mathbb{P}_{\pi_n^A(\cdot\mid\tilde{r})}(\theta > \theta_0)$ | $n$ | $\tilde{r}$ | $\eta_B^C(n, \theta^D)$ | $\mathbb{P}_{\pi_n^A(\cdot\mid\tilde{r})}(\theta > \theta_0)$ |
|---|---|---|---|---|---|---|---|
| 3 | 3 | 0.0640 | 0.9263 | 27 | 9 | 0.8161 | 0.9077 |
| 4 | 4 | 0.0256 | 0.9703 | 28 | 10 | 0.7412 | 0.9464 |
| 5 | 4 | 0.0870 | 0.9558 | 29 | 10 | 0.7853 | 0.9354 |
| 6 | 4 | 0.1792 | 0.9377 | 30 | 10 | 0.8237 | 0.9230 |
| 7 | 4 | 0.2898 | 0.9159 | 31 | 10 | 0.8566 | 0.9092 |
| 8 | 5 | 0.1737 | 0.9618 | 32 | 11 | 0.7954 | 0.9460 |
| 9 | 5 | 0.2666 | 0.9476 | 33 | 11 | 0.8310 | 0.9356 |
| 10 | 5 | 0.3669 | 0.9304 | 34 | 11 | 0.8617 | 0.9239 |
| 11 | 5 | 0.4672 | 0.9102 | 35 | 11 | 0.8877 | 0.9110 |
| 12 | 6 | 0.3348 | 0.9559 | 36 | 12 | 0.8380 | 0.9460 |
| 13 | 6 | 0.4256 | 0.9422 | 37 | 12 | 0.8667 | 0.9362 |
| 14 | 6 | 0.5141 | 0.9260 | 38 | 12 | 0.8911 | 0.9252 |
| 15 | 6 | 0.5968 | 0.9075 | 39 | 12 | 0.9118 | 0.9131 |
| 16 | 7 | 0.4728 | 0.9518 | 40 | 13 | 0.8715 | 0.9464 |
| 17 | 7 | 0.5522 | 0.9388 | 41 | 13 | 0.8945 | 0.9371 |
| 18 | 7 | 0.6257 | 0.9237 | 42 | 13 | 0.9140 | 0.9267 |
| 19 | 7 | 0.6919 | 0.9065 | 43 | 13 | 0.9305 | 0.9153 |
| 20 | 8 | 0.5841 | 0.9491 | 44 | 13 | 0.9441 | 0.9028 |
| 21 | 8 | 0.6505 | 0.9367 | 45 | 14 | 0.9164 | 0.9381 |
| 22 | 8 | 0.7102 | 0.9226 | 46 | 14 | 0.9320 | 0.9284 |
| 23 | 8 | 0.7627 | 0.9067 | 47 | 14 | 0.9450 | 0.9176 |
| 24 | 9 | 0.6721 | 0.9474 | 48 | 14 | 0.9558 | 0.9059 |
| 25 | 9 | 0.7265 | 0.9357 | 49 | 15 | 0.9336 | 0.9394 |
| 26 | 9 | 0.7745 | 0.9225 | 50 | 15 | 0.9460 | 0.9301 |

Table 2. Numerical calculations to explain the saw-toothed behaviour of $\eta_B^C(n, \theta^D)$ as a function of n: sample sizes, the corresponding value of $\tilde{r}$, the Bayesian conditional power and the posterior probability that θ > θ0 when the observed result is equal to $\tilde{r}$ successes, for θ0 = 0.20, θD = 0.4 and λ = 0.9.

where $\tilde{r}$ is given in Eq. (31). Obviously, $\eta_B^P(n, \pi^D)$ also shows the typical saw-toothed behaviour as a function of n, because of the discrete nature of the beta-binomial marginal distribution of yn. Therefore, given a desired threshold γ and following the conservative approach previously used, we select the optimal sample size as

$$
n_B^P = \min\{n^* \in \mathbb{N} \colon \eta_B^P(n, \pi^D) > \gamma, \ \forall n \ge n^*\}. \tag{35}
$$



Table 3. $n_B^P$ for different choices of the analysis and the design priors, when θ0 = 0.2 and λ = 0.9.

In Table 3 we provide the values of $n_B^P$ for different choices of the analysis and the design prior densities. More specifically, we consider the three analysis priors plotted in the upper panel of Figure 4 and the design prior distributions represented in both panels of Figure 2, when θ0 = 0.2 and λ = 0.9. Similarly to what we have seen for the Bayesian conditional power, the sample sizes obtained under the sceptical analysis prior are uniformly larger than those obtained under the more enthusiastic distributions. As regards the impact of the design priors, it is straightforward to see that the stronger the degree of uncertainty on the appropriate design value expressed by πD(θ), the larger the required sample size. For instance, for a fixed prior mode of the design prior, $n_B^P$ increases as nD gets smaller (see Table 3(b), where θD = 0.4). However, let us note that more evident changes in the sample size can be appreciated when we compare the effects of design priors based on different prior modes (see the results in Table 3(a), where the design priors represent very distant design scenarios).

These Bayesian predictive SSD procedures, which include the conditional ones as a special case, have been exploited in Ref. [8] to construct single-arm two-stage design for phase II of clinical trials based on binary data. In Ref. [14], instead, an extension to the randomized case has been presented, while in Ref. [15] the same procedures have been implemented by adding the possibility of taking into account uncertainty in the historical response rate.
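Finally, the Bayesian predictive power of Eq. (34) combines the Bayesian threshold $\tilde{r}$ with the beta-binomial marginal, and Eq. (35) applies the usual conservative criterion. A sketch (Python with SciPy; the function names, the finite horizon and the illustrative priors in the test are ours, not the chapter's):

```python
from scipy.stats import beta, betabinom

def r_tilde(n, aA, bA, theta0=0.2, lam=0.9):
    # Eq. (31): smallest 'Bayesian significant' number of successes
    for k in range(n + 1):
        if beta.sf(theta0, aA + k, bA + n - k) > lam:
            return k
    return n + 1

def power_bayes_pred(n, aA, bA, aD, bD, theta0=0.2, lam=0.9):
    # Eq. (34): beta-binomial tail beyond r_tilde under the
    # marginal distribution of the data induced by the design prior
    rt = r_tilde(n, aA, bA, theta0, lam)
    return betabinom.sf(rt - 1, n, aD, bD)

def optimal_n(aA, bA, aD, bD, gamma=0.8, horizon=300):
    # Eq. (35): conservative criterion on a finite horizon
    last_fail = max(n for n in range(3, horizon)
                    if power_bayes_pred(n, aA, bA, aD, bD) <= gamma)
    return last_fail + 1
```

For instance, with a uniform beta(1, 1) analysis prior and the design prior of Eq. (24) with θD = 0.4 and nD = 60, the selected n is the smallest sample size beyond which the predictive power stays above γ = 0.8.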

### 6. Conclusions


Especially in clinical research, the pre-experimental power analysis is one of the most commonly used methods for sample size calculations. It is tacitly implied that the power function is constructed under a frequentist framework. However, it is possible to introduce Bayesian concepts in the power analysis to provide more flexibility to the sample size determination process.

When the power function is used as a tool to obtain the appropriate sample size, the general idea is to ensure a large probability of correctly rejecting the null hypothesis H<sub>0</sub> when it is actually false, because the true θ belongs to H<sub>1</sub>. Therefore, the conjecture that the alternative hypothesis is true is an essential element of the method. It can be realized by assuming that the true θ is equal to a fixed design value θ<sub>D</sub>, suitably selected inside H<sub>1</sub> (conditional approach); alternatively, we can introduce uncertainty about the guessed design value through a design prior distribution that assigns negligible probability to values of θ under H<sub>0</sub> (predictive approach). Moreover, the decision about the rejection of H<sub>0</sub> can be made under a frequentist framework or by performing a Bayesian analysis. In the latter case, any pre-experimental information available can be incorporated into the methodology through the specification of an analysis prior distribution. By combining frequentist and Bayesian procedures of analysis with both the conditional and predictive approaches, we obtain the four power functions described in this chapter. Let us remark that the Bayesian predictive power is the one that adds the most flexibility to the sample size calculations: it lets the researcher take into account prior knowledge as well as uncertainty on the design value. If desired, design uncertainty can be removed by considering a point-mass design distribution; on the other hand, if no prior information is available, it is possible to elicit a non-informative analysis prior and let the analysis be based entirely on the data.

### Author details

#### Valeria Sambucini

Address all correspondence to: valeria.sambucini@uniroma1.it

Department of Statistical Sciences, Sapienza Università di Roma, Sapienza, Italy

### References


[1] Ryan TP. Sample Size Determination and Power. Hoboken: Wiley; 2013

[2] Chow SC, Wang H, Shao J. Sample Size Calculations in Clinical Research. 2nd ed. Boca Raton: Chapman and Hall/CRC; 2008

[3] Julious SA. Sample Sizes for Clinical Trials. Boca Raton: Chapman and Hall/CRC; 2010

[4] Wang F, Gelfand AE. A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science. 2002;17(2):193-208. DOI: 10.1214/ss/1030550861

[5] De Santis F. Sample size determination for robust Bayesian analysis. Journal of the American Statistical Association. 2006;101(473):278-291. DOI: 10.1198/016214505000000510

[6] Sahu SK, Smith TMF. A Bayesian method of sample size determination with practical applications. Journal of the Royal Statistical Society: Series A. 2006;169:235-253. DOI: 10.1111/j.1467-985X.2006.00408.x

[7] Brutti P, De Santis F, Gubbiotti S. Robust Bayesian sample size determination in clinical trials. Statistics in Medicine. 2008;27(13):2290-2306. DOI: 10.1002/sim.3175


### **Converting Graphic Relationships into Conditional Probabilities in Bayesian Network**

Loc Nguyen

DOI: 10.5772/intechopen.70057

Additional information is available at the end of the chapter

### Abstract

Bayesian network (BN) is a powerful mathematical tool for prediction and diagnosis applications. A large Bayesian network can be composed of many simple networks, which in turn are constructed from simple graphs. A simple graph consists of one child node and many parent nodes. The strength of each relationship between a child node and a parent node is quantified by a weight, and all relationships share the same semantics, such as prerequisite, diagnostic, and aggregation. The research focuses on converting graphic relationships into conditional probabilities in order to construct a simple Bayesian network from a graph. The diagnostic relationship is the main research object, for which a sufficient diagnostic proposition is proposed to validate the diagnostic relationship. Relationship conversion adheres to logic gates such as AND, OR, and XOR, which are essential features of the research.

Keywords: diagnostic relationship, Bayesian network, transformation coefficient

### 1. Introduction

Bayesian network (BN) is a directed acyclic graph (DAG) that consists of a set of nodes and a set of arcs. Each node is a random variable. Each arc represents a relationship between two nodes. The strength of a relationship in a graph can be quantified by a number called a weight. There are some important relationships such as prerequisite, diagnostic, and aggregation. The difference between a BN and a normal graph is that the strength of every relationship in a BN is represented by a conditional probability table (CPT) whose entries are conditional probabilities of a child node given its parent nodes. There are two main approaches to construct a BN, which are as follows:

• The first approach aims to learn the BN from training data by machine learning algorithms.

• The second approach is that experts define some graph patterns according to specific relationships, and then the BN is constructed based on such patterns along with determined CPTs.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This research focuses on the second approach, in which relationships are converted into CPTs. Essentially, relationship conversion aims to determine conditional probabilities based on the weights and meanings of relationships. We will have different ways to convert graphic weights into CPTs for different relationships. It is impossible to convert all relationships, but some of them, such as diagnostic, aggregation, and prerequisite, are mandatory ones that we must specify as computable CPTs of the BN. Especially, these relationships adhere to logic X-gates [1] such as AND-gate, OR-gate, and SIGMA-gate. The X-gate inference in this research is derived and inspired from the noisy OR-gate described in the book "Learning Bayesian Networks" by Neapolitan ([2], pp. 157–159). Díez and Druzdzel [3] also researched OR/MAX, AND/MIN, and noisy XOR inferences, but they focused on canonical models, deterministic models, and ICI models, whereas I focus on logic gates and graphic relationships. So, their research is different from mine, but we share the same result, which is the AND-gate model. In general, my research focuses on applied probability adhered to Bayesian networks, logic gates, and Bayesian user modeling [4]. The scientific results are shared with Millán and Pérez-de-la-Cruz [4].

Factor graph [5] represents the factorization of a global function into many partial functions. If the joint distribution of a BN is considered as the global function and CPTs are considered as partial functions, the sum-product algorithm [6] of factor graph can be applied to calculate posterior probabilities of variables in the BN. Pearl's propagation algorithm [7] is very successful in BN inference. The application of factor graph to BN is only realized if all CPTs of the BN are already determined, whereas this research focuses on defining such CPTs first. I did not use factor graph for constructing the BN. The concept "X-gate inference" only implies how to convert a simple graph into a BN. However, the arranged sum with a fixed variable mentioned in this research is the "not-sum" ([6], p. 499) of factor graph. Essentially, the X-gate probability shown in Eq. (10) is the same as the λ message in Pearl's algorithm ([6], p. 518), but I use the most basic way to prove the X-gate probability.

By default, the research is applied in a learning context in which the BN is used to assess students' knowledge. Evidences are tests, exams, exercises, etc. and hypotheses are learning concepts, knowledge items, etc. Note that the diagnostic relationship is very important to Bayesian evaluation in the learning context because it is used to evaluate a student's mastery of concepts (knowledge items) over the entire BN. Now, we start relationship conversion with a research on the diagnostic relationship in the next section.

### 2. Diagnostic relationship

In some opinions like mine, the diagnostic relationship should be from hypothesis to evidence. For example, disease is hypothesis and symptom is evidence. The symptom must be conditionally dependent on the disease. Given a symptom, calculating the posterior probability of the disease is essentially to diagnose the likelihood of such disease ([8], p. 1666). Inversely, the arc from evidence to hypothesis implies prediction, where evidence and hypothesis represent observation and event, respectively. Given an observation, calculating the posterior probability of the event is essentially to predict/assert such event ([8], p. 1666). Figure 1 shows diagnosis and prediction.

Figure 1 depicts the simplest graph with two random variables, in which the weight w of the relationship between X and D is 1. We need to convert the diagnostic relationship into conditional probabilities in order to construct the simplest BN from the simplest graph. Note that the hypothesis is binary but the evidence can be numerical. In the learning context, evidence D can be a test, exam, exercise, etc. The conditional probability of D given X (likelihood function) is P(D|X). The posterior probability of X is P(X|D), which is used to evaluate a student's mastery of concept (hypothesis) X given evidence D. Eq. (1) specifies the CPT of D when D is binary (0 and 1)

$$P(D|X) = \begin{cases} \quad \text{ } D \text{ if } X = 1\\ 1 - D \text{ if } X = 0 \end{cases} \tag{1}$$

Eq. (1) is our first relationship conversion. It implies


$$P(D|X=0) + P(D|X=1) = D + 1 - D = 1$$

Evidence D can be used to diagnose hypothesis X if the so-called sufficient diagnostic proposition is satisfied, as seen in Table 1.

The concept of sufficient evidence is borrowed from the concept of sufficient statistics, and it is inspired by the equivalence of variables T and T' in the research ([4], pp. 292-295). The proposition can be restated as: evidence D is only used to assess hypotheses if it is sufficient evidence. As a convention, the proposition is called the diagnostic condition, and hypotheses are assumed to have uniform distribution. The assumption of hypothetic uniform distribution (P(X = 1) = P(X = 0)) implies that we cannot assert whether or not a given hypothesis is true before we observe its evidence.

In the learning context, D can be fully used to assess a student's mastery of X if the diagnostic condition is satisfied. Derived from this condition, Eq. (2) specifies the transformation coefficient k given the uniform distribution of X.

Figure 1. Diagnosis and prediction with hypothesis X and evidence D.

D is equivalent to X in the diagnostic relationship if P(X|D) = kP(D|X) given the uniform distribution of X, where the transformation coefficient k is independent of D. In other words, k is constant with regard to D, and so D is called sufficient evidence.

Table 1. Sufficient diagnostic proposition.

$$k = \frac{P(X|D)}{P(D|X)}\tag{2}$$

We need to prove that Eq. (1) satisfies the diagnostic condition. Suppose the prior probability of X is uniform.

$$P(X=0) = P(X=1)$$

we have

$$P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)P(X)}{P(D|X=0)P(X=0) + P(D|X=1)P(X=1)}$$

(due to Bayes' rule)

$$\begin{aligned} &= \frac{P(D|X)P(X)}{P(X)\Big(P(D|X=0) + P(D|X=1)\Big)} \\\\ &\qquad \Big(\text{due to } P(X=0) = P(X=1)\Big) \\\\ &= \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = 1 \ast P(D|X) \\\\ &\qquad \Big(\text{due to } P(D|X=0) + P(D|X=1) = 1\Big) \end{aligned}$$

It is easy to infer that the transformation coefficient k is 1 if D is binary. In practice, evidence D is often a test whose grade ranges within the set {0, 1, 2, …, η}. Eq. (3) specifies the CPT of D in this case

$$P(D|\mathbf{X}) = \begin{cases} \frac{D}{S} \text{if } \mathbf{X} = \mathbf{1} \\\\ \frac{\eta}{S} - \frac{D}{S} \text{if } \mathbf{X} = \mathbf{0} \end{cases} \tag{3}$$

Where

$$D \in \{0, 1, 2, \dots, \eta\}$$

$$S = \sum\_{D=0}^{\eta} D = \frac{\eta(\eta + 1)}{2}$$

As a convention, P(D|X) = 0, ∀D ∉ {0, 1, 2, …, η}. Eq. (3) implies that if the student has mastered the concept (X = 1), the probability that she/he completes the exercise/test D is proportional to her/his mark on D, that is, P(D|X = 1) = D/S. We also have

$$P(D|\mathbf{X}=0) + P(D|\mathbf{X}=1) = \frac{D}{S} + \frac{\eta - D}{S} = \frac{\eta}{S} = \frac{2}{(\eta + 1)}$$

$$\sum\_{D=0}^{\eta} P(D|\mathbf{X}=1) = \sum\_{D=0}^{\eta} \frac{D}{S} = \frac{\sum\_{D=0}^{\eta} D}{S} = \frac{S}{S} = 1$$

$$\sum\_{D=0}^{\eta} P(D|\mathbf{X}=0) = \sum\_{D=0}^{\eta} \frac{\eta - D}{S} = \frac{\sum\_{D=0}^{\eta} (\eta - D)}{S} = \frac{\sum\_{D=0}^{\eta} \eta - \sum\_{D=0}^{\eta} D}{S} = \frac{\eta(\eta + 1) - S}{S} = \frac{2S - S}{S} = 1$$

We need to prove that Eq. (3) satisfies the diagnostic condition. Suppose the prior probability of X is uniform.

$$P(X=0) = P(X=1)$$

The assumption of a prior uniform distribution of X implies that we have not yet determined whether the student has mastered X. Similarly, we have

$$P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = \frac{\eta + 1}{2}P(D|X)$$

So, the transformation coefficient k is (η + 1)/2 if D ranges within {0, 1, 2, …, η}.
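The claim that k = (η + 1)/2 is constant in D can be double-checked numerically with exact rational arithmetic. This is an illustrative sketch (the helper names are hypothetical) that builds the CPT of Eq. (3), applies Bayes' rule under the uniform prior on X, and confirms that the ratio P(X = 1|D) / P(D|X = 1) does not depend on D:

```python
from fractions import Fraction

def cpt(D, eta):
    # Eq. (3): P(D | X=1) = D/S and P(D | X=0) = eta/S - D/S,
    # with S = eta * (eta + 1) / 2.
    S = Fraction(eta * (eta + 1), 2)
    return {1: Fraction(D) / S, 0: Fraction(eta) / S - Fraction(D) / S}

def posterior(D, eta):
    # P(X=1 | D) via Bayes' rule under the uniform prior P(X=0) = P(X=1).
    row = cpt(D, eta)
    return row[1] / (row[0] + row[1])

eta = 10
# The transformation coefficient k = P(X=1|D) / P(D|X=1) for every D > 0:
ks = {posterior(D, eta) / cpt(D, eta)[1] for D in range(1, eta + 1)}
# ks collapses to the single value (eta + 1)/2
```

Exact fractions avoid any floating-point doubt: the set `ks` contains exactly one element, (η + 1)/2.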

In the most general case, discrete evidence D ranges within an arbitrary integer interval {a, a + 1, a + 2, …, b}. In other words, D is a bounded integer variable whose lower bound and upper bound are a and b, respectively. Eq. (4) specifies the CPT of D, where D ∈ {a, a + 1, a + 2, …, b}.

$$P(D|X) = \begin{cases} \frac{D}{S} \text{if } X = 1\\ \frac{b+a}{S} - \frac{D}{S} \text{if } X = 0 \end{cases} \tag{4}$$

Where


$$\begin{aligned} D &\in \{a, a+1, a+2, \dots, b\} \\ S &= a + (a+1) + (a+2) + \dots + b = \frac{(b+a)(b-a+1)}{2} \end{aligned}$$

Note, P(D|X) = 0, ∀D ∉ {a, a + 1, a + 2, …, b}. According to the diagnostic condition, we need to prove the equality P(X|D) = kP(D|X), where

$$k = \frac{b - a + 1}{2}$$

Similarly, we have

$$P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = \frac{b-a+1}{2}P(D|X) = kP(D|X)$$

If evidence D is continuous in the real interval [a, b], where a and b are real numbers, Eq. (5) specifies the probability density function (PDF) of continuous evidence D ∈ [a, b]. The PDF p(D|X) replaces the CPT in the case of a continuous random variable.

$$p(D|X) = \begin{cases} \frac{2D}{b^2 - a^2} \text{if } X = 1 \\\\ \frac{2}{b - a} - \frac{2D}{b^2 - a^2} \text{if } X = 0 \end{cases}$$

where

$$D \in [a, b] \text{ where } a \text{ and } b \text{ are real numbers}$$

$$S = \int\_{a}^{b} \text{Dd}D = \frac{b^2 - a^2}{2} \tag{5}$$

As a convention, [a, b] is called the domain of the continuous evidence, which can be replaced by open or half-open intervals such as (a, b), (a, b], and [a, b). Of course, we have p(D|X) = 0, ∀D ∉ [a, b]. In the learning context, evidence D is often a test whose grade ranges within the real interval [a, b].

Functions p(D|X = 1) and p(D|X = 0) are valid PDFs due to

$$\int\_{D} p(D|X=1) \mathrm{d}D = \int\_{a}^{b} \frac{2D}{b^{2} - a^{2}} \mathrm{d}D = \frac{1}{b^{2} - a^{2}} \int\_{a}^{b} 2D \mathrm{d}D = 1.$$

$$\int\_{D} p(D|X=0) \mathrm{d}D = \frac{2}{b-a} \int\_{a}^{b} \mathrm{d}D - \frac{1}{b^{2} - a^{2}} \int\_{a}^{b} 2D \mathrm{d}D = 1.$$

According to the diagnostic condition, we need to prove the equality

$$P(X|D) = kp(D|X)$$

where,

$$k = \frac{b - a}{2}$$

When D is continuous, its probability is calculated in an ε-vicinity, where ε is a very small number. As usual, ε is the bias if D consists of measured values produced by equipment. The probability of D given X, where D + ε ∈ [a, b] and D – ε ∈ [a, b], is


$$P(D|X) = \int_{D-\varepsilon}^{D+\varepsilon} p(D|X)\,\mathrm{d}D = \begin{cases} \displaystyle \int_{D-\varepsilon}^{D+\varepsilon} \frac{2D}{b^{2} - a^{2}}\,\mathrm{d}D & \text{if } X = 1\\[6pt] \displaystyle \int_{D-\varepsilon}^{D+\varepsilon} \left(\frac{2}{b-a} - \frac{2D}{b^{2} - a^{2}}\right) \mathrm{d}D & \text{if } X = 0 \end{cases} = \begin{cases} \dfrac{4\varepsilon D}{b^{2} - a^{2}} & \text{if } X = 1\\[6pt] \dfrac{4\varepsilon}{b-a} - \dfrac{4\varepsilon D}{b^{2} - a^{2}} & \text{if } X = 0 \end{cases}$$

In fact, we have


$$P(X|D) = \frac{P(D|X)P(X)}{P(D|X=0)P(X=0) + P(D|X=1)P(X=1)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)}$$

$$\left(\text{due to Bayes' rule and the assumption } P(X=0) = P(X=1)\right)$$

$$= \frac{b-a}{4\varepsilon}P(D|X) = kp(D|X) \blacksquare$$
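The continuous-case algebra above can be sanity-checked numerically. The following sketch uses illustrative bounds a = 1, b = 5 (any a &lt; b would do) and a simple midpoint-rule integrator to confirm that both rows of the PDF in Eq. (5) integrate to 1 and that the transformation coefficient equals (b − a)/2 under the uniform prior on X:

```python
a, b = 1.0, 5.0  # illustrative bounds, not from the chapter

def pdf(D, X):
    # p(D | X) from Eq. (5).
    return 2 * D / (b * b - a * a) if X == 1 else 2 / (b - a) - 2 * D / (b * b - a * a)

def integrate(f, lo, hi, steps=100_000):
    # Midpoint rule; exact for linear integrands up to rounding.
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

total1 = integrate(lambda D: pdf(D, 1), a, b)  # should be ~1
total0 = integrate(lambda D: pdf(D, 0), a, b)  # should be ~1

# Since p(D|X=1) + p(D|X=0) = 2/(b - a) for every D, the ratio
# P(X=1|D) / p(D|X=1) is the constant k = (b - a)/2.
D = 2.7
k = (pdf(D, 1) / (pdf(D, 1) + pdf(D, 0))) / pdf(D, 1)
```

Because the two rows of Eq. (5) sum to the constant 2/(b − a), the check of k needs no integration at all, only a single evaluation point.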

In general, Eq. (6) summarizes CPT of evidence of single diagnostic relationship.

$$\begin{aligned} P(D|X) &= \begin{cases} & \frac{D}{S} \text{ if } X = 1\\ \frac{M}{S} - \frac{D}{S} \text{ if } X = 0 \end{cases} \\ k &= \frac{N}{2} \end{aligned}$$

Where,

$$N = \begin{cases} 2 \text{ if } D \in \{0, 1\} \\ \eta + 1 \text{ if } D \in \{0, 1, 2, \dots, \eta\} \\ b - a + 1 \text{ if } D \in \{a, a + 1, a + 2, \dots, b\} \\ b - a \text{ if } D \text{ continuous and } D \in [a, b] \end{cases}$$

$$M = \begin{cases} 1 \text{ if } D \in \{0, 1\} \\ \eta \text{ if } D \in \{0, 1, 2, \dots, \eta\} \\ b + a \text{ if } D \in \{a, a + 1, a + 2, \dots, b\} \\ b + a \text{ if } D \text{ continuous and } D \in [a, b] \end{cases}$$

$$S = \sum\_{D} D = \frac{NM}{2} = \begin{cases} 1 \text{ if } D \in \{0, 1\} \\ \frac{\eta(\eta+1)}{2} \text{ if } D \in \{0, 1, 2, \dots, \eta\} \\ \frac{(b+a)(b-a+1)}{2} \text{ if } D \in \{a, a+1, a+2, \dots, b\} \\ \frac{b^2 - a^2}{2} \text{ if } D \text{ continuous and } D \in [a, b] \end{cases} \tag{6}$$

In general, if the conditional probability P(D|X) is specified by Eq. (6), the diagnostic condition will be satisfied. Note that the CPT P(D|X) is the PDF p(D|X) in case of continuous evidence. The diagnostic relationship will be extended with more than one hypothesis. The next section will mention how to determine CPTs of a simple graph with one child node and many parent nodes based on X-gate inferences.
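The four cases collected in Eq. (6) can be packaged as a small lookup. The sketch below is illustrative only: the string tags (`'binary'`, `'grade'`, `'int'`, `'real'`) and the function name are hypothetical labels introduced here, not notation from the chapter:

```python
def diagnostic_constants(kind, *args):
    # Constants N, M, S of Eq. (6) and the transformation coefficient k = N/2
    # for the four supported evidence types.
    if kind == 'binary':                    # D in {0, 1}
        N, M = 2, 1
    elif kind == 'grade':                   # D in {0, 1, ..., eta}
        eta, = args
        N, M = eta + 1, eta
    elif kind == 'int':                     # D in {a, a+1, ..., b}
        a, b = args
        N, M = b - a + 1, b + a
    elif kind == 'real':                    # continuous D in [a, b]
        a, b = args
        N, M = b - a, b + a
    else:
        raise ValueError(kind)
    S = N * M / 2                           # S = NM/2, as in Eq. (6)
    k = N / 2                               # transformation coefficient
    return N, M, S, k
```

For instance, a grade test with η = 10 gives S = 55 and k = 5.5, matching the {0, …, η} case derived earlier, while a continuous grade on [1, 5] gives S = 12 and k = 2.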

### 3. X-gate inferences

Given a simple graph consisting of one child variable Y and n parent variables Xi, as shown in Figure 2, each relationship from Xi to Y is quantified by normalized weight wi where 0 ≤ wi ≤ 1. A large graph is an integration of many simple graphs. Figure 2 shows the DAG of a simple BN. As aforementioned, the essence of constructing simple BN is to convert graphic relationships of simple graph into CPTs of simple BN.

Child variable Y is called the target and parent variables Xi are called sources. Especially, these relationships adhere to X-gates such as AND-gate, OR-gate, and SIGMA-gate. These gates originate from logic gates [1]. For instance, AND-gate and OR-gate represent the prerequisite relationship, while SIGMA-gate represents the aggregation relationship. Therefore, relationship conversion is to determine the X-gate inference. The simple graph shown in Figure 2 is also called an X-gate graph or X-gate network. Please distinguish the letter "X" in the term "X-gate inference", which implies logic operators (AND, OR, XOR, etc.), from the "variable X".

All variables are binary and they represent events. The probability P(X) indicates that event X occurs. Thus, P(X) stands for P(X = 1) and P(not(X)) stands for P(X = 0). Eq. (7) specifies the simple NOT-gate inference.

Figure 2. Simple graph or simple network.

$$\begin{aligned} P\left(\text{not}(X)\right) &= P(\overline{X}) = P(X=0) = 1 - P(X=1) = 1 - P(X) \\ P\left(\text{not}\left(\text{not}(X)\right)\right) &= P(X) \end{aligned} \tag{7}$$

X-gate inference is based on three assumptions mentioned in Ref. ([2], p. 157), which are as follows



Figure 3 shows the extended X-gate network with accountable variables Ais ([2], p. 158).

The strength of each relationship from source Xi to target Y is quantified by a weight 0 ≤ wi ≤ 1. According to the assumption of inhibition, the probability that Ii = OFF is pi, which is set to be the weight wi.

$$p\_i = w\_i$$

If the notation wi is used, we focus on the strength of the relationship; if the notation pi is used, we focus on the probability of OFF inhibition. In probabilistic inference, pi is also the prior probability of Xi = 1. However, we will later assume each Xi has a uniform distribution. Eq. (8) specifies the probabilities of the inhibitions Iis and accountable variables Ais.

Figure 3. Extended X-gate network with accountable variables Ais.

$$\begin{aligned} P(I\_i = \text{OFF}) &= p\_i = w\_i \\ P(I\_i = \text{ON}) &= 1 - p\_i = 1 - w\_i \\ P(A\_i = \text{ON} | X\_i = 1, I\_i = \text{OFF}) &= 1 \\ P(A\_i = \text{ON} | X\_i = 1, I\_i = \text{ON}) &= 0 \\ P(A\_i = \text{ON} | X\_i = 0, I\_i = \text{OFF}) &= 0 \\ P(A\_i = \text{ON} | X\_i = 0, I\_i = \text{ON}) &= 0 \\ P(A\_i = \text{OFF} | X\_i = 1, I\_i = \text{OFF}) &= 0 \\ P(A\_i = \text{OFF} | X\_i = 1, I\_i = \text{ON}) &= 1 \\ P(A\_i = \text{OFF} | X\_i = 0, I\_i = \text{OFF}) &= 1 \\ P(A\_i = \text{OFF} | X\_i = 0, I\_i = \text{ON}) &= 1 \end{aligned} \tag{8}$$

According to Eq. (8), the entry P(Ai = ON | Xi = 1, Ii = OFF) = 1 means it is 100% certain that the accountable variable Ai is turned on if the source Xi is 1 and the inhibition Ii is turned off. Eq. (9) specifies the conditional probability of the accountable variables Ai (s) given the Xi (s), which is a corollary of Eq. (8).

$$\begin{aligned} P(A\_i = \text{ON} | X\_i = 1) &= p\_i = w\_i \\ P(A\_i = \text{ON} | X\_i = 0) &= 0 \\ P(A\_i = \text{OFF} | X\_i = 1) &= 1 - p\_i = 1 - w\_i \\ P(A\_i = \text{OFF} | X\_i = 0) &= 1 \end{aligned} \tag{9}$$
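Eq. (9) follows from Eq. (8) by marginalizing the inhibition Ii out of the CPT. A minimal numeric check (the class name InhibitionMarginal is ours, not the chapter's):

```java
public class InhibitionMarginal {
    // P(A_i=ON | X_i=1) = sum over I_i of P(A_i=ON | X_i=1, I_i) * P(I_i),
    // using the CPT entries of Eq. (8) with P(I_i=OFF) = p_i = w_i.
    public static double pOnGivenX1(double w) {
        double pIOff = w;        // P(I_i = OFF)
        double pIOn = 1.0 - w;   // P(I_i = ON)
        return 1.0 * pIOff + 0.0 * pIOn; // collapses to w, as Eq. (9) states
    }
}
```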

Appendix A1 is the proof of Eq. (9). As a definition, the set of all Xis is complete if and only if

$$P(X\_1 \cup X\_2 \cup \cdots \cup X\_n) = P(\Omega) = \sum\_{i=1}^n w\_i = 1$$

The set of all Xis is mutually exclusive if and only if

$$X\_i \cap X\_j = \varnothing, \quad \forall i \neq j$$

For each Xi, there is only one Ai and vice versa, which establishes a bijection between the Xis and the Ais. Obviously, the fact that the set of all Xis is complete is equivalent to the fact that the set of all Ais is complete. We will prove by contradiction that "the fact that the set of all Xi (s) is mutually exclusive is equivalent to the fact that the set of all Ai (s) is mutually exclusive." Suppose Xi ∩ Xj = ∅, ∀i ≠ j, but ∃i ≠ j: Ai ∩ Aj = B ≠ ∅. Let B⁻¹ ≠ ∅ be the preimage of B. Due to B ⊆ Ai and B ⊆ Aj, we have B⁻¹ ⊆ Xi and B⁻¹ ⊆ Xj, which causes Xi ∩ Xj ⊇ B⁻¹ ≠ ∅. There is a contradiction and so we have

$$X\_i \cap X\_j = \varnothing, \forall i \neq j \Rightarrow A\_i \cap A\_j = \varnothing, \forall i \neq j$$

By similar proof, we have

$$A\_i \cap A\_j = \varnothing, \forall i \neq j \Rightarrow X\_i \cap X\_j = \varnothing, \forall i \neq j \;\blacksquare$$

The extended X-gate network shown in Figure 3 is an interpretation of the simple network shown in Figure 2. Specifying the CPT of the simple network means determining the conditional probability P(Y = 1 | X1, X2,…, Xn) based on the extended X-gate network. The X-gate inference is represented by the probability P(Y = 1 | X1, X2,…, Xn) specified by Eq. (10) ([2], p. 159).

$$P(Y|X\_1, X\_2, \dots, X\_n) = \sum\_{A\_1, A\_2, \dots, A\_n} P(Y|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i) \tag{10}$$
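Eq. (10) can be checked by brute force: enumerate all 2^n combinations of the Ais and accumulate P(Y = 1 | A1,…, An) ∏ P(Ai | Xi). The sketch below (the class name XGateInference is our own) does this for the OR-gate condition discussed later, where P(Y = 1 | A1,…, An) = 1 whenever some Ai is ON:

```java
public class XGateInference {
    // Brute-force evaluation of Eq. (10) for the OR-gate condition:
    // sum over all arrangements of A_1..A_n of
    // P(Y=1|A_1..A_n) * prod_i P(A_i|X_i), with P(Y=1|...) = 1 iff some A_i = ON.
    public static double orByEq10(double[] p, int[] x) {
        int n = p.length;
        double sum = 0.0;
        for (int mask = 0; mask < (1 << n); mask++) { // bit i of mask: A_i = ON
            double prob = 1.0;
            for (int i = 0; i < n; i++) {
                boolean aOn = ((mask >> i) & 1) == 1;
                // Eq. (9): P(A_i=ON|X_i=1) = p_i, P(A_i=ON|X_i=0) = 0
                double pOn = (x[i] == 1) ? p[i] : 0.0;
                prob *= aOn ? pOn : 1.0 - pOn;
            }
            if (mask != 0) sum += prob; // OR-gate: Y = 1 iff some A_i = ON
        }
        return sum;
    }
}
```

For p = {0.5, 0.8} and X1 = X2 = 1 this yields 1 − (0.5)(0.2) = 0.9, agreeing with the closed-form OR-gate inference derived later.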

Appendix A2 is the proof of Eq. (10). Because Eq. (10) is complicated and involves arrangements of the Xi (s), it is necessary to introduce some mathematical notations. Given the set Ω = {X1, X2,…, Xn} where all variables are binary, Table 2 specifies binary arrangements of Ω.

Given Ω = {X1, X2,…, Xn} where |Ω| = n is the cardinality of Ω.


Let a(Ω) be an arrangement of Ω, which is a set of n instances {X1=x1, X2=x2,…, Xn=xn} where each xi is 1 or 0. The number of all a(Ω) is 2^|Ω|. For instance, given Ω = {X1, X2}, there are 2^2 = 4 arrangements as follows:

$$\begin{aligned} a(\Omega) &= \{X\_1 = 1, X\_2 = 1\}, \; a(\Omega) = \{X\_1 = 1, X\_2 = 0\}, \\ a(\Omega) &= \{X\_1 = 0, X\_2 = 1\}, \; a(\Omega) = \{X\_1 = 0, X\_2 = 0\}. \end{aligned}$$

Let a(Ω:{Xi}) be the arrangement of Ω with fixed Xi. The number of all a(Ω:{Xi}) is 2^(|Ω|−1). Similarly, a(Ω:{X1, X2, X3}) is an arrangement of Ω with fixed X1, X2, X3, and the number of all a(Ω:{X1, X2, X3}) is 2^(|Ω|−3).

Let c(Ω) and c(Ω:{Xi}) be the numbers of arrangements a(Ω) and a(Ω:{Xi}), respectively. Such c(Ω) and c(Ω:{Xi}) are called arrangement counters. As usual, the counters c(Ω) and c(Ω:{Xi}) are equal to 2^|Ω| and 2^(|Ω|−1), respectively, but they will vary in specific cases.
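The counters c(Ω) = 2^|Ω| and c(Ω:{Xi = 1}) = 2^(|Ω|−1) can be verified by enumerating bit masks. The helper below is our own illustration (not the chapter's code); it counts binary arrangements, optionally with one variable fixed to 1:

```java
public class ArrangementCount {
    // Counts binary arrangements of n variables; if fixedIndex >= 0,
    // only arrangements with X_{fixedIndex} = 1 are counted.
    public static int count(int n, int fixedIndex) {
        int c = 0;
        for (int mask = 0; mask < (1 << n); mask++) {
            if (fixedIndex < 0 || ((mask >> fixedIndex) & 1) == 1) c++;
        }
        return c;
    }
}
```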

Let $\sum\_a F\left(a(\Omega)\right)$ and $\prod\_a F\left(a(\Omega)\right)$ denote the sum and product of values generated from a function F acting on every a(Ω). The number of arrangements on which F acts is c(Ω).

Let x denote the X-gate operator, for instance, x = ⊙ for AND-gate, x = ⊕ for OR-gate, x = not ⊙ for NAND-gate, x = not ⊕ for NOR-gate, x = ⊗ for XOR-gate, x = not ⊗ for XNOR-gate, x = ⊎ for U-gate, and x = + for SIGMA-gate. Given an x-operator, let s(Ω:{Xi}) and s(Ω) be the sums of all P(X1 x X2 x … x Xn) through every arrangement of Ω with and without fixed Xi, respectively.

$$\begin{aligned} s(\Omega) &= \sum\_{a} P\left(X\_1 \,x\, X\_2 \,x \dots x\, X\_n | a(\Omega)\right) = \sum\_{a} P\left(Y = 1 | a(\Omega)\right) \\ s(\Omega : \{X\_i\}) &= \sum\_{a} P\left(X\_1 \,x\, X\_2 \,x \dots x\, X\_n | a(\Omega : \{X\_i\})\right) = \sum\_{a} P\left(Y = 1 | a(\Omega : \{X\_i\})\right) \end{aligned}$$

For example, s(Ω) and s(Ω:{Xi}) for OR-gate are:

$$\begin{aligned} s(\Omega) &= \sum\_{a} P\left(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n | a(\Omega)\right) \\ s(\Omega : \{X\_i\}) &= \sum\_{a} P\left(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n | a(\Omega : \{X\_i\})\right) \end{aligned}$$

Such s(Ω) and s(Ω:{Xi}) are called arrangement sums; they play the role of the acting function F. Note that Ω can be any set of binary variables.

Table 2. Binary arrangements.

It is not easy to produce all binary arrangements of Ω. Table 3 shows a code snippet written in the Java programming language for producing all such arrangements.

Each element of the list "arrangements" is a binary arrangement a(Ω) represented by an array of bits (0 and 1). The recursive method "create(int[] a, int i)" is the main one that generates arrangements. The method call "ArrangementGenerator.parse(2, n)" will list all possible binary arrangements.

Eq. (11) specifies the connection between s(Ω:{Xi = 1}) and s(Ω:{Xi = 0}), between c(Ω:{Xi = 1}) and c(Ω:{Xi = 0}).

$$\begin{aligned} s(\Omega : \{X\_i = 1\}) + s(\Omega : \{X\_i = 0\}) &= s(\Omega) \\ c(\Omega : \{X\_i = 1\}) + c(\Omega : \{X\_i = 0\}) &= c(\Omega) \end{aligned} \tag{11}$$

Eq. (11) is easy to derive because the set of all arrangements a(Ω:{Xi = 1}) is the complement of the set of all arrangements a(Ω:{Xi = 0}).

Let K be a set of Xis whose values are 1 and let L be a set of Xis whose values are 0. K and L are mutually complementary. Eq. (12) determines sets K and L.

$$\begin{cases} K = \{ i \colon X\_i = 1 \} \\ L = \{ i \colon X\_i = 0 \} \\ K \cap L = \varnothing \\ K \cup L = \{ 1, 2, \ldots, n \} \end{cases} \tag{12}$$

The AND-gate inference represents prerequisite relationship satisfying AND-gate condition specified by Eq. (13).

$$P(Y=1|A\_i=\text{OFF for some }i) = 0\tag{13}$$

From Eq. (10), we have

$$\begin{aligned} P(Y=1|X\_1, X\_2, \ldots, X\_n) &= \sum\_{A\_1, A\_2, \ldots, A\_n} P(Y=1|A\_1, A\_2, \ldots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i) \\ &= \prod\_{i=1}^n P(A\_i = \text{ON}|X\_i) \\ &\quad \left(\text{due to } P(Y=1|A\_i = \text{OFF for some } i) = 0\right) \\ &= \left(\prod\_{i \in K} P(A\_i = \text{ON}|X\_i = 1)\right) \left(\prod\_{i \notin K} P(A\_i = \text{ON}|X\_i = 0)\right) \\ &= \left(\prod\_{i \in K} p\_i\right) \left(\prod\_{i \notin K} 0\right) = \begin{cases} \prod\_{i=1}^n p\_i & \text{if all } X\_i(\text{s}) \text{ are } 1 \\ 0 & \text{if there exists at least one } X\_i = 0 \end{cases} \end{aligned}$$

(Due to Eq. (9))

```
import java.util.ArrayList;

public class ArrangementGenerator {
    private ArrayList<int[]> arrangements; // all generated arrangements
    private int n; // number of values each variable can take
    private int r; // number of variables

    private ArrangementGenerator(int n, int r) {
        this.n = n;
        this.r = r;
        this.arrangements = new ArrayList<int[]>();
    }

    // Recursively assigns every value 0..n-1 to position i,
    // storing a copy of the array once the last position is filled.
    private void create(int[] a, int i) {
        for (int j = 0; j < n; j++) {
            a[i] = j;
            if (i < r - 1)
                create(a, i + 1);
            else if (i == r - 1) {
                int[] b = new int[a.length];
                for (int k = 0; k < a.length; k++) b[k] = a[k];
                arrangements.add(b);
            }
        }
    }

    public int[] get(int i) {
        return arrangements.get(i);
    }

    public long size() {
        return arrangements.size();
    }

    // Lists all n^r arrangements; parse(2, n) yields all binary arrangements.
    public static ArrangementGenerator parse(int n, int r) {
        ArrangementGenerator arr = new ArrangementGenerator(n, r);
        int[] a = new int[r];
        for (int i = 0; i < r; i++) a[i] = -1;
        arr.create(a, 0);
        return arr;
    }
}
```

Table 3. Code snippet generating all binary arrangements.

In general, Eq. (14) specifies AND-gate inference.

$$P(X\_1 \odot X\_2 \odot \dots \odot X\_n) = P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \begin{cases} \prod\_{i=1}^n p\_i & \text{if all } X\_i(\text{s}) \text{ are } 1 \\ 0 & \text{if there exists at least one } X\_i = 0 \end{cases} \tag{14}$$

The AND-gate inference was also described in ([3], p. 33). Eq. (14) varies according to two cases whose arrangement counters are listed as follows:

L = ∅:

$$c(\Omega : \{X\_i = 1\}) = 1,\\ c(\Omega : \{X\_i = 0\}) = 0,\\ c(\Omega) = 1.$$

L ≠ ∅:

$$c(\Omega : \{X\_i = 1\}) = 2^{n-1} - 1,\\ c(\Omega : \{X\_i = 0\}) = 2^{n-1},\\ c(\Omega) = 2^n - 1.$$
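The two branches of Eq. (14) translate directly into code. A minimal sketch (the class name AndGate is our own, not from the chapter):

```java
public class AndGate {
    // Eq. (14): AND-gate inference.
    // p[i] is the weight p_i; x[i] is the source value X_{i+1} (0 or 1).
    public static double probability(double[] p, int[] x) {
        double product = 1.0;
        for (int i = 0; i < p.length; i++) {
            if (x[i] == 0) return 0.0; // at least one X_i = 0
            product *= p[i];
        }
        return product; // all X_i (s) are 1
    }
}
```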

The OR-gate inference represents prerequisite relationship satisfying OR-gate condition specified by Eq. (15) ([2], p. 157).

$$P(Y=1|A\_i=\text{ON for some }i) = 1\tag{15}$$

The OR-gate condition implies

$$P(Y=0|A\_i=\text{ON for some }i) = 0$$

From Eq. (10), we have ([2], p. 159)

$$\begin{aligned} P(Y=0|X\_1, X\_2, \dots, X\_n) &= \sum\_{A\_1, A\_2, \dots, A\_n} P(Y=0|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i) \\ &= \prod\_{i=1}^n P(A\_i = \text{OFF}|X\_i) \end{aligned}$$

$$\left(\text{due to } P(Y=0|A\_i = \text{ON for some } i) = 0\right)$$

$$= \left(\prod\_{i \in K} P(A\_i = \text{OFF}|X\_i = 1)\right) \left(\prod\_{i \notin K} P(A\_i = \text{OFF}|X\_i = 0)\right)$$

$$= \left(\prod\_{i \in K} (1 - p\_i)\right) \left(\prod\_{i \notin K} 1\right) = \begin{cases} \prod\_{i \in K} (1 - p\_i) & \text{if } K \neq \varnothing \\ 1 & \text{if } K = \varnothing \end{cases}$$

(Due to Eq. (9))

In general, Eq. (16) specifies OR-gate inference.

$$P(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n) = 1 - P(Y = 0 | X\_1, X\_2, \dots, X\_n) = \begin{cases} 1 - \prod\_{i \in K} (1 - p\_i) & \text{if } K \neq \varnothing \\ 0 & \text{if } K = \varnothing \end{cases} \tag{16}$$

$$P(Y = 0 | X\_1, X\_2, \dots, X\_n) = \begin{cases} \prod\_{i \in K} (1 - p\_i) & \text{if } K \neq \varnothing \\ 1 & \text{if } K = \varnothing \end{cases}$$

where K is the set of Xis whose values are 1. The OR-gate inference was mentioned in Refs. ([2], p. 158) and ([3], p. 20). Eq. (16) varies according to two cases whose arrangement counters are listed as follows

K ≠ ∅:

$$c(\Omega : \{X\_i = 1\}) = 2^{n-1}, \\ c(\Omega : \{X\_i = 0\}) = 2^{n-1} - 1, \\ c(\Omega) = 2^n - 1.$$

K = ∅:

$$c(\Omega : \{X\_i = 1\}) = 0, \\ c(\Omega : \{X\_i = 0\}) = 1, \\ c(\Omega) = 1.$$
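Eq. (16) likewise admits a direct sketch (the class name OrGate is our own); it returns 1 − ∏_{i∈K}(1 − pi) when K ≠ ∅ and 0 otherwise:

```java
public class OrGate {
    // Eq. (16): OR-gate inference (the noisy-OR form).
    // p[i] is the weight p_i; x[i] is the source value X_{i+1} (0 or 1).
    public static double probability(double[] p, int[] x) {
        double offProduct = 1.0;  // prod over i in K of (1 - p_i)
        boolean kEmpty = true;
        for (int i = 0; i < p.length; i++) {
            if (x[i] == 1) {
                kEmpty = false;
                offProduct *= 1.0 - p[i];
            }
        }
        return kEmpty ? 0.0 : 1.0 - offProduct;
    }
}
```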

According to De Morgan's rule with regard to AND-gate and OR-gate, we have

$$\begin{aligned} P\left(\text{not}(X\_1 \odot X\_2 \odot \dots \odot X\_n)\right) &= P\left(\left(\text{not}(X\_1)\right) \oplus \left(\text{not}(X\_2)\right) \oplus \dots \oplus \left(\text{not}(X\_n)\right)\right) \\ &= \begin{cases} 1 - \prod\_{i \in L} \left(1 - (1 - p\_i)\right) & \text{if } L \neq \varnothing \\ 0 & \text{if } L = \varnothing \end{cases} \end{aligned}$$

(Due to Eq. (16))


According to Eq. (14), we also have

$$P\left(\text{not}(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n)\right) = P\left(\left(\text{not}(X\_1)\right) \odot \left(\text{not}(X\_2)\right) \odot \dots \odot \left(\text{not}(X\_n)\right)\right)$$

$$= \begin{cases} \prod\_{i=1}^n P\left(\text{not}(X\_i)\right) \text{ if all not } (X\_i)(s) \text{ are } 1\\ 0 \text{ if there exists at least one not } (X\_i) = 0 \end{cases}$$

$$= \begin{cases} \prod\_{i=1}^n (1 - p\_i) \text{ if all } X\_i(s) \text{ are } 0\\ 0 \text{ if there exists at least one } X\_i = 1 \end{cases}$$

In general, Eq. (17) specifies the NAND-gate inference and the NOR-gate inference derived from the AND-gate and the OR-gate.

$$P\left(\text{not}(X\_1 \odot X\_2 \odot \dots \odot X\_n)\right) = \begin{cases} 1 - \prod\_{i \in L} p\_i & \text{if } L \neq \varnothing \\ 0 & \text{if } L = \varnothing \end{cases} \tag{17}$$

$$P\left(\text{not}(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n)\right) = \begin{cases} \prod\_{i=1}^n (1 - p\_i) & \text{if } K = \varnothing \\ 0 & \text{if } K \neq \varnothing \end{cases}$$

where K and L are the sets of Xis whose values are 1 and 0, respectively.
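Both branches of Eq. (17) can be sketched as follows (the class name NandNorGate is our own):

```java
public class NandNorGate {
    // NAND per Eq. (17): 1 - prod over i in L of p_i if L != empty, else 0.
    public static double nand(double[] p, int[] x) {
        double prod = 1.0;
        boolean lEmpty = true;
        for (int i = 0; i < p.length; i++) {
            if (x[i] == 0) { lEmpty = false; prod *= p[i]; }
        }
        return lEmpty ? 0.0 : 1.0 - prod;
    }

    // NOR per Eq. (17): prod of (1 - p_i) if all sources are 0, else 0.
    public static double nor(double[] p, int[] x) {
        double prod = 1.0;
        for (int i = 0; i < p.length; i++) {
            if (x[i] == 1) return 0.0; // K != empty
            prod *= 1.0 - p[i];
        }
        return prod;
    }
}
```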

Suppose the number of sources Xis is even. Let O be the set of Xis whose indices are odd, and let O1 and O2 be the subsets of O in which all Xis are 1 and 0, respectively. Let E be the set of Xis whose indices are even, and let E1 and E2 be the subsets of E in which all Xis are 1 and 0, respectively.

$$\begin{cases} E = \{2, 4, 6, \dots, n\} \\ E\_1 \subseteq E \\ E\_2 \subseteq E \\ E\_1 \cup E\_2 = E \\ E\_1 \cap E\_2 = \varnothing \\ X\_i = 1, \forall i \in E\_1 \\ X\_i = 0, \forall i \in E\_2 \end{cases} \quad \text{and} \quad \begin{cases} O = \{1, 3, 5, \dots, n - 1\} \\ O\_1 \subseteq O \\ O\_2 \subseteq O \\ O\_1 \cup O\_2 = O \\ O\_1 \cap O\_2 = \varnothing \\ X\_i = 1, \forall i \in O\_1 \\ X\_i = 0, \forall i \in O\_2 \end{cases}$$

Thus, O<sup>1</sup> and E<sup>1</sup> are the subsets of K. Sources Xis and target Y follow XOR-gate if one of two XOR-gate conditions specified by Eq. (18) is satisfied.

$$\begin{aligned} P\left(Y=1 \,\middle|\, \begin{cases}A\_i = \text{ON for } i \in O\\A\_i = \text{OFF for } i \notin O\end{cases}\right) &= P(Y=1|A\_1 = \text{ON}, A\_2 = \text{OFF}, \dots, A\_{n-1} = \text{ON}, A\_n = \text{OFF}) = 1 \\ P\left(Y=1 \,\middle|\, \begin{cases}A\_i = \text{ON for } i \in E\\A\_i = \text{OFF for } i \notin E\end{cases}\right) &= P(Y=1|A\_1 = \text{OFF}, A\_2 = \text{ON}, \dots, A\_{n-1} = \text{OFF}, A\_n = \text{ON}) = 1 \end{aligned} \tag{18}$$

From Eq. (10), we have

$$P(Y=1|X\_1, X\_2, \dots, X\_n) = \sum\_{A\_1, A\_2, \dots, A\_n} P(Y=1|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i)$$

If neither XOR-gate condition is satisfied, then

$$P(Y = 1 | X\_1, X\_2, \dots, X\_n) = 0$$

If the first XOR-gate condition is satisfied, we have

$$\begin{aligned} P(Y=1|\mathbf{X}\_1, \mathbf{X}\_2, \dots, \mathbf{X}\_n) \\ &= P(Y=1|A\_1=\text{ON}, A\_2=\text{OFF}, \dots, A\_{n-1}=\text{ON}, A\_n=\text{OFF}) \prod\_{i=1}^n P(A\_i|\mathbf{X}\_i) \\ &= \left(\prod\_{i \in O} P(A\_i=\text{ON}|\mathbf{X}\_i)\right) \left(\prod\_{i \in E} P(A\_i=\text{OFF}|\mathbf{X}\_i)\right) \end{aligned}$$

We have

$$\begin{aligned} \prod\_{i \in O} P(A\_i = \text{ON} | X\_i) &= \left( \prod\_{i \in O\_1} P(A\_i = \text{ON} | X\_i = 1) \right) \left( \prod\_{i \in O\_2} P(A\_i = \text{ON} | X\_i = 0) \right) \\ &= \left( \prod\_{i \in O\_1} p\_i \right) \left( \prod\_{i \in O\_2} 0 \right) = \begin{cases} \prod\_{i \in O\_1} p\_i & \text{if } O\_2 = \varnothing \\ 0 & \text{if } O\_2 \neq \varnothing \end{cases} \end{aligned}$$

(Due to Eq. (9))

We also have


$$\begin{aligned} \prod\_{i \in E} P(A\_i = \text{OFF} | X\_i) &= \left(\prod\_{i \in E\_1} P(A\_i = \text{OFF} | X\_i = 1)\right) \left(\prod\_{i \in E\_2} P(A\_i = \text{OFF} | X\_i = 0)\right) \\ &= \left(\prod\_{i \in E\_1} (1 - p\_i)\right) \left(\prod\_{i \in E\_2} 1\right) = \begin{cases} \prod\_{i \in E\_1} (1 - p\_i) & \text{if } E\_1 \neq \varnothing \\ 1 & \text{if } E\_1 = \varnothing \end{cases} \end{aligned}$$

(Due to Eq. (9))

Given the first XOR-gate condition, it implies

$$P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \left(\prod\_{i \in O} P(A\_i = \text{ON} | X\_i)\right) \left(\prod\_{i \in E} P(A\_i = \text{OFF} | X\_i)\right) = \begin{cases} \left(\prod\_{i \in O\_1} p\_i\right) \left(\prod\_{i \in E\_1} (1 - p\_i)\right) & \text{if } O\_2 = \varnothing \text{ and } E\_1 \neq \varnothing \\ \prod\_{i \in O\_1} p\_i & \text{if } O\_2 = \varnothing \text{ and } E\_1 = \varnothing \\ 0 & \text{if } O\_2 \neq \varnothing \end{cases}$$

Similarly, given the second XOR-gate condition, we have

$$P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \left(\prod\_{i \in E} P(A\_i = \text{ON} | X\_i)\right) \left(\prod\_{i \in O} P(A\_i = \text{OFF} | X\_i)\right) = \begin{cases} \left(\prod\_{i \in E\_1} p\_i\right) \left(\prod\_{i \in O\_1} (1 - p\_i)\right) & \text{if } E\_2 = \varnothing \text{ and } O\_1 \neq \varnothing \\ \prod\_{i \in E\_1} p\_i & \text{if } E\_2 = \varnothing \text{ and } O\_1 = \varnothing \\ 0 & \text{if } E\_2 \neq \varnothing \end{cases}$$

If one of the XOR-gate conditions is satisfied, then

$$\begin{aligned} P(Y=1|\mathbf{X}\_1, \mathbf{X}\_2, \dots, \mathbf{X}\_n) \\ = \left(\prod\_{i \in O} P(A\_i = \text{ON}|\mathbf{X}\_i)\right) \left(\prod\_{i \in E} P(A\_i = \text{OFF}|\mathbf{X}\_i)\right) + \left(\prod\_{i \in E} P(A\_i = \text{ON}|\mathbf{X}\_i)\right) \left(\prod\_{i \in O} P(A\_i = \text{OFF}|\mathbf{X}\_i)\right) \end{aligned}$$

This implies Eq. (19), which specifies the XOR-gate inference.

$$P(X\_1 \otimes X\_2 \otimes \dots \otimes X\_n) = P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \begin{cases} \left(\prod\_{i \in O\_1} p\_i\right) \left(\prod\_{i \in E\_1} (1 - p\_i)\right) + \left(\prod\_{i \in E\_1} p\_i\right) \left(\prod\_{i \in O\_1} (1 - p\_i)\right) & \text{if } O\_2 = \varnothing \text{ and } E\_2 = \varnothing \\ \left(\prod\_{i \in O\_1} p\_i\right) \left(\prod\_{i \in E\_1} (1 - p\_i)\right) & \text{if } O\_2 = \varnothing \text{ and } E\_1 \neq \varnothing \text{ and } E\_2 \neq \varnothing \\ \prod\_{i \in O\_1} p\_i & \text{if } O\_2 = \varnothing \text{ and } E\_1 = \varnothing \\ \left(\prod\_{i \in E\_1} p\_i\right) \left(\prod\_{i \in O\_1} (1 - p\_i)\right) & \text{if } E\_2 = \varnothing \text{ and } O\_1 \neq \varnothing \text{ and } O\_2 \neq \varnothing \\ \prod\_{i \in E\_1} p\_i & \text{if } E\_2 = \varnothing \text{ and } O\_1 = \varnothing \\ 0 & \text{if } O\_2 \neq \varnothing \text{ and } E\_2 \neq \varnothing \\ 0 & \text{if } n < 2 \text{ or } n \text{ is odd} \end{cases} \tag{19}$$

where

$$\begin{cases} O = \{1, 3, 5, \dots, n - 1\} \\ O\_1 \subseteq O \\ O\_2 \subseteq O \\ O\_1 \cup O\_2 = O \\ O\_1 \cap O\_2 = \varnothing \\ X\_i = 1, \forall i \in O\_1 \\ X\_i = 0, \forall i \in O\_2 \end{cases} \quad \text{and} \quad \begin{cases} E = \{2, 4, 6, \dots, n\} \\ E\_1 \subseteq E \\ E\_2 \subseteq E \\ E\_1 \cup E\_2 = E \\ E\_1 \cap E\_2 = \varnothing \\ X\_i = 1, \forall i \in E\_1 \\ X\_i = 0, \forall i \in E\_2 \end{cases}$$

Given n ≥ 2 and n is even, Eq. (19) varies according to six cases whose arrangement counters are listed as follows


O₂ = ∅ and E₂ = ∅:

$$c(\Omega : \{X\_i = 1\}) = 1, \; c(\Omega : \{X\_i = 0\}) = 0, \; c(\Omega) = 1.$$

O₂ = ∅ and E₁ ≠ ∅ and E₂ ≠ ∅:

$$c(\Omega : \{X\_i = 1\}) = 2^{n/2} - 2, \; c(\Omega : \{X\_i = 0\}) = 0, \; c(\Omega) = 2^{n/2} - 2.$$

O₂ = ∅ and E₁ = ∅:

$$c(\Omega : \{X\_i = 1\}) = 1, \; c(\Omega : \{X\_i = 0\}) = 0, \; c(\Omega) = 1.$$

E₂ = ∅ and O₁ ≠ ∅ and O₂ ≠ ∅:

$$c(\Omega : \{X\_i = 1\}) = 2^{n/2-1} - 1, \; c(\Omega : \{X\_i = 0\}) = 2^{n/2-1} - 1, \; c(\Omega) = 2^{n/2} - 2.$$

E₂ = ∅ and O₁ = ∅:

$$c(\Omega : \{X\_i = 1\}) = 0, \; c(\Omega : \{X\_i = 0\}) = 1, \; c(\Omega) = 1.$$

O₂ ≠ ∅ and E₂ ≠ ∅:

$$c(\Omega : \{X\_i = 1\}) = \left(2^{n/2-1} - 1\right) \left(2^{n/2} - 1\right), \; c(\Omega : \{X\_i = 0\}) = 2^{n/2-1} \left(2^{n/2} - 1\right), \; c(\Omega) = \left(2^{n/2} - 1\right)^2.$$
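Assuming n is even and array slot i holds X_{i+1} (so odd 1-based indices sit in even slots), Eq. (19) can be sketched as follows (the class name XorGate is our own):

```java
public class XorGate {
    // Eq. (19): XOR-gate inference; p[i] is p_{i+1}, x[i] is X_{i+1}.
    public static double probability(double[] p, int[] x) {
        int n = p.length;
        if (n < 2 || n % 2 != 0) return 0.0; // Eq. (19): n < 2 or n odd
        boolean o2Empty = true, e2Empty = true;
        for (int i = 0; i < n; i++) {
            boolean oddIndex = (i % 2 == 0); // X_{i+1} has odd 1-based index
            if (x[i] == 0) {
                if (oddIndex) o2Empty = false; else e2Empty = false;
            }
        }
        double term1 = 0.0; // first XOR-gate condition (all odd sources ON)
        if (o2Empty) {
            term1 = 1.0;
            for (int i = 0; i < n; i++) {
                if (i % 2 == 0) term1 *= p[i];           // i in O_1 (= O here)
                else if (x[i] == 1) term1 *= 1.0 - p[i]; // i in E_1
            }
        }
        double term2 = 0.0; // second XOR-gate condition (all even sources ON)
        if (e2Empty) {
            term2 = 1.0;
            for (int i = 0; i < n; i++) {
                if (i % 2 == 1) term2 *= p[i];           // i in E_1 (= E here)
                else if (x[i] == 1) term2 *= 1.0 - p[i]; // i in O_1
            }
        }
        return term1 + term2;
    }
}
```

For n = 2, p = {0.6, 0.7}, and X1 = X2 = 1 this gives 0.6·0.3 + 0.7·0.4 = 0.46, matching the first case of Eq. (19).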

Suppose the number of sources Xis is even. According to the XNOR-gate [1], the output is on if all inputs have the same value, 1 (or 0). Sources Xi (s) and target Y follow XNOR-gate if one of the two XNOR-gate conditions specified by Eq. (20) is satisfied.

$$\begin{array}{l}P(Y=1|A\_i=\text{ON}, \forall i) = 1\\P(Y=1|A\_i=\text{OFF}, \forall i) = 1\end{array}\tag{20}$$

From Eq. (10), we have


$$P(Y=1|X\_1, X\_2, \dots, X\_n) = \sum\_{A\_1, A\_2, \dots, A\_n} P(Y=1|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i)$$

If neither XNOR-gate condition is satisfied, then

$$P(Y = 1 | X\_1, X\_2, \dots, X\_n) = 0$$

If Ai = ON for all i, we have

$$\begin{aligned} P(Y=1|X\_1, X\_2, \dots, X\_n) &= P(Y=1|A\_i = \text{ON}, \forall i) \prod\_{i=1}^n P(A\_i = \text{ON}|X\_i) \\ &= \prod\_{i=1}^n P(A\_i = \text{ON}|X\_i) = \begin{cases} \prod\_{i=1}^n p\_i & \text{if } L = \varnothing \\ 0 & \text{if } L \neq \varnothing \end{cases} \end{aligned}$$

(Please see similar proof in AND-gate inference)

If Ai = OFF for all i, we have

$$P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \prod\_{i=1}^n P(A\_i = \text{OFF} | X\_i) = \begin{cases} \prod\_{i \in K} (1 - p\_i) & \text{if } K \neq \emptyset \\ 1 & \text{if } K = \emptyset \end{cases}$$

(Please see similar proof in OR-gate inference)

If one of the XNOR-gate conditions is satisfied, then

$$P(Y=1|X\_1, X\_2, \dots, X\_n) = \prod\_{i=1}^n P(A\_i = \text{ON}|X\_i) + \prod\_{i=1}^n P(A\_i = \text{OFF}|X\_i)$$

This implies Eq. (21), which specifies the XNOR-gate inference.

$$P\left(\text{not}(X\_1 \otimes X\_2 \otimes \dots \otimes X\_n)\right) = P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \begin{cases} \prod\_{i=1}^n p\_i + \prod\_{i=1}^n (1 - p\_i) & \text{if } L = \emptyset \\ \prod\_{i \in K} (1 - p\_i) & \text{if } L \neq \emptyset \text{ and } K \neq \emptyset \\ 1 & \text{if } L \neq \emptyset \text{ and } K = \emptyset \end{cases} \tag{21}$$

where K and L are the sets of X\_i(s) whose values are 1 and 0, respectively. Eq. (21) varies according to three cases, whose arrangement counters are listed as follows:

$$L = \emptyset$$

$$c(\Omega : \{X\_i = 1\}) = 1, c(\Omega : \{X\_i = 0\}) = 0, c(\Omega) = 1.$$

$$L \neq \emptyset \text{ and } K \neq \emptyset$$

$$c(\Omega : \{X\_i = 1\}) = 2^{n-1} - 1, c(\Omega : \{X\_i = 0\}) = 2^{n-1} - 1, c(\Omega) = 2^n - 2.$$

$$L \neq \emptyset \text{ and } K = \emptyset$$

$$c(\Omega : \{X\_i = 1\}) = 0, c(\Omega : \{X\_i = 0\}) = 1, c(\Omega) = 1.$$
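The XNOR-gate inference above can be cross-checked numerically. The sketch below (plain Python; the function names are mine, and the simple-graph CPT of Eq. (9) is assumed, i.e., $P(A\_i = \text{ON}|X\_i = 0) = 0$) compares the closed form of Eq. (21) against the brute-force sum of Eq. (10) over all arrangements of the $A\_i$(s):

```python
from itertools import product

def xnor_prob_enum(x, p):
    """P(Y=1|X) by brute force: sum over all arrangements of A (Eq. 10)."""
    total = 0.0
    for a in product([0, 1], repeat=len(x)):  # a_i = 1 means A_i = ON
        w = 1.0
        for ai, xi, pi in zip(a, x, p):
            # Simple-graph CPT (Eq. 9): A_i is never ON when X_i = 0
            w *= (pi if ai else 1 - pi) if xi else (0.0 if ai else 1.0)
        y = 1 if all(a) or not any(a) else 0  # XNOR-gate condition: all ON or all OFF
        total += y * w
    return total

def xnor_prob_eq21(x, p):
    """Closed form of Eq. (21), split by the sets K (X_i = 1) and L (X_i = 0)."""
    K = [i for i, xi in enumerate(x) if xi == 1]
    L = [i for i, xi in enumerate(x) if xi == 0]
    if not L:  # L empty: all inputs are 1
        prod_p, prod_q = 1.0, 1.0
        for i in K:
            prod_p *= p[i]
            prod_q *= 1 - p[i]
        return prod_p + prod_q
    if not K:  # K empty: all inputs are 0
        return 1.0
    r = 1.0
    for i in K:  # mixed case
        r *= 1 - p[i]
    return r

p = [0.7, 0.4, 0.9]
for x in product([0, 1], repeat=3):
    assert abs(xnor_prob_enum(x, p) - xnor_prob_eq21(x, p)) < 1e-12
```

The loop confirms that the three cases of Eq. (21) agree with the full enumeration for every input arrangement.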

Let U be a set of indices such that A\_i = ON for i ∈ U, and let α ≥ 0 and β ≥ 0 be predefined numbers. The U-gate inference is defined based on α, β, and the cardinality of U. Table 4 specifies four common U-gate conditions.

Note that the U-gate condition on |U| can be arbitrary; it is relevant only to the A\_i(s) (ON or OFF) and the way the A\_i(s) are combined. For example, AND-gate and OR-gate are specific cases of U-gate with |U| = n and |U| ≥ 1, respectively. XOR-gate and XNOR-gate are also specific cases of U-gate with specific conditions on the A\_i(s). However, it must be assured that at least one combination of A\_i(s) satisfies the predefined U-gate condition, so that the U-gate probability is not identically 0. In this research, U-gate is the most general nonlinear gate, whose probability contains products of weights (see Table 5). Later on, we will research a so-called SIGMA-gate that contains only a linear combination of weights (sum of weights, see Eq. (23)). Shortly, each X-gate is a pattern owning a particular X-gate inference, namely the X-gate probability P(X\_1 ∘ X\_2 ∘ … ∘ X\_n). Each X-gate inference is based on particular X-gate condition(s) relevant only to the variables A\_i(s).

From Eq. (10), we have

$$P(Y=1|X\_1, X\_2, \dots, X\_n) = \sum\_{A\_1, A\_2, \dots, A\_n} P(Y=1|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i)$$

Let $\mathcal{U}$ be the set of all possible U(s); we have


Table 4. U-gate conditions: |U| = α, |U| ≥ α, |U| ≤ β, and α ≤ |U| ≤ β.

116 Bayesian Inference


$$\begin{aligned} P(Y=1|X\_1, X\_2, \dots, X\_n) &= \sum\_{U \in \mathcal{U}} P(Y=1|A\_1, A\_2, \dots, A\_n) \prod\_{i=1}^n P(A\_i|X\_i) \\ &= \sum\_{U \in \mathcal{U}} \prod\_{i \in U} P(A\_i = \text{ON}|X\_i) \prod\_{j \notin U} P(A\_j = \text{OFF}|X\_j) \end{aligned}$$

If $X\_i = 0$ for some $i \in U$, the term of U vanishes because $P(A\_i = \text{ON}|X\_i = 0) = 0$:

$$\prod\_{i \in U} P(A\_i = \text{ON}|X\_i) \prod\_{j \notin U} P(A\_j = \text{OFF}|X\_j) = 0$$

This implies all sets U (s) must be subsets of K. The U-gate probability is rewritten as follows

$$\begin{split}P(Y=1|X\_1,X\_2,\dots,X\_n)&=\sum\_{U\in\mathcal{U}}\prod\_{i\in U}P(A\_i=\text{ON}|X\_i=1)\prod\_{j\notin U}P(A\_j=\text{OFF}|X\_j)\\&=\sum\_{U\in\mathcal{U}}\prod\_{i\in U}p\_i\prod\_{j\notin U}P(A\_j=\text{OFF}|X\_j)\\&=\sum\_{U\in\mathcal{U}}\prod\_{i\in U}p\_i\prod\_{j\in K\setminus U}P(A\_j=\text{OFF}|X\_j=1)\prod\_{j\notin K}P(A\_j=\text{OFF}|X\_j=0)\\&=\sum\_{U\in\mathcal{U}}\prod\_{i\in U}p\_i\prod\_{j\in K\setminus U}(1-p\_j)\prod\_{j\notin K}1=\sum\_{U\in\mathcal{U}}\prod\_{i\in U}p\_i\prod\_{j\in K\setminus U}(1-p\_j)\end{split}$$

(Due to Eq. (9))
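As an illustration, the sum above can be evaluated by enumerating the subsets $U \subseteq K$ whose cardinality satisfies a given condition. This is a sketch in plain Python (the function name and the callable condition argument are my own devices, not the author's notation):

```python
from itertools import combinations

def u_gate_prob(x, p, cond):
    """U-gate probability: sum over subsets U of K whose cardinality satisfies cond."""
    K = [i for i, xi in enumerate(x) if xi == 1]
    total = 0.0
    for r in range(len(K) + 1):
        if not cond(r):
            continue
        for U in combinations(K, r):
            term = 1.0
            for i in K:
                term *= p[i] if i in U else 1 - p[i]  # p_i inside U, (1 - p_i) on K \ U
            total += term
    return total

p = [0.7, 0.4, 0.9]
x = (1, 1, 1)
# AND-gate as a U-gate with |U| = n, OR-gate as a U-gate with |U| >= 1
and_p = u_gate_prob(x, p, lambda r: r == len(p))
or_p = u_gate_prob(x, p, lambda r: r >= 1)
assert abs(and_p - 0.7 * 0.4 * 0.9) < 1e-12
assert abs(or_p - (1 - 0.3 * 0.6 * 0.1)) < 1e-12
```

The two assertions illustrate the remark above that AND-gate and OR-gate are the special U-gate cases |U| = n and |U| ≥ 1.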

Let PU be the U-gate probability; Table 5 specifies U-gate inference and cardinality of U where U is the set of subsets (U) of K.

Note that the notation $\binom{n}{j}$ denotes the number of combinations of j elements taken from n elements.

$$
\binom{n}{j} = \frac{n!}{j!(n-j)!}
$$

Arrangement counters relevant to U-gate inference and the set K are listed as follows



Table 5. U-gate inference.

Converting Graphic Relationships into Conditional Probabilities in Bayesian Network http://dx.doi.org/10.5772/intechopen.70057 119

$$|K| = 0$$

$$c(\Omega : \{X\_i = 1\}) = 0, c(\Omega : \{X\_i = 0\}) = 1, c(\Omega) = 1.$$

$$|K| = 1$$

$$c(\Omega : \{X\_i = 1\}) = 1, c(\Omega : \{X\_i = 0\}) = 0, c(\Omega) = 1.$$

$$|K| = \alpha \text{ and } \alpha > 0$$

$$c(\Omega : \{X\_i = 1\}) = \binom{n-1}{\alpha-1}, c(\Omega : \{X\_i = 0\}) = \binom{n-1}{\alpha}, c(\Omega) = \binom{n}{\alpha}.$$

$$|K| \le \alpha \text{ and } \alpha > 0$$

$$c(\Omega : \{X\_i = 1\}) = \sum\_{j=1}^{\alpha} \binom{n-1}{j-1}, c(\Omega : \{X\_i = 0\}) = \sum\_{j=0}^{\alpha} \binom{n-1}{j}, c(\Omega) = \sum\_{j=0}^{\alpha} \binom{n}{j}.$$

$$c(\Omega : \{X\_i = 1\}) = \sum\_{j=1}^n \binom{n-1}{j-1}, c(\Omega : \{X\_i = 0\}) = \sum\_{j=0}^n \binom{n-1}{j}, c(\Omega) = \sum\_{j=0}^n \binom{n}{j}.$$

$$|K| \ge \alpha \text{ and } \alpha > 0$$

$$c(\Omega : \{X\_i = 1\}) = \sum\_{j=\alpha}^n \binom{n-1}{j-1}, c(\Omega : \{X\_i = 0\}) = \sum\_{j=\alpha}^{n-1} \binom{n-1}{j}, c(\Omega) = \sum\_{j=\alpha}^n \binom{n}{j}.$$

The SIGMA-gate inference [9] represents an aggregation relationship satisfying the SIGMA-gate condition specified by Eq. (22).

$$P(Y) = P\left(\sum\_{i=1}^{n} A\_i\right)$$

where the set of Ai is complete and mutually exclusive

$$\begin{aligned} \sum\_{i=1}^{n} w\_i &= 1\\ A\_i \cap A\_j &= \emptyset, \forall i \neq j \end{aligned} \tag{22}$$

The sigma sum $\sum\_{i=1}^{n} A\_i$ indicates that Y is the exclusive union of the A\_i(s); here it does not express arithmetical addition.

$$Y = \sum\_{i=1}^{n} A\_i = \bigcup\_{i=1}^{n} A\_i$$

Let $S\_U = \sum\_{U \in \mathcal{U}} \prod\_{i \in U} p\_i \prod\_{j \in K \setminus U} (1 - p\_j)$ and let $P\_U = P(X\_1 \uplus X\_2 \uplus \dots \uplus X\_n) = P(Y = 1|X\_1, X\_2, \dots, X\_n)$ denote the U-gate probability. As a convention, $\prod\_{i \in U} p\_i = 1$ if $|U| = 0$ and $\prod\_{j \in K \setminus U} (1 - p\_j) = 1$ if $|U| = |K|$. The largest cardinality of $\mathcal{U}$ is $|\mathcal{U}| = 2^{|K|}$, and the case $|U| \ge 0$ is the same as the case $|U| \le n$. In Table 5, for each U-gate condition ($|U| = n$, $|U| \ge 1$, $|U| = \alpha$, $|U| \ge \alpha$, $|U| \le \beta$, and $\alpha \le |U| \le \beta$ with $0 < \alpha < n$ and $0 < \beta < n$), the corresponding $P\_U$ and $|\mathcal{U}|$ are specified.

This implies

$$P(Y) = P\left(\sum\_{i=1}^{n} A\_i\right) = P\left(\bigcup\_{i=1}^{n} A\_i\right) = \sum\_{i=1}^{n} P(A\_i)$$

The sigma sum $\sum\_{i=1}^{n} P(A\_i)$ now expresses arithmetical addition of the probabilities $P(A\_i)$. SIGMA-gate inference requires that the set of A\_i(s) is complete and mutually exclusive, which means that the set of X\_i(s) is complete and mutually exclusive too. The SIGMA-gate probability is [9]

$$\begin{aligned} P(Y|X\_1, X\_2, \dots, X\_n) &= P\left(\sum\_{i=1}^n A\_i \, \middle|\, X\_1, X\_2, \dots, X\_n\right) \\ (\text{due to SIGMA} - \text{gate condition}) \\ &= \sum\_{i=1}^n P(A\_i | X\_1, X\_2, \dots, X\_n) \\ \left(\text{because } A\_i(\text{s}) \text{ are mutually exclusive}\right) \\ &= \sum\_{i=1}^n P(A\_i | X\_i) \\ (\text{because } A\_i \text{ is only dependent on } X\_i) \end{aligned}$$


It implies

$$\begin{aligned} P(Y = 1 | \mathbf{X}\_1, \mathbf{X}\_2, \dots, \mathbf{X}\_n) \\ = \sum\_{i=1}^n P(A\_i = \text{ON} | \mathbf{X}\_i) \\ = \left( \sum\_{i \in K} P(A\_i = \text{ON} | \mathbf{X}\_i = 1) \right) + \left( \sum\_{i \notin K} P(A\_i = \text{ON} | \mathbf{X}\_i = 0) \right) \\ = \sum\_{i \in K} w\_i + \sum\_{i \notin K} 0 = \sum\_{i \in K} w\_i \end{aligned}$$

(Due to Eq. (9))

In general, Eq. (23) specifies the theorem of SIGMA-gate inference [9]. The base of this theorem was mentioned by Millán and Pérez-de-la-Cruz ([4], pp. 292-295).

$$\begin{aligned} P(X\_1 + X\_2 + \dots + X\_n) &= P\left(\sum\_{i=1}^n X\_i\right) = P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \sum\_{i \in K} w\_i\\ P(Y = 0 | X\_1, X\_2, \dots, X\_n) &= 1 - \sum\_{i \in K} w\_i = \sum\_{i \in L} w\_i \end{aligned}$$

where the set of Xis is complete and mutually exclusive.

$$\begin{aligned} \sum\_{i=1}^{n} w\_i &= 1\\ X\_i \cap X\_j &= \emptyset, \forall i \neq j \end{aligned} \tag{23}$$

The arrangement counters of SIGMA-gate inference are $c(\Omega:\{X\_i = 1\}) = c(\Omega:\{X\_i = 0\}) = 2^{n-1}$ and $c(\Omega) = 2^n$.
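Eq. (23) is simple enough to check with a minimal numeric sketch (plain Python; the function name is mine):

```python
def sigma_gate_prob(x, w):
    """SIGMA-gate probability (Eq. 23): sum of w_i over indices with X_i = 1."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must be complete"
    return sum(wi for xi, wi in zip(x, w) if xi == 1)

w = [0.2, 0.5, 0.3]  # complete: the weights sum to 1
assert abs(sigma_gate_prob((1, 0, 1), w) - 0.5) < 1e-12       # sum over K: w_1 + w_3
assert abs((1 - sigma_gate_prob((1, 0, 1), w)) - 0.5) < 1e-12  # P(Y=0|X): sum over L
```

The second assertion mirrors the complementary identity $P(Y=0|X\_1,\dots,X\_n) = \sum\_{i \in L} w\_i$ of Eq. (23).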

Eq. (9) specifies the "clockwise" strength of the relationship between $X\_i$ and $Y$: event $X\_i = 1$ causes event $A\_i = \text{ON}$ with "clockwise" weight $w\_i$. There is a question: given $X\_i = 0$, how likely is the event $A\_i = \text{OFF}$? In order to solve this problem, I define a so-called "counterclockwise" strength of the relationship between $X\_i$ and $Y$, denoted $\omega\_i$: event $X\_i = 0$ causes event $A\_i = \text{OFF}$ with "counterclockwise" weight $\omega\_i$. In other words, each arc in a simple graph is associated with a clockwise weight $w\_i$ and a counterclockwise weight $\omega\_i$. Such a graph is called a bi-weight simple graph, shown in Figure 4.

With bi-weight simple graph, all X-gate inferences are extended as so-called X-gate bi-inferences. Derived from Eq. (9), Eq. (24) specifies conditional probability of accountable variables with regard to bi-weight graph.

$$\begin{aligned} P(A\_i = \text{ON} | X\_i = 1) &= p\_i = w\_i \\ P(A\_i = \text{ON} | X\_i = 0) &= 1 - \rho\_i = 1 - \omega\_i \\ P(A\_i = \text{OFF} | X\_i = 1) &= 1 - p\_i = 1 - w\_i \\ P(A\_i = \text{OFF} | X\_i = 0) &= \rho\_i = \omega\_i \end{aligned} \tag{24}$$

The probabilities $P(A\_i = \text{ON} | X\_i = 0)$ and $P(A\_i = \text{OFF} | X\_i = 1)$ are called the clockwise adder $d\_i$ and the counterclockwise adder $\delta\_i$. As usual, $d\_i$ and $\delta\_i$ are smaller than $w\_i$ and $\omega\_i$. When $d\_i = 0$, the bi-weight graph becomes a normal simple graph.

$$d\_i = P(A\_i = \text{ON} | \mathbf{X}\_i = \mathbf{0}) = 1 - \rho\_i = 1 - \omega\_i$$

$$\delta\_i = P(A\_i = \text{OFF} | \mathbf{X}\_i = 1) = 1 - p\_i = 1 - w\_i$$

The total clockwise weight, or total counterclockwise weight, is defined as the sum of the clockwise weight and the clockwise adder, or of the counterclockwise weight and the counterclockwise adder, respectively. Eq. (25) specifies such total weights $W\_i$ and $\mathcal{W}\_i$. These weights are also called relationship powers.

$$\begin{aligned} W\_i &= w\_i + d\_i \\ \mathcal{W}\_i &= \omega\_i + \delta\_i \end{aligned}$$

where


$$\begin{array}{l}d\_i = 1 - \rho\_i = 1 - \omega\_i\\\delta\_i = 1 - p\_i = 1 - w\_i\end{array} \tag{25}$$

Given Eq. (25), the set of all $A\_i$(s) is complete if and only if $\sum\_{i=1}^{n} w\_i = 1$.

Figure 4. Bi-weight simple graph.

By extending aforementioned X-gate inferences, we get bi-inferences for AND-gate, OR-gate, NAND-gate, NOR-gate, XOR-gate, XNOR-gate, and U-gate as shown in Table 6.

The largest cardinalities of K (L) are $2^{n-1}$ and $2^n$ with and without fixed $X\_i$, respectively. Thus, it is possible to calculate the arrangement counters. As a convention, the product of probabilities is 1 if the index set is empty.

$$\prod\_{i \in I} f\_i = 1 \text{ if } I = \emptyset$$

With regard to SIGMA-gate bi-inference, the sum of all total clockwise weights must be 1 as follows

$$\sum\_{i=1}^{n} W\_i = \sum\_{i=1}^{n} (w\_i + d\_i) = \sum\_{i=1}^{n} (w\_i + 1 - \omega\_i) = 1$$

Derived from Eq. (23), the SIGMA-gate probability for bi-weight graph is

$$\begin{aligned} P(X\_1 + X\_2 + \ldots + X\_n) &= \sum\_{i=1}^n P(A\_i = \text{ON} | X\_i) \\ = \sum\_{i \in K} P(A\_i = \text{ON} | X\_i = 1) + \sum\_{i \in L} P(A\_i = \text{ON} | X\_i = 0) \\ = \sum\_{i \in K} w\_i + \sum\_{i \in L} d\_i \end{aligned}$$

Shortly, Eq. (26) specifies SIGMA-gate bi-inference.

$$P(X\_1 + X\_2 + \dots + X\_n) = \sum\_{i \in K} w\_i + \sum\_{i \in L} d\_i$$

where the set of Xi(s) is complete and mutually exclusive.

$$\begin{aligned} \sum\_{i=1}^{n} W\_i &= 1\\ X\_i \cap X\_j &= \emptyset, \forall i \neq j \end{aligned} \tag{26}$$
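Eq. (26) can be sketched like the simple-graph case (plain Python; the function name is mine, and the example weights are arbitrary values chosen only so that the total clockwise weights sum to 1):

```python
def sigma_gate_bi_prob(x, w, omega):
    """SIGMA-gate bi-inference (Eq. 26): sum of w_i over K plus d_i = 1 - omega_i over L."""
    d = [1 - oi for oi in omega]
    # Completeness: the total clockwise weights W_i = w_i + d_i must sum to 1
    assert abs(sum(wi + di for wi, di in zip(w, d)) - 1.0) < 1e-9
    return sum(wi if xi == 1 else di for xi, wi, di in zip(x, w, d))

# Example weights chosen so that sum(w_i + 1 - omega_i) = 1
w, omega = [0.3, 0.4], [0.9, 0.8]
assert abs(sigma_gate_bi_prob((1, 0), w, omega) - 0.5) < 1e-9  # w_1 + d_2 = 0.3 + 0.2
assert abs(sigma_gate_bi_prob((1, 1), w, omega) - 0.7) < 1e-9  # w_1 + w_2
```

When all the adders $d\_i$ are 0 (i.e., $\omega\_i = 1$), the result reduces to the plain SIGMA-gate probability of Eq. (23).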

The next section will research diagnostic relationship which adheres to X-gate inference.

### 4. Multihypothesis diagnostic relationship

Given the simple graph shown in Figure 2, if we replace the target source Y by an evidence D, we get a so-called multihypothesis diagnostic relationship whose property adheres to X-gate inference. There may be other diagnostic relationships in which X-gate inference is not concerned; however, this research focuses on X-gate inference, and so the multihypothesis diagnostic relationship is called the X-gate diagnostic relationship. Sources X1, X2,…, Xn become hypotheses. As a convention, these hypotheses have a uniform prior distribution.

According to the aforementioned X-gate network shown in Figures 2 and 3, the target variable must be binary whereas the evidence D can be numeric. It is impossible to establish the evidence D as the direct target variable. Thus, the solution of this problem is to add an augmented target binary variable Y; then, the evidence D is connected directly to Y. In other words, the X-gate diagnostic network has n sources {X1, X2,…, Xn}, one augmented hypothesis Y, and one evidence D. As a convention, the X-gate diagnostic network is called an X-D network. The CPTs of the entire network are determined based on the combination of the diagnostic relationship and the X-gate inference mentioned in previous sections. Figure 5 depicts the augmented X-D network. Note that the variables X1, X2,…, Xn, and Y are always binary.

Appendix A3 is the proof that the augmented X-D network is equivalent to the X-D network with regard to the variables X1, X2,…, Xn and D. As a convention, the augmented X-D network is considered the same as the X-D network.

The simplest case of X-D network is NOT-D network having one hypothesis X<sup>1</sup> and one evidence D, equipped with NOT-gate inference. NOT-D network satisfies diagnostic condition because it essentially represents the single diagnostic relationship. Inferred from Eqs. (1) and (7), the conditional probability P(D|X1) and posterior probability P(X1|D) of NOT-D network are

$$P(D|\mathbf{X}\_1) = \begin{cases} 1 - D \text{ if } \mathbf{X}\_1 = 1 \\ \qquad D \text{ if } \mathbf{X}\_1 = 0 \end{cases}$$

$$P(\mathbf{X}\_1|D) = \frac{P(D|\mathbf{X}\_1)P(\mathbf{X}\_1)}{P(\mathbf{X}\_1)\left(P(D|\mathbf{X}\_1 = 0) + P(D|\mathbf{X}\_1 = 1)\right)}$$

(Due to Bayes' rule and uniform distribution of X1)

Figure 5. Augmented X-D network.


$$=\frac{P(D|X\_1)}{P(D|X\_1=0) + P(D|X\_1=1)} = 1 \ast P(D|X\_1)$$

$$\left(\text{due to } P(D|X\_1=0) + P(D|X\_1=1) = 1\right)$$

It implies NOT-D network satisfies diagnostic condition. Let
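The NOT-D conclusion can be confirmed numerically. A small sketch (plain Python; the function names are mine, and the evidence is assumed normalized so that $M = S = 1$, purely for illustration):

```python
def not_d_likelihood(x1, d):
    """CPT of the NOT-D network with normalized evidence d in [0, 1] (M = S = 1)."""
    return 1 - d if x1 == 1 else d

def not_d_posterior(x1, d):
    """Posterior by Bayes' rule with the uniform prior P(X1=0) = P(X1=1) = 1/2."""
    prior = 0.5
    evidence = prior * (not_d_likelihood(0, d) + not_d_likelihood(1, d))
    return not_d_likelihood(x1, d) * prior / evidence

# Diagnostic condition: the posterior equals the likelihood (coefficient k = 1)
for d in (0.0, 0.25, 0.8, 1.0):
    for x1 in (0, 1):
        assert abs(not_d_posterior(x1, d) - not_d_likelihood(x1, d)) < 1e-12
```

The loop reproduces the derivation above: because $P(D|X\_1=0) + P(D|X\_1=1) = 1$, the uniform prior cancels and $P(X\_1|D) = P(D|X\_1)$.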

$$\Omega = \{X\_1, X\_2, \dots, X\_n\}$$

$$n = |\Omega|$$

We will validate whether the CPT of diagnostic relationship, P(D|X) specified by Eq. (6), still satisfies diagnostic condition within general case, X-D network. In other words, X-D network is general case of single diagnostic relationship.

Recall from dependencies shown in Figure 5, Eq. (27) specifies the joint probability of X-D network.

$$P(\Omega, Y, D) = P(X\_1, X\_2, \dots, X\_n, Y, D) = P(D|Y)P(Y|X\_1, X\_2, \dots, X\_n) \prod\_{i=1}^n P(X\_i) \tag{27}$$
 
where Ω = {X1, X2,…, Xn}.

Eq. (28) specifies the conditional probability of D given Xi (likelihood function) and the posterior probability of Xi given D.

$$\begin{aligned} P(D|X\_i) &= \frac{P(X\_i, D)}{P(X\_i)} = \frac{\sum\_{\{\Omega, Y, D\} \setminus \{X\_i, D\}} P(\Omega, Y, D)}{\sum\_{\{\Omega, Y, D\} \setminus \{X\_i\}} P(\Omega, Y, D)} \\\\ P(X\_i|D) &= \frac{P(X\_i, D)}{P(D)} = \frac{\sum\_{\{\Omega, Y, D\} \setminus \{X\_i, D\}} P(\Omega, Y, D)}{\sum\_{\{\Omega, Y, D\} \setminus \{D\}} P(\Omega, Y, D)} \end{aligned} \tag{28}$$

where Ω = {X1, X2,…, Xn} and the sign "\" denotes the subtraction (excluding) operator in set theory [10]. Eq. (29) specifies the joint probability P(Xi, D) and the marginal probability P(D) given uniform distribution of all sources. Appendix A4 is the proof of Eq. (29).

$$\begin{aligned} P(\mathbf{X}\_{i\prime}D) &= \frac{1}{2^n S} \left( (2D - M)\mathbf{s}(\boldsymbol{\Omega} : \{\mathbf{X}\_i\}) + 2^{n-1}(M - D) \right) \\ P(D) &= \frac{1}{2^n S} \left( (2D - M)\mathbf{s}(\boldsymbol{\Omega}) + 2^n(M - D) \right) \end{aligned} \tag{29}$$

where s(Ω) and s(Ω:{Xi}) are specified in Table 2. From Eqs. (28) and (29), Eq. (30) specifies the conditional probability P(D|Xi), the posterior probability P(Xi|D), and the transformation coefficient for X-gate inference.


$$P(D|\mathbf{X}\_i = 1) = \frac{P(\mathbf{X}\_i = 1, D)}{P(\mathbf{X}\_i = 1)} = \frac{(2D - M)s(\Omega : \{\mathbf{X}\_i = 1\}) + 2^{n-1}(M - D)}{2^{n-1}S}$$

$$P(D|\mathbf{X}\_i = 0) = \frac{P(\mathbf{X}\_i = 0, D)}{P(\mathbf{X}\_i = 0)} = \frac{(2D - M)s(\Omega : \{\mathbf{X}\_i = 0\}) + 2^{n-1}(M - D)}{2^{n-1}S}$$

$$P(\mathbf{X}\_i = 1 | D) = \frac{P(\mathbf{X}\_i = 1, D)}{P(D)} = \frac{(2D - M)s(\Omega : \{\mathbf{X}\_i = 1\}) + 2^{n-1}(M - D)}{(2D - M)s(\Omega) + 2^n(M - D)}\tag{30}$$

$$P(\mathbf{X}\_i = 0 | D) = 1 - P(\mathbf{X}\_i = 1 | D) = \frac{(2D - M)s(\Omega : \{\mathbf{X}\_i = 0\}) + 2^{n-1}(M - D)}{(2D - M)s(\Omega) + 2^n(M - D)}$$

$$k = \frac{P(\mathbf{X}\_i | D)}{P(D|\mathbf{X}\_i)} = \frac{2^{n-1}S}{(2D - M)s(\Omega) + 2^n(M - D)}$$

The transformation coefficient is rewritten as follows

$$k = \frac{2^{n-1}\mathcal{S}}{2D\left(s(\Omega) - 2^{n-1}\right) + M\left(2^n - s(\Omega)\right)}$$
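As a sanity check on Eq. (30), the two posteriors must sum to 1 whenever $s(\Omega) = s(\Omega:\{X\_i=1\}) + s(\Omega:\{X\_i=0\})$, since the two numerators then add up exactly to the denominator. A sketch in plain Python (the numeric counter values are arbitrary illustrations, not values taken from Table 2):

```python
def posterior_x(si1, si0, n, D, M):
    """Posteriors of Eq. (30) from the counters s(Omega:{X_i=1}) = si1 and
    s(Omega:{X_i=0}) = si0, with s(Omega) = si1 + si0."""
    s_omega = si1 + si0
    denom = (2 * D - M) * s_omega + 2 ** n * (M - D)
    p1 = ((2 * D - M) * si1 + 2 ** (n - 1) * (M - D)) / denom
    p0 = ((2 * D - M) * si0 + 2 ** (n - 1) * (M - D)) / denom
    return p1, p0

# The two posteriors always sum to 1 because the numerators add up to the denominator
si1, si0, n, M = 1.3, 2.2, 3, 1.0
for D in (0.0, 0.3, 0.9):
    p1, p0 = posterior_x(si1, si0, n, D, M)
    assert abs(p1 + p0 - 1.0) < 1e-12
```

This mirrors the identity $P(X\_i = 0|D) = 1 - P(X\_i = 1|D)$ stated in Eq. (30).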

Note that S, D, and M are abstract symbols, and there is no proportional connection between $2^{n-1}S$ and $D$ for all D, as specified by Eq. (6). Assume that such a proportional connection $2^{n-1}S = aD^j$ exists for all D, where a is an arbitrary constant. Given the binary case when D = 0 and S = 1, we have

$$2^{n-1} = 2^{n-1} \* 1 = 2^{n-1}S = aD^j = a \* 0^j = 0$$

There is a contradiction, which implies that it is impossible to reduce k into the following form

$$k = \frac{aD^j}{bD^j}$$

Therefore, if k is constant with regard to D then,

$$2D\left(s(\Omega) - 2^{n-1}\right) + M\left(2^n - s(\Omega)\right) = C \neq 0, \forall D$$

where C is constant. We have

$$\begin{aligned} &\sum\_{\mathcal{D}} \left( 2D \left( s(\Omega) - 2^{n-1} \right) + M \left( 2^n - s(\Omega) \right) \right) = \sum\_{\mathcal{D}} \mathcal{C} \\ &\Rightarrow 2S \Big( s(\Omega) - 2^{n-1} \Big) + NM \Big( 2^n - s(\Omega) \Big) = NC \\ &\Rightarrow 2^n S = NC \end{aligned}$$

It is implied that


$$P(X\_1 \odot X\_2 \odot \dots \odot X\_n) = \prod\_{i \in K} p\_i \prod\_{i \in L} d\_i$$

$$P(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n) = 1 - \prod\_{i \in K} \delta\_i \prod\_{i \in L} \rho\_i$$

$$P\left(\text{not}(X\_1 \odot X\_2 \odot \dots \odot X\_n)\right) = 1 - \prod\_{i \in K} p\_i \prod\_{i \in L} d\_i$$

$$P\left(\text{not}(X\_1 \oplus X\_2 \oplus \dots \oplus X\_n)\right) = \prod\_{i \in K} \delta\_i \prod\_{i \in L} \rho\_i$$

$$P(X\_1 \otimes X\_2 \otimes \dots \otimes X\_n) = \prod\_{i \in O\_1} p\_i \prod\_{i \in O\_2} d\_i \prod\_{i \in E\_1} \delta\_i \prod\_{i \in E\_2} \rho\_i + \prod\_{i \in E\_1} p\_i \prod\_{i \in E\_2} d\_i \prod\_{i \in O\_1} \delta\_i \prod\_{i \in O\_2} \rho\_i$$

$$P\left(\text{not}(X\_1 \otimes X\_2 \otimes \dots \otimes X\_n)\right) = \prod\_{i \in K} p\_i \prod\_{i \in L} d\_i + \prod\_{i \in K} \delta\_i \prod\_{i \in L} \rho\_i$$

$$P(X\_1 \uplus X\_2 \uplus \dots \uplus X\_n) = \sum\_{U \in \mathcal{U}} \left(\prod\_{i \in U \cap K} p\_i \prod\_{i \in U \cap L} d\_i\right) \left(\prod\_{i \in \overline{U} \cap K} \delta\_i \prod\_{i \in \overline{U} \cap L} \rho\_i\right)$$

There are four common conditions of U: |U| = α, |U| ≥ α, |U| ≤ β, and α ≤ |U| ≤ β. Note that $\overline{U}$ is the complement of U,

$$\overline{\mathcal{U}} = \{1, 2, \ldots, n\} \backslash \mathcal{U}$$

The largest cardinality of $\mathcal{U}$ is $|\mathcal{U}| = 2^n$.

Table 6. Bi-inferences for AND-gate, OR-gate, NAND-gate, NOR-gate, XOR-gate, XNOR-gate, and U-gate.

$$k = \frac{2^{n-1}\mathcal{S}}{2D\left(s(\Omega) - 2^{n-1}\right) + M\left(2^n - s(\Omega)\right)} = \frac{NC}{2C} = \frac{N}{2}$$

This holds

$$\begin{split} 2^{n}S = N\Big(2D\Big(s(\varOmega) - 2^{n-1}\Big) + M\Big(2^{n} - s(\varOmega)\Big)\Big) &= 2ND\Big(s(\varOmega) - 2^{n-1}\Big) + 2S\Big(2^{n} - s(\varOmega)\Big) \\ &\Rightarrow 2ND\Big(s(\varOmega) - 2^{n-1}\Big) - 2S\Big(s(\varOmega) - 2^{n-1}\Big) = 0 \\ &\Rightarrow (ND - S)\Big(s(\varOmega) - 2^{n-1}\Big) = 0 \end{split}$$

Assuming ND = S we have

$$ND = \mathcal{S} = 2NM \Rightarrow D = 2M$$

There is a contradiction because M is the maximum value of D. Therefore, if k is constant with regard to D then $s(\Omega) = 2^{n-1}$. Inversely, if $s(\Omega) = 2^{n-1}$ then k is

$$k = \frac{2^{n-1}S}{2D(2^{n-1} - 2^{n-1}) + M(2^n - 2^{n-1})} = \frac{N}{2}$$

Given that the X-D network is the combination of diagnostic relationship and X-gate inference:

$$P(Y=1|X\_1, X\_2, \dots, X\_n) = P(X\_1 \circ X\_2 \circ \dots \circ X\_n)$$

$$P(D|Y) = \begin{cases} \frac{D}{S} \text{if } Y = 1\\ \frac{M}{S} - \frac{D}{S} \text{if } Y = 0 \end{cases}$$

The diagnostic condition of X-D network is satisfied if and only if

$$s(\Omega) = \sum\_{a} P\left(Y = 1 | a(\Omega) \right) = 2^{|\Omega|-1}, \forall \Omega \neq \emptyset$$

At that time, the transformation coefficient becomes:

$$k = \frac{N}{2}$$

Note that the weights $p\_i = w\_i$ and $\rho\_i = \omega\_i$, which are inputs of $s(\Omega)$, are abstract variables. Thus, the equality $s(\Omega) = 2^{|\Omega|-1}$ implies that all abstract variables are removed, and so $s(\Omega)$ does not depend on the weights.

Table 7. Diagnostic theorem.
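For instance, the SIGMA-gate satisfies this diagnostic condition: summing Eq. (23) over all $2^{|\Omega|}$ arrangements, each $w\_i$ appears in exactly $2^{|\Omega|-1}$ of them, so $s(\Omega) = 2^{|\Omega|-1}\sum\_i w\_i = 2^{|\Omega|-1}$. A quick numeric check (plain Python; the function name is mine):

```python
from itertools import product

def s_omega_sigma(w):
    """s(Omega) for the SIGMA-gate: sum of P(Y=1|a) over all 2^n arrangements a."""
    return sum(sum(wi for ai, wi in zip(a, w) if ai == 1)
               for a in product([0, 1], repeat=len(w)))

w = [0.2, 0.5, 0.3]  # complete: the weights sum to 1
assert abs(s_omega_sigma(w) - 2 ** (len(w) - 1)) < 1e-12  # s(Omega) = 4 = 2^(3-1)
```

The result is independent of the particular weights, matching the remark that $s(\Omega)$ must not depend on the abstract variables.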

<sup>k</sup> <sup>¼</sup> <sup>2</sup><sup>n</sup>�<sup>1</sup><sup>S</sup>

<sup>s</sup>ðΩÞ � <sup>2</sup><sup>n</sup>�<sup>1</sup>

� þ M �

Table 6. Bi-inferences for AND-gate, OR-gate, NAND-gate, NOR-gate, XOR-gate, XNOR-gate, and U-gate.

$$P(X_1 \odot X_2 \odot \dots \odot X_n) = \prod_{i \in K} p_i \prod_{i \in L} d_i$$

$$\mathrm{not}(X_1 \odot X_2 \odot \dots \odot X_n) = 1 - \prod_{i \in K} p_i \prod_{i \in L} d_i$$

$$P(X_1 \oplus X_2 \oplus \dots \oplus X_n) = 1 - \prod_{i \in K} \delta_i \prod_{i \in L} \rho_i$$

$$\mathrm{not}(X_1 \oplus X_2 \oplus \dots \oplus X_n) = \prod_{i \in K} \delta_i \prod_{i \in L} \rho_i$$

$$\mathrm{not}(X_1 \otimes X_2 \otimes \dots \otimes X_n) = \prod_{i \in K} p_i \prod_{i \in L} d_i + \prod_{i \in K} \delta_i \prod_{i \in L} \rho_i$$

$$P(X_1 \uplus X_2 \uplus \dots \uplus X_n) = \sum_{U \in \mathcal{U}} \left( \prod_{i \in U \cap K} p_i \prod_{i \in U \cap L} d_i \right) \left( \prod_{i \in \overline{U} \cap K} \delta_i \prod_{i \in \overline{U} \cap L} \rho_i \right)$$

There are four common conditions on U: |U| = α, |U| ≥ α, |U| ≤ β, and α ≤ |U| ≤ β. Note that $\overline{U}$ is the complement of U:

$$\overline{U} = \{1, 2, \dots, n\} \setminus U$$

The largest cardinality of the set $\mathcal{U}$ is:

$$|\mathcal{U}| = 2^n$$


126 Bayesian Inference


In general, the event that k is constant with regard to D is equivalent to the event $s(\Omega) = 2^{n-1}$. This implies the diagnostic theorem stated in Table 7.

The diagnostic theorem is the most efficient way to validate the diagnostic condition.

Eq. (30) becomes simple with AND-gate inference. Recall that Eq. (14) specified AND-gate inference as follows:

$$P(X\_1 \odot X\_2 \odot \dots \odot X\_n) = P(Y = 1 | X\_1, X\_2, \dots, X\_n) = \begin{cases} \prod\_{i=1}^n p\_i & \text{if all } X\_i = 1\\ 0 & \text{if at least one } X\_i = 0 \end{cases}$$

Because there is only one case in which X_1 = X_2 = … = X_n = 1, we have

$$s(\Omega) = s(\Omega : \{X\_i = 1\}) = \prod\_{i=1}^n p\_i$$

For the case X_i = 0, we have

$$s(\Omega : \{\mathbf{X}\_i = \mathbf{0}\}) = \mathbf{0}$$

Derived from Eq. (30), Eq. (31) specifies the conditional probability P(D|X_i), the posterior probability P(X_i|D), and the transformation coefficient according to the X-D network with AND-gate inference, called the AND-D network.

$$P(D|\mathbf{X}\_i = 1) = \frac{(2D - M)\prod\_{i=1}^n p\_i + 2^{n-1}(M - D)}{2^{n-1}S}$$

$$P(D|\mathbf{X}\_i = 0) = \frac{M - D}{S}$$

$$P(\mathbf{X}\_i = 1|D) = \frac{(2D - M)\prod\_{i=1}^n p\_i + 2^{n-1}(M - D)}{(2D - M)\prod\_{i=1}^n p\_i + 2^n(M - D)}\tag{31}$$

$$P(\mathbf{X}\_i = 0|D) = \frac{2^{n-1}(M - D)}{(2D - M)\prod\_{i=1}^n p\_i + 2^n(M - D)}$$

$$k = \frac{2^{n-1}S}{(2D - M)\prod\_{i=1}^n p\_i + 2^n(M - D)}$$
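Eq. (31) can be exercised numerically; the sketch below uses made-up values for n, M, and the $p_i$ (they are not taken from the chapter) and tabulates the posterior P(X_i = 1|D):

```python
import math

# Posterior of an AND-D network per Eq. (31); n, M and the p_i are
# illustrative values chosen for the sketch.
n, M = 3, 4
p = [0.9, 0.8, 0.7]
prod_p = math.prod(p)            # prod_{i=1}^{n} p_i

def posterior(D):
    """P(X_i = 1 | D) per Eq. (31)."""
    num = (2*D - M) * prod_p + 2**(n - 1) * (M - D)
    den = (2*D - M) * prod_p + 2**n * (M - D)
    return num / den

for D in range(M + 1):
    print(D, round(posterior(D), 4))
# Maximal evidence D = M drives the posterior to exactly 1. Note that k,
# which carries the same denominator, still depends on D here, so the
# AND-D network cannot satisfy the diagnostic condition.
assert abs(posterior(M) - 1.0) < 1e-12
```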

For convenience, we validate the diagnostic condition with a case of two sources: Ω = {X_1, X_2}, p_1 = p_2 = w_1 = w_2 = 0.5, D ∈ {0, 1, 2, 3}. According to the diagnostic theorem stated in Table 7, if s(Ω) ≠ 2 for a given X-gate, then that X-gate does not satisfy the diagnostic condition.


Given AND-gate inference, by applying Eq. (14), we have

$$s(\Omega) = (0.5 \ast 0.5) + 0 + 0 + 0 = 0.25$$

Given OR-gate inference, by applying Eq. (16), we have

$$s(\Omega) = (1 - 0.5 \ast 0.5) + (1 - 0.5) + (1 - 0.5) + 0 = 1.75$$

Given XOR-gate inference, by applying Eq. (19), we have

$$s(\Omega) = (0.5 \ast 0.5 + 0.5 \ast 0.5) + 0.5 + 0.5 + 0 = 1.5$$

Given XNOR-gate inference, by applying Eq. (21), we have

$$s(\Omega) = (0.5 \ast 0.5 + 0.5 \ast 0.5) + 0.5 + 0.5 + 1 = 2.5$$

Given SIGMA-gate inference, by applying Eq. (23), we have

$$s(\Omega) = (0.5 + 0.5) + 0.5 + 0.5 + 0 = 2$$

It is asserted that the AND-gate, OR-gate, XOR-gate, and XNOR-gate do not satisfy the diagnostic condition, and so they should not be used to assess hypotheses. However, whether the U-gate and SIGMA-gate satisfy the diagnostic condition is not yet asserted. It is necessary to extend the equation for the SIGMA-gate diagnostic network (called the SIGMA-D network) in order to validate it.
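The five arrangement sums can be tallied in a few lines; the per-configuration probabilities below are copied from the worked example (p_1 = p_2 = w_1 = w_2 = 0.5), and only the SIGMA-gate reaches 2^(n−1) = 2:

```python
# Arrangement sums s(Omega) for the two-source example above; each entry
# sums P(Y = 1 | X1, X2) over the configurations (1,1), (1,0), (0,1),
# (0,0), with per-configuration values copied from the worked example.
s = {
    "AND":   0.5*0.5 + 0 + 0 + 0,
    "OR":    (1 - 0.5*0.5) + (1 - 0.5) + (1 - 0.5) + 0,
    "XOR":   (0.5*0.5 + 0.5*0.5) + 0.5 + 0.5 + 0,
    "XNOR":  (0.5*0.5 + 0.5*0.5) + 0.5 + 0.5 + 1,
    "SIGMA": (0.5 + 0.5) + 0.5 + 0.5 + 0,
}
# Diagnostic condition for n = 2 sources: s(Omega) must equal 2^(n-1) = 2.
for gate, value in s.items():
    print(f"{gate:5s} s(Omega) = {value:<4} passes: {value == 2}")
```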

In case of SIGMA-gate inference, by applying Eq. (23), we have

Converting Graphic Relationships into Conditional Probabilities in Bayesian Network http://dx.doi.org/10.5772/intechopen.70057 129

$$\sum\_{i} w\_{i} = 1$$

$$s(\Omega) = 2^{n-1} \sum\_{i} w\_{i} = 2^{n-1}$$

$$s(\Omega : \{X\_{i} = 1\}) = 2^{n-1} w\_{i} + 2^{n-2} \sum\_{j \neq i} w\_{j} = 2^{n-1} w\_{i} + 2^{n-2} (1 - w\_{i}) = 2^{n-2} (1 + w\_{i})$$

$$s(\Omega : \{X\_{i} = 0\}) = s(\Omega) - s(\Omega : \{X\_{i} = 1\}) = 2^{n-2} (1 - w\_{i})$$


It is necessary to validate SIGMA-D network with SIGMA-gate bi-inference. By applying Eq. (26), we recalculate these quantities as follows

$$s(\Omega) = 2^{n-1} \sum\_{i} w\_i + 2^{n-1} \sum\_{i} d\_i = 2^{n-1} \sum\_{i} (w\_i + d\_i) = 2^{n-1}$$

$$\left(\text{due to} \sum\_{i} (w\_i + d\_i) = 1\right)$$

$$s(\Omega : \{\mathbf{X}\_i = 1\}) = 2^{n-1} w\_i + 2^{n-2} \sum\_{j \neq i} w\_j + 2^{n-2} \sum\_{i} d\_i = 2^{n-2} w\_i + 2^{n-2} \sum\_{i} (w\_i + d\_i) = 2^{n-2} (1 + w\_i)$$

$$s(\Omega : \{\mathbf{X}\_i = 0\}) = s(\Omega) - s(\Omega : \{\mathbf{X}\_i = 1\}) = 2^{n-2} (1 - w\_i)$$

Obviously, the quantities s(Ω), s(Ω:{X_i = 1}), and s(Ω:{X_i = 0}) are kept intact. According to the diagnostic theorem, we conclude that the SIGMA-D network does satisfy the diagnostic condition due to $s(\Omega) = 2^{n-1}$. Thus, the SIGMA-D network can be used to assess hypotheses.

Eq. (32), an immediate consequence of Eq. (30), specifies conditional probability P(D|Xi), posterior probability P(Xi|D), and transformation coefficient for SIGMA-D network.

$$\begin{aligned} P(D|X\_i = 1) &= \frac{(2D - M)w\_i + M}{2S} \\ P(D|X\_i = 0) &= \frac{(M - 2D)w\_i + M}{2S} \\ P(X\_i = 1|D) &= \frac{(2D - M)w\_i + M}{2M} \\ P(X\_i = 0|D) &= \frac{(M - 2D)w\_i + M}{2M} \\ k &= \frac{N}{2} \end{aligned} \tag{32}$$
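A small sketch makes Eq. (32) concrete; the weights and the range of D are illustrative values of my choosing:

```python
# Illustrative SIGMA-D assessment per Eq. (32): n = 3 hypotheses with
# made-up weights w_i, numeric evidence D in {0, ..., M}.
M = 10                       # maximum value of the evidence D
N = M + 1                    # number of possible evidence values
w = [0.5, 0.3, 0.2]          # weights w_i, summing to 1

def posterior(i, D):
    """P(X_i = 1 | D) = ((2D - M) w_i + M) / (2M), per Eq. (32)."""
    return ((2*D - M) * w[i] + M) / (2*M)

print(posterior(0, M))       # strongest evidence D = M: 0.5 + w_0/2 = 0.75
print(posterior(0, 0))       # weakest evidence D = 0:  0.5 - w_0/2 = 0.25
print(posterior(0, M // 2))  # neutral evidence D = M/2: exactly 0.5

# The two posteriors of Eq. (32) are complementary for every D.
assert all(abs(posterior(i, D) + ((M - 2*D)*w[i] + M)/(2*M) - 1) < 1e-12
           for i in range(3) for D in range(N))
```

A larger weight w_i widens the interval [0.5 − w_i/2, 0.5 + w_i/2] that the posterior can reach, which matches the linear form of Eq. (32).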

In case of SIGMA-gate, the augmented variable Y can be removed from X-D network. The evidence D is now established as direct target variable. Figure 6 shows a so-called direct SIGMA-gate diagnostic network (direct SIGMA-D network).

Derived from Eq. (23), the CPT of direct SIGMA-D network is determined by Eq. (33).

$$P(D|X\_1, X\_2, \ldots, X\_n) = \sum\_{i \in K} \frac{D}{S} w\_i + \sum\_{j \in L} \frac{M - D}{S} w\_j$$

where the set of Xi (s) is complete and mutually exclusive.

$$\sum\_{i=1}^{n} w\_i = 1, \qquad X\_i \cap X\_j = \varnothing, \ \forall i \neq j \tag{33}$$

Eq. (33) specifies valid CPT due to

$$\begin{aligned} \sum\_{D} \mathbf{P}(D|\mathbf{X}\_{1}, \mathbf{X}\_{2}, \dots, \mathbf{X}\_{n}) &= \frac{1}{S} \sum\_{i \in K} w\_{i} \sum\_{D} D + \frac{1}{S} \sum\_{j \in L} w\_{j} \sum\_{D} (M - D) \\ &= \frac{1}{S} \sum\_{i \in K} Sw\_{i} + \frac{1}{S} \sum\_{j \in L} w\_{j} (NM - S) = \frac{1}{S} \sum\_{i \in K} Sw\_{i} + \frac{1}{S} \sum\_{j \in L} Sw\_{j} = \sum\_{i=1}^{n} w\_{i} = 1 \end{aligned}$$
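The validity argument above can be verified mechanically; this sketch checks that the CPT of Eq. (33) sums to 1 over D for every source configuration (the weights are made-up values summing to 1):

```python
from itertools import product

# Validity check for the direct SIGMA-D CPT of Eq. (33): for every
# configuration of the sources, the probabilities over D sum to 1.
# K is the set of active sources (X_i = 1), L its complement.
M = 4
S = sum(range(M + 1))               # S = 0 + 1 + ... + M
w = [0.4, 0.35, 0.25]               # illustrative weights, sum to 1

def cpt(D, x):
    """P(D | X_1,...,X_n) = sum_{i in K} (D/S) w_i + sum_{j in L} ((M-D)/S) w_j."""
    return sum((D / S) * wi if xi else ((M - D) / S) * wi
               for wi, xi in zip(w, x))

for x in product([0, 1], repeat=len(w)):
    total = sum(cpt(D, x) for D in range(M + 1))
    assert abs(total - 1.0) < 1e-12
print("valid CPT")
```

The check works for any M because the key identity NM − S = S holds whenever D ranges over {0, …, M}, exactly as in the derivation above.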

From dependencies shown in Figure 6, Eq. (34) specifies the joint probability of direct SIGMA-D network.

$$P(X\_1, X\_2, \dots, X\_n, D) = P(D | X\_1, X\_2, \dots, X\_n) \prod\_{i=1}^{n} P(X\_i) \tag{34}$$

Inferred from Eq. (29), Eq. (35) specifies the joint probability P(Xi, D) and the marginal probability P(D) of direct SIGMA-D network, given uniform distribution of all sources.

$$P(X\_i, D) = \frac{1}{2^n} s(\Omega : \{X\_i\}), \qquad P(D) = \frac{1}{2^n} s(\Omega) \tag{35}$$

where s(Ω) and s(Ω:{Xi}) are specified in Table 2.

By browsing all variables of direct SIGMA-D network, we have

Figure 6. Direct SIGMA-gate diagnostic network (direct SIGMA-D network).


$$s(\Omega : \{X\_i = 1\}) = 2^{n-1} \frac{D}{S} w\_i + 2^{n-2} \sum\_{j \neq i} \frac{D}{S} w\_j + 2^{n-2} \sum\_{j \neq i} \frac{M - D}{S} w\_j$$

$$= \frac{2^{n-2}}{S} (2Dw\_i + M \sum\_{j \neq i} w\_j) = \frac{2^{n-2}}{S} \left(2Dw\_i + M(1 - w\_i)\right)$$

$$\left(\text{Due to } \sum\_{i=1}^n w\_i = 1\right)$$

$$= \frac{2^{n-2}}{S} \left((2D - M)w\_i + M\right)$$

Similarly, we have


$$s(\Omega : \{X\_i = 0\}) = 2^{n-1} \frac{M-D}{S} w\_i + 2^{n-2} \sum\_{j \neq i} \frac{M-D}{S} w\_j + 2^{n-2} \sum\_{j \neq i} \frac{D}{S} w\_j = \frac{2^{n-2}}{S} \left( (M-2D) w\_i + M \right)$$

$$s(\Omega) = 2^{n-1} \sum\_i \frac{D}{S} w\_i + 2^{n-1} \sum\_i \frac{M-D}{S} w\_i = \frac{2^{n-1} M}{S}$$

By applying Eq. (35) with s(Ω:{X_i = 0}), s(Ω:{X_i = 1}), and s(Ω), we get the same result as Eq. (32).

$$P(D|X\_i = 1) = \frac{(2D - M)w\_i + M}{2S}$$

$$P(D|X\_i = 0) = \frac{(M - 2D)w\_i + M}{2S}$$

$$P(X\_i = 1|D) = \frac{(2D - M)w\_i + M}{2M}$$

$$P(X\_i = 0|D) = \frac{(M - 2D)w\_i + M}{2M}$$

$$k = \frac{N}{2}$$

Therefore, it is possible to use the direct SIGMA-D network to assess hypotheses. It is asserted that the SIGMA-D network satisfies the diagnostic condition, since the single diagnostic relationship, the NOT-D network, and the direct SIGMA-D network are specific cases of the SIGMA-D network. There remains a question: does an X-D network exist, different from the SIGMA-D network and those aforementioned, that satisfies the diagnostic condition?

Recall that each X-D network is a pattern owning a particular X-gate inference, which in turn is based on particular X-gate condition(s) relevant only to the variables $A_i$(s). The most general nonlinear X-D network is the U-D network, whereas the SIGMA-D network is a linear one. The U-gate inference given an arbitrary condition on U is

$$P(X\_1 \uplus X\_2 \uplus \dots \uplus X\_n) = \sum\_{U \in \mathcal{U}} \left( \prod\_{i \in U \cap K} p\_i \prod\_{i \in U \cap L} (1 - \rho\_i) \right) \left( \prod\_{i \in \overline{U} \cap K} (1 - p\_i) \prod\_{i \in \overline{U} \cap L} \rho\_i \right)$$

Let f be the arrangement sum of U-gate inference.

$$f(p\_i, \rho\_i) = \sum\_{a(\Omega)} \sum\_{U \in \mathcal{U}} \left( \prod\_{i \in U \cap K} p\_i \prod\_{i \in U \cap L} (1 - \rho\_i) \right) \left( \prod\_{i \in \overline{U} \cap K} (1 - p\_i) \prod\_{i \in \overline{U} \cap L} \rho\_i \right)$$

The function f is sum of many large expressions and each expression is product of four possible sub-products (Π) as follows

$$Expr = \prod\_{i \in \mathcal{U} \cap \mathcal{K}} p\_i \prod\_{i \in \mathcal{U} \cap L} (1 - \rho\_i) \prod\_{i \in \overline{\mathcal{U}} \cap \mathcal{K}} (1 - p\_i) \prod\_{i \in \overline{\mathcal{U}} \cap L} \rho\_i$$

In any case of degradation, there always exist expressions Expr having at least two sub-products (Π), for example,

$$Expr = \prod\_{i \in \mathcal{U} \cap K} p\_i \prod\_{i \in \mathcal{U} \cap L} (1 - \rho\_i)$$

Consequently, there always exist Expr (s) having at least 5 terms relevant to pi and ρ<sup>i</sup> if n ≥ 5, for example,

$$Expr = p\_1 p\_2 p\_3 (1 - \rho\_4)(1 - \rho\_5)$$

Thus, the degree of f is larger than or equal to 5 given n ≥ 5. According to the diagnostic theorem, a U-gate network satisfies the diagnostic condition if and only if $f(p_i, \rho_i) = 2^{n-1}$ for all n ≥ 1 and for all abstract variables $p_i$ and $\rho_i$. Without loss of generality, each $p_i$ or $\rho_i$ is the sum of a variable x and a variable $a_i$ or $b_i$, respectively. Note that all of $p_i$, $\rho_i$, $a_i$, and $b_i$ are abstract variables.

$$p\_i = \mathbf{x} + a\_i$$

$$\rho\_i = \mathbf{x} + b\_i$$

The equation $f - 2^{n-1} = 0$ becomes the equation g(x) = 0, whose degree is m ≥ 5 if n ≥ 5.

$$g(x) = \pm x^{m} + C\_{1} x^{m-1} + \dots + C\_{m-1} x + C\_{m} - 2^{n-1} = 0$$

where the coefficients $C_i$ are functions of the $a_i$ and $b_i$. According to the Abel-Ruffini theorem [11], the equation g(x) = 0 has no algebraic solution when m ≥ 5. Thus, the abstract variables $p_i$ and $\rho_i$ cannot be eliminated entirely from g(x) = 0, and so there is no specification of the U-gate inference $P(X_1 \uplus X_2 \uplus \dots \uplus X_n)$ for which the diagnostic condition is satisfied.

It is concluded that there is no nonlinear X-D network satisfying the diagnostic condition, but a new question is raised: does there exist a general linear X-D network satisfying the diagnostic condition? Such a linear network is called a GL-D network, and the SIGMA-D network is a specific case of the GL-D network. The GL-gate probability must be a linear combination of the weights.

$$P(\mathbf{X}\_1 \mathbf{x} \mathbf{X}\_2 \mathbf{x} \dots \mathbf{x} \mathbf{X}\_n) = \mathbf{C} + \sum\_{i=1}^n \alpha\_i w\_i + \sum\_{i=1}^n \beta\_i d\_i$$

where C is an arbitrary constant.


The GL-gate inference is singular if α<sup>i</sup> and β<sup>i</sup> are functions of only Xi as follows

$$P(\mathbf{X}\_1 \mathbf{x} \mathbf{X}\_2 \mathbf{x} \dots \mathbf{x} \mathbf{X}\_n) = \mathbf{C} + \sum\_{i=1}^n h\_i(\mathbf{X}\_i) w\_i + \sum\_{i=1}^n g\_i(\mathbf{X}\_i) d\_i$$

The functions $h_i$ and $g_i$ are not relevant to $A_i$ because the final equation of GL-gate inference involves only the $X_i$(s) and the weights. Because the GL-D network is a pattern, we only survey the singular GL-gate. The GL-gate mentioned here is singular by default, and it depends on how the functions $h_i$ and $g_i$ are defined. The arrangement sum with regard to the GL-gate is

$$\mathbf{s}(\boldsymbol{\Omega}) = \sum\_{a} \left( \mathbf{C} + \sum\_{i=1}^{n} h\_i(\mathbf{X}\_i) \mathbf{w}\_i + \sum\_{i=1}^{n} g\_i(\mathbf{X}\_i) d\_i \right)$$

$$s(\Omega) = 2^n C + 2^{n-1} \sum\_{i=1}^{n} \left( h\_i(X\_i = 1) + h\_i(X\_i = 0) \right) w\_i + 2^{n-1} \sum\_{i=1}^{n} \left( g\_i(X\_i = 1) + g\_i(X\_i = 0) \right) d\_i$$

Suppose hi and gi are probability mass functions with regard to Xi. For all i, we have

$$\begin{aligned} 0 \le h\_i(X\_i) \le 1 \\\\ 0 \le g\_i(X\_i) \le 1 \\\\ h\_i(X\_i = 1) + h\_i(X\_i = 0) = 1 \\\\ g\_i(X\_i = 1) + g\_i(X\_i = 0) = 1 \end{aligned}$$

The arrangement sum becomes

$$s(\Omega) = 2^n C + 2^{n-1} \sum\_{i=1}^n (w\_i + d\_i)$$

The GL-D network satisfies the diagnostic condition if

$$s(\Omega) = 2^n C + 2^{n-1} \sum\_{i=1}^n (w\_i + d\_i) = 2^{n-1} \Rightarrow 2C + \sum\_{i=1}^n (w\_i + d\_i) = 1$$

Suppose the set of $X_i$(s) is complete.

$$\sum\_{i=1}^{n} (w\_i + d\_i) = 1$$

This implies C = 0. Shortly, Eq. (36) specifies the singular GL-gate inference so that GL-D network satisfies diagnostic condition.

$$P(\mathbf{X}\_1 \mathbf{x} \mathbf{X}\_2 \mathbf{x} \dots \mathbf{x} \mathbf{X}\_n) = \sum\_{i=1}^n h\_i(\mathbf{X}\_i) w\_i + \sum\_{i=1}^n g\_i(\mathbf{X}\_i) d\_i$$

where $h_i$ and $g_i$ are probability mass functions and the set of $X_i$(s) is complete:

$$\sum\_{i=1}^{n} w\_i = 1 \tag{36}$$

The functions $h_i(X_i)$ and $g_i(X_i)$ are always linear because $X_i^m = X_i$ for all m ≥ 1 when $X_i$ is binary. It is easy to infer that the SIGMA-D network is a GL-D network with the following definition of the functions $h_i$ and $g_i$.

$$h\_i(X\_i) = 1 - g\_i(X\_i) = X\_i, \ \forall i$$
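This reduction can be checked numerically; the sketch below instantiates the singular GL-gate with h_i(X_i) = X_i and g_i(X_i) = 1 − X_i (the SIGMA-gate special case) and made-up weights satisfying Σ(w_i + d_i) = 1 with C = 0, then confirms the arrangement sum equals 2^(n−1):

```python
from itertools import product

# Sketch of the singular GL-gate of Eq. (36) with h_i(X_i) = X_i and
# g_i(X_i) = 1 - X_i; the weights are illustrative values chosen so
# that sum(w_i + d_i) = 1, with C = 0.
w = [0.3, 0.2, 0.1]
d = [0.15, 0.15, 0.1]
n = len(w)

def gl_gate(x):
    # P(X_1 x X_2 x ... x X_n) = sum_i h(X_i) w_i + sum_i g(X_i) d_i
    return sum(xi * wi for xi, wi in zip(x, w)) + \
           sum((1 - xi) * di for xi, di in zip(x, d))

# Arrangement sum over all 2^n configurations equals 2^(n-1),
# so the diagnostic condition holds.
s_omega = sum(gl_gate(x) for x in product([0, 1], repeat=n))
print(round(s_omega, 9), 2**(n - 1))
```

Swapping in any other pair of probability mass functions for h_i and g_i leaves the arrangement sum unchanged, since only h_i(1) + h_i(0) = 1 and g_i(1) + g_i(0) = 1 enter the computation.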

According to Millán and Pérez-de-la-Cruz [4], a hypothesis can have multiple evidences, as seen in Figure 7. This multi-evidence diagnostic relationship is the opposite of the aforementioned multi-hypothesis diagnostic relationship.

Figure 7 depicts the multi-evidence diagnostic network called M-E-D network in which there are m evidences D1, D2,…, Dm and one hypothesis Y. Note that Y has uniform distribution.

In the simplest case, where all evidences are binary, the joint probability of the M-E-D network is

$$P(Y, D\_1, D\_2, \ldots, D\_m) = P(Y) \prod\_{j=1}^m P(D\_j | Y) = P(Y) P(D\_1, D\_2, \ldots, D\_m | Y)$$

The product $\prod_{j=1}^m P(D_j|Y)$ is denoted as the likelihood function as follows

$$P(D\_1, D\_2, \dots, D\_m | Y) = \prod\_{j=1}^m P(D\_j | Y)$$

The posterior probability P(Y | D1, D2,…, Dm) given uniform distribution of Y is


Figure 7. Diagnostic relationship with multiple evidences (M-E-D network).

Figure 8. M-HE-D network.


$$P(Y|D\_1, D\_2, \ldots, D\_m) = \frac{P(Y, D\_1, D\_2, \ldots, D\_m)}{P(Y = 1, D\_1, D\_2, \ldots, D\_m) + P(Y = 0, D\_1, D\_2, \ldots, D\_m)}$$

$$= \frac{1}{\prod\_{j=1}^m P(D\_j | Y = 1) + \prod\_{j=1}^m P(D\_j | Y = 0)} \* P(D\_1, D\_2, \ldots, D\_m | Y)$$

The possible transformation coefficient is

$$\frac{1}{k} = \prod\_{j=1}^{m} P(D\_j | Y = 1) + \prod\_{j=1}^{m} P(D\_j | Y = 0)$$

The M-E-D network will satisfy the diagnostic condition if k = 1, because all hypotheses and evidences are binary; this requires that the following equation, specified by Eq. (37), have 2m real roots P(D_j|Y) for all m ≥ 2.

$$\prod\_{j=1}^{m} P(D\_j | Y = 1) + \prod\_{j=1}^{m} P(D\_j | Y = 0) = 1\tag{37}$$

Eq. (37) has no real root given m = 2, according to the following proof. Suppose Eq. (37) had 4 real roots as follows:

$$a\_1 = P(D\_1 = 1 | Y = 1)$$

$$a\_2 = P(D\_2 = 1 | Y = 1)$$

$$b\_1 = P(D\_1 = 1 | Y = 0)$$

$$b\_2 = P(D\_2 = 1 | Y = 0)$$

From Eq. (37), it holds

$$\begin{cases} a\_1a\_2 + b\_1b\_2 = 1\\ a\_1(1-a\_2) + b\_1b\_2 = 1\\ (1-a\_1)a\_2 + b\_1b\_2 = 1\\ a\_1a\_2 + b\_1(1-b\_2) = 1\\ a\_1a\_2 + (1-b\_1)b\_2 = 1 \end{cases} \Rightarrow \begin{cases} a\_1 = a\_2\\ b\_1 = b\_2\\ a\_1^2 + b\_1^2 = 1\\ a\_1 + 2b\_1^2 = 2\\ b\_1 + 2a\_1^2 = 2 \end{cases} \Leftrightarrow \begin{cases} a\_1 = a\_2 = 0\\ b\_1 = b\_2\\ a\_1^2 + b\_1^2 = 1\\ b\_1 = 2 \end{cases} \text{ or } \begin{cases} a\_1 = a\_2 = 0.5\\ b\_1 = b\_2\\ a\_1^2 + b\_1^2 = 1\\ b\_1 = 1.5 \end{cases}$$

The final equations lead to a contradiction (b_1 = 2 or b_1 = 1.5, both incompatible with $a_1^2 + b_1^2 = 1$), and so it is impossible to apply the sufficient diagnostic proposition to the M-E-D network. Such a proposition is only usable for a one-evidence network. Moreover, X-gate inference absorbs many sources and produces one targeted result, whereas the M-E-D network essentially splits one source into many results. It is impossible to model the M-E-D network by X-gates. A potential solution is to group the evidences D_1, D_2,…, D_m into one representative evidence D which in turn depends on the hypothesis Y, but this solution is inaccurate in specifying conditional probabilities because the directions of dependencies become inconsistent (relationships from D_j to D and from Y to D), unless all D_j(s) are removed and D becomes a vector. However, an evidence vector does not simplify the hazardous problem; it merely turns the current problem into a new one.
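The contradiction can also be supported by brute force; this sketch scans a grid of candidate probabilities for m = 2 and finds no quadruple satisfying all five equations of the system:

```python
from itertools import product

# Grid search supporting the claim that the system derived from Eq. (37)
# has no solution in probabilities for m = 2: no (a1, a2, b1, b2) in
# [0, 1]^4 (here on a 0.05 grid) satisfies all five equations at once.
grid = [i / 20 for i in range(21)]
eps = 1e-9
solutions = [
    (a1, a2, b1, b2)
    for a1, a2, b1, b2 in product(grid, repeat=4)
    if abs(a1*a2 + b1*b2 - 1) < eps
    and abs(a1*(1 - a2) + b1*b2 - 1) < eps
    and abs((1 - a1)*a2 + b1*b2 - 1) < eps
    and abs(a1*a2 + b1*(1 - b2) - 1) < eps
    and abs(a1*a2 + (1 - b1)*b2 - 1) < eps
]
print(len(solutions))   # 0: no grid point satisfies all configurations
```

A grid is of course not a proof, but it agrees with the algebraic argument: the first three equations force a_1a_2 ∈ {0, 0.25}, the last three force b_1b_2 ∈ {0, 0.25}, so a_1a_2 + b_1b_2 can never reach 1.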

Another solution is to reverse the direction of the relationship, making the hypothesis dependent on the evidences so as to take advantage of X-gate inference as usual. However, this reversion violates the viewpoint of this research, in which the diagnostic relationship must run from hypothesis to evidence. In other words, we would have to change the viewpoint.

Another solution is based on a so-called partial diagnostic condition, a loose case of the diagnostic condition for the M-E-D network, which is defined as follows

$$P(Y|D\_j) = kP(D\_j|Y)$$

where k is constant with regard to Dj. The joint probability is

$$P(Y, D\_1, D\_2, \dots, D\_m) = P(Y) \prod\_{j=1}^m P(D\_j | Y)$$

The M-E-D network satisfies the partial diagnostic condition. In fact, given that all variables are binary, we have


$$P(Y|D\_j) = \frac{\sum\_{\Psi \setminus \{Y, D\_j\}} P(Y, D\_1, D\_2, \dots, D\_m)}{\sum\_{\Psi \setminus \{D\_j\}} P(Y, D\_1, D\_2, \dots, D\_m)}$$

(Let Ψ = {Y, D_1, D_2,…, D_m}.)


$$=\frac{P(D\_{\rangle}|Y)\prod\_{k=1,k\neq j}^{m}\left(\sum\_{D\_{k}}P(D\_{k}|Y)\right)}{\prod\_{k=1,k\neq j}^{m}\left(\sum\_{D\_{k}}P(D\_{k}|Y=1)\right)+\prod\_{k=1,k\neq j}^{m}\left(\sum\_{D\_{k}}P(D\_{k}|Y=0)\right)}$$

(Due to uniform distribution of Y)

$$=\frac{P(D\_{\rangle}|Y)\prod\_{k=1, k\neq j}^{m}1}{\prod\_{k=1, k\neq j}^{m}1 + \prod\_{k=1, k\neq j}^{m}1} = \frac{1}{2}P(D\_{\rangle}|Y)$$

$$P\left(\text{Due to }\sum\_{D\_{k}}P(D\_{k}|Y) = P(D\_{k}=0|Y) + P(D\_{k}=1|Y) = 1\right)$$

Partial diagnostic condition expresses a different viewpoint. It is not an optimal solution because we cannot test a disease based on only one symptom while ignoring other obvious symptoms, for example. The equality P(Y|Dj) = 0.5P(Dj|Y) indicates the accuracy is decreased two times. However, Bayesian network provides inference mechanism based on personal belief. It is subjective. You can use partial diagnostic condition if you think that such condition is appropriate to your application.
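The marginalization at the heart of this condition can be replayed by brute force. The sketch below is my own illustration with arbitrary CPT values: it enumerates the joint P(Y, D1,…, Dm) = P(Y)∏j P(Dj|Y) of a binary M-E-D network with uniform Y and checks that summing out every evidence except Dj leaves P(Y, Dj) = ½P(Dj|Y).

```python
import itertools

# Binary M-E-D network: hypothesis Y with uniform prior and m evidences
# D1..Dm, each depending only on Y. The CPT values are arbitrary examples.
m = 3
p_y = {0: 0.5, 1: 0.5}
p_d1 = [{0: 0.2, 1: 0.7}, {0: 0.4, 1: 0.9}, {0: 0.1, 1: 0.6}]  # P(Dj=1|Y)

def p_d_given_y(j, d, y):
    """P(Dj = d | Y = y) for binary Dj."""
    return p_d1[j][y] if d == 1 else 1.0 - p_d1[j][y]

def joint(y, ds):
    """P(Y, D1..Dm) = P(Y) * prod_j P(Dj|Y)."""
    p = p_y[y]
    for j, d in enumerate(ds):
        p *= p_d_given_y(j, d, y)
    return p

# Summing out every evidence except Dj leaves P(Y)P(Dj|Y) = 0.5 P(Dj|Y):
# each summed-out CPT contributes a factor sum_Dk P(Dk|Y) = 1.
for j in range(m):
    for y in (0, 1):
        for d in (0, 1):
            marg = sum(joint(y, ds)
                       for ds in itertools.product((0, 1), repeat=m)
                       if ds[j] == d)
            assert abs(marg - 0.5 * p_d_given_y(j, d, y)) < 1e-12
print("P(Y, Dj) = 0.5 * P(Dj | Y) for every j, d, y")
```

Each summed-out CPT column contributes a factor $\sum_{D_k} P(D_k|Y) = 1$, which is exactly the step used in the derivation above.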

If we succeed in specifying the conditional probabilities of the M-E-D network, it is possible to define an extended network constituted of n hypotheses X1, X2,…, Xn and m evidences D1, D2,…, Dm. Such an extended network represents the multi-hypothesis multi-evidence diagnostic relationship, called the M-HE-D network. Figure 8 depicts the M-HE-D network.

The M-HE-D network is the most general case of a diagnostic network, mentioned in Ref. ([4], p. 297). Any large diagnostic BN can be constructed from M-HE-D networks, so this line of research is still open.

### 5. Conclusion

In short, relationship conversion determines conditional probabilities from logic gates that adhere to the semantics of relationships. The weak point of logic gates is the requirement that all variables be binary. In a learning context, for example, it is inconvenient for an expert to create an assessment BN in which studying exercises (evidences) are marked only 0 or 1. To lessen this weakness, numeric evidence is used to extend the capacity of a simple Bayesian network. However, combining a binary hypothesis with numeric evidence can lead to errors or biases in inference; for example, a student may get the maximum grade for an exercise while the built-in inference concludes that she/he has not fully mastered the associated learning concept (hypothesis). Therefore, I propose the sufficient diagnostic proposition to confirm that numeric evidence is adequate for complicated inference tasks in a BN, so that probabilistic reasoning based on evidence remains accurate. Applications of this research can go beyond the learning context, to any setting where probabilistic deduction subject to the constraints of semantic relationships is required. A large BN can be constituted of many simple BNs. Inference in a large BN is a hard problem, and there are many optimized algorithms for solving it. In the future, I will research effective inference methods for the special BNs constituted of the X-gate BNs mentioned in this research, because X-gate BNs have precise and useful features that we should take advantage of: their CPTs are simple in some cases, and the semantics of their relationships are mandatory in many applications. Moreover, I will try my best to study the M-E-D and M-HE-D networks more deeply, whose problems I cannot solve completely now.

The two main documents I consulted for this research are the book "Learning Bayesian Networks" [2] by Richard E. Neapolitan and the article "A Bayesian Diagnostic Algorithm for Student Modeling and its Evaluation" [4] by Eva Millán and José Luis Pérez-de-la-Cruz. In particular, the SIGMA-gate inference is based on and derived from the work of Millán and Pérez-de-la-Cruz. This research originated from my PhD research "A User Modeling System for Adaptive Learning" [12]. Other references relevant to user modeling, the overlay model, and Bayesian networks are [13–16]. Please consult these references.

### Appendices

A1. Following is the proof of Eq. (9)

$$P(A\_i = \text{ON} | \mathbf{X}\_i)$$

$$= P(A\_i = \text{ON} | \mathbf{X}\_i, I\_i = \text{ON})P(I\_i = \text{ON}) + P(A\_i = \text{ON} | \mathbf{X}\_i, I\_i = \text{OFF})P(I\_i = \text{OFF})$$

$$= 0 \ast (1 - p\_i) + P(A\_i = \text{ON} | \mathbf{X}\_i, I\_i = \text{OFF})p\_i$$

$$\text{(By applying Eq. (8))}$$

$$= p\_i P(A\_i = \text{ON} | \mathbf{X}\_i, I\_i = \text{OFF})$$

It implies

$$P(A\_i = \text{ON} | \mathbf{X}\_i = 1) = p\_i \quad \left(\text{since } P(A\_i = \text{ON} | \mathbf{X}\_i = 1, I\_i = \text{OFF}) = 1\right)$$

$$P(A\_i = \text{ON} | \mathbf{X}\_i = 0) = 0 \quad \left(\text{since } P(A\_i = \text{ON} | \mathbf{X}\_i = 0, I\_i = \text{OFF}) = 0\right)$$

$$P(A\_i = \text{OFF} | \mathbf{X}\_i = 1) = 1 - P(A\_i = \text{ON} | \mathbf{X}\_i = 1) = 1 - p\_i$$

$$P(A\_i = \text{OFF} | \mathbf{X}\_i = 0) = 1 - P(A\_i = \text{ON} | \mathbf{X}\_i = 0) = 1 \ \blacksquare$$
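The two-branch marginalization over the inhibitor $I_i$ can be replayed numerically. The sketch below is my own illustration; it assumes, per the proof, that Eq. (8) forces $A_i = \text{OFF}$ whenever $I_i = \text{ON}$, that $A_i$ copies $X_i$ when uninhibited, and an arbitrary value for $p_i$.

```python
# Noisy-gate inhibitor from the proof of Eq. (9): Ai depends on the input
# Xi and an inhibitor Ii with P(Ii = OFF) = p_i. Per Eq. (8), an active
# inhibitor forces Ai = OFF; when uninhibited, Ai simply copies Xi.
p_i = 0.8   # illustrative value

def p_a_on_given_x_i(x, i_off):
    """P(Ai = ON | Xi = x, Ii): 0 when inhibited, Xi otherwise."""
    if not i_off:                     # Ii = ON -> Ai forced OFF
        return 0.0
    return 1.0 if x == 1 else 0.0     # Ai = Xi when Ii = OFF

def p_a_on_given_x(x):
    """Marginalize the inhibitor: sum over Ii in {ON, OFF}."""
    return (p_a_on_given_x_i(x, False) * (1.0 - p_i)
            + p_a_on_given_x_i(x, True) * p_i)

assert p_a_on_given_x(1) == p_i   # P(Ai = ON | Xi = 1) = p_i
assert p_a_on_given_x(0) == 0.0   # P(Ai = ON | Xi = 0) = 0
print(p_a_on_given_x(1), p_a_on_given_x(0))
```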

A2. Following is the proof of Eq. (10)


$$P(Y|X\_1, X\_2, \dots, X\_n) = \frac{P(Y, X\_1, X\_2, \dots, X\_n)}{P(X\_1, X\_2, \dots, X\_n)}$$

(Due to Bayes' rule)

$$=\frac{\sum\_{A\_1, A\_2, \dots, A\_n} P(Y, X\_1, \dots, X\_n | A\_1, \dots, A\_n)P(A\_1, \dots, A\_n)}{P(X\_1, X\_2, \dots, X\_n)}$$

(Due to total probability rule)

$$=\sum\_{A\_1, \dots, A\_n}P(Y, X\_1, \dots, X\_n|A\_1, \dots, A\_n)\frac{P(A\_1, \dots, A\_n)}{P(X\_1, \dots, X\_n)}$$

$$=\sum\_{A\_1, \dots, A\_n}P(Y|A\_1, \dots, A\_n)P(X\_1, \dots, X\_n|A\_1, \dots, A\_n)\frac{P(A\_1, \dots, A\_n)}{P(X\_1, \dots, X\_n)}$$

(Because Y is conditionally independent of the Xi given the Ai)

$$=\sum\_{A\_1, \dots, A\_n}P(Y|A\_1, \dots, A\_n)\frac{P(X\_1, \dots, X\_n, A\_1, \dots, A\_n)}{P(X\_1, \dots, X\_n)}$$

$$=\sum\_{A\_1, \dots, A\_n}P(Y|A\_1, \dots, A\_n)P(A\_1, \dots, A\_n|X\_1, \dots, X\_n)$$

(Due to Bayes' rule)

$$=\sum\_{A\_1, \dots, A\_n}P(Y|A\_1, \dots, A\_n)\prod\_{i=1}^{n}P(A\_i|X\_1, \dots, X\_n)$$

(Because the Ai are mutually independent)

$$=\sum\_{A\_1, \dots, A\_n}P(Y|A\_1, \dots, A\_n)\prod\_{i=1}^{n}P(A\_i|X\_i)$$

(Because each Ai depends only on Xi) ■
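The conclusion of this proof, Eq. (10), can be replayed numerically: build the full joint P(X, A, Y), condition on X, and compare with the factored sum. All CPT numbers and the noisy-OR combination function below are arbitrary illustrative assumptions; the identity holds for any choice.

```python
import itertools

# Brute-force check of Eq. (10):
#   P(Y | X1..Xn) = sum_A P(Y|A) * prod_i P(Ai|Xi)
n = 3
p_x1 = [0.3, 0.6, 0.5]                        # P(Xi = 1), arbitrary
p_a1 = [[0.1, 0.9], [0.2, 0.8], [0.05, 0.7]]  # P(Ai = 1 | Xi = x)

def bern(p1, v):
    """Probability of binary value v under P(. = 1) = p1."""
    return p1 if v == 1 else 1.0 - p1

def p_y1_given_a(a):
    """An arbitrary combination function: a noisy OR of the Ai."""
    q = 1.0
    for ai in a:
        if ai:
            q *= 0.3
    return 1.0 - q

for x in itertools.product((0, 1), repeat=n):
    # right-hand side of Eq. (10)
    rhs = sum(p_y1_given_a(a)
              * bern(p_a1[0][x[0]], a[0])
              * bern(p_a1[1][x[1]], a[1])
              * bern(p_a1[2][x[2]], a[2])
              for a in itertools.product((0, 1), repeat=n))
    # left-hand side: condition the full joint P(X, A, Y) on X = x
    num = den = 0.0
    for a in itertools.product((0, 1), repeat=n):
        base = 1.0
        for i in range(n):
            base *= bern(p_x1[i], x[i]) * bern(p_a1[i][x[i]], a[i])
        num += base * p_y1_given_a(a)
        den += base
    assert abs(num / den - rhs) < 1e-12
print("Eq. (10) holds for all", 2 ** n, "configurations of X")
```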

A3. Following is the proof that the augmented X-D network (shown in Figure 5) is equivalent to the X-D network (shown in Figures 2 and 3) with regard to variables X1, X2,…, Xn, and D.

The joint probability of augmented X-D network shown in Figure 5 is

$$P(X\_1, X\_2, \dots, X\_n, Y, D) = P(D|Y)P(Y|X\_1, X\_2, \dots, X\_n)\prod\_{i=1}^{n}P(X\_i)$$

The joint probability of X-D network is

$$P(X\_1, X\_2, \dots, X\_n, D) = P(D|X\_1, X\_2, \dots, X\_n)\prod\_{i=1}^{n}P(X\_i)$$

By applying total probability rule into X-D network, we have

$$P(X\_1, X\_2, \dots, X\_n, D) = \frac{P(D, X\_1, X\_2, \dots, X\_n)}{P(X\_1, X\_2, \dots, X\_n)}\prod\_{i=1}^{n}P(X\_i)$$

(Due to Bayes' rule)

$$=\frac{\sum\_{Y}P(D, X\_1, \dots, X\_n|Y)P(Y)}{P(X\_1, \dots, X\_n)}\prod\_{i=1}^{n}P(X\_i)$$

(Due to total probability rule)

$$=\left(\sum\_{Y}P(D, X\_1, \dots, X\_n|Y)\frac{P(Y)}{P(X\_1, \dots, X\_n)}\right)\prod\_{i=1}^{n}P(X\_i)$$

$$=\left(\sum\_{Y}P(D|Y)\frac{P(X\_1, \dots, X\_n|Y)P(Y)}{P(X\_1, \dots, X\_n)}\right)\prod\_{i=1}^{n}P(X\_i)$$

(Because D is conditionally independent of all the Xi given Y)

$$=\left(\sum\_{Y}P(D|Y)\frac{P(Y, X\_1, \dots, X\_n)}{P(X\_1, \dots, X\_n)}\right)\prod\_{i=1}^{n}P(X\_i)$$

$$=\sum\_{Y}P(D|Y)P(Y|X\_1, \dots, X\_n)\prod\_{i=1}^{n}P(X\_i)$$

(Due to Bayes' rule)

$$=\sum\_{Y}P(X\_1, X\_2, \dots, X\_n, Y, D)\ \blacksquare$$

A4. Following is the proof of Eq. (29)

Given uniform distribution of Xi (s), we have

$$P(X\_1) = P(X\_2) = \dots = P(X\_n) = \frac{1}{2}$$

The joint probability becomes

$$P(\Omega, Y, D) = \frac{1}{2^n} P(Y|X\_1, X\_2, \dots, X\_n) P(D|Y)$$

where $\Omega$ denotes the set of hypotheses $\{X\_1, X\_2, \dots, X\_n\}$.

The joint probability of Xi and D is

$$\begin{aligned} P(X\_i, D) &= \sum\_{\{\Omega, Y, D\}\setminus\{X\_i, D\}} P(\Omega, Y, D)\\ &= P(X\_1 = 1, \dots, X\_i, \dots, X\_n = 1, Y = 1, D) + \dots + P(X\_1 = 0, \dots, X\_i, \dots, X\_n = 0, Y = 1, D)\\ &\quad + P(X\_1 = 1, \dots, X\_i, \dots, X\_n = 1, Y = 0, D) + \dots + P(X\_1 = 0, \dots, X\_i, \dots, X\_n = 0, Y = 0, D)\\ &= \frac{1}{2^n}\frac{D}{S}\sum\_{a} P\big(Y = 1 | a(\Omega\setminus\{X\_i\})\big) + \frac{1}{2^n}\frac{M - D}{S}\sum\_{a} P\big(Y = 0 | a(\Omega\setminus\{X\_i\})\big) \end{aligned}$$

(Due to Eq. (6))

where a ranges over the $2^{n-1}$ assignments of the variables in $\Omega\setminus\{X\_i\}$, with $X\_i$ fixed at its given value.

The marginal probability of D is

$$\begin{aligned} P(D) &= \sum\_{\{\Omega, Y\}} P(\Omega, Y, D)\\ &= P(X\_1 = 1, X\_2 = 1, \dots, X\_n = 1, Y = 1, D) + \dots + P(X\_1 = 0, X\_2 = 0, \dots, X\_n = 0, Y = 1, D)\\ &\quad + P(X\_1 = 1, X\_2 = 1, \dots, X\_n = 1, Y = 0, D) + \dots + P(X\_1 = 0, X\_2 = 0, \dots, X\_n = 0, Y = 0, D)\\ &= \frac{1}{2^n}\frac{D}{S}\sum\_{a} P\big(Y = 1 | a(\Omega)\big) + \frac{1}{2^n}\frac{M - D}{S}\sum\_{a} P\big(Y = 0 | a(\Omega)\big) \end{aligned}$$

where a now ranges over all $2^n$ assignments of $\Omega$.

By applying Table 2, the joint probability P(Xi, D) is determined as follows

$$\begin{aligned} P(X\_i, D) &= \frac{1}{2^n S}\left(D\sum\_{a} P\big(Y = 1 | a(\Omega\setminus\{X\_i\})\big) + (M - D)\sum\_{a} P\big(Y = 0 | a(\Omega\setminus\{X\_i\})\big)\right)\\ &= \frac{1}{2^n S}\left(D\sum\_{a} P\big(Y = 1 | a(\Omega\setminus\{X\_i\})\big) + (M - D)\sum\_{a}\Big(1 - P\big(Y = 1 | a(\Omega\setminus\{X\_i\})\big)\Big)\right)\\ &= \frac{1}{2^n S}\Big((2D - M)\,s(\Omega\setminus\{X\_i\}) + 2^{n-1}(M - D)\Big) \end{aligned}$$

Similarly, the marginal probability P(D) is

$$P(D) = \frac{1}{2^n S} \left( (2D - M)s(\Omega) + 2^n (M - D) \right) \blacksquare$$
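The closed form for P(D) can be spot-checked numerically. The sketch below is my own illustration; it assumes, as the factors in the derivation suggest, that P(D|Y = 1) = D/S and P(D|Y = 0) = (M − D)/S for the numeric evidence value D, and it uses a random gate table P(Y = 1|X1,…, Xn).

```python
import itertools
import random

# Check P(D) = ((2D - M) s(Omega) + 2^n (M - D)) / (2^n S) against direct
# summation of the joint, assuming P(Xi = 1) = 1/2, P(D|Y=1) = D/S and
# P(D|Y=0) = (M - D)/S (an assumption suggested by the derivation above).
random.seed(7)
n, D, M, S = 3, 2.0, 3.0, 6.0
# arbitrary gate table P(Y = 1 | X1..Xn)
p_y1 = {x: random.random() for x in itertools.product((0, 1), repeat=n)}

s_omega = sum(p_y1.values())   # s(Omega) = sum over assignments of P(Y=1|.)

direct = sum((1.0 / 2 ** n)
             * (p_y1[x] * (D / S) + (1.0 - p_y1[x]) * ((M - D) / S))
             for x in itertools.product((0, 1), repeat=n))

closed = ((2 * D - M) * s_omega + 2 ** n * (M - D)) / (2 ** n * S)
assert abs(direct - closed) < 1e-12
print(round(direct, 6))
```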

### Author details

#### Loc Nguyen

Address all correspondence to: ng\_phloc@yahoo.com

Sunflower Soft Company, An Giang, Vietnam

### References


[1] Wikipedia. Logic gate [Internet]. Wikimedia Foundation; 2016. Available from: https://en.wikipedia.org/wiki/Logic\_gate [Accessed: June 4, 2016]

[2] Neapolitan RE. Learning Bayesian Networks. Upper Saddle River, New Jersey: Prentice Hall; 2003. p. 674

[3] Díez FJ, Druzdzel MJ. Canonical Probabilistic Models. Madrid: Research Centre on Intelligent Decision-Support Systems; 2007

[4] Millán E, Pérez-de-la-Cruz JL. A Bayesian diagnostic algorithm for student modeling and its evaluation. User Modeling and User-Adapted Interaction. 2002;12(2-3):281–330

[5] Wikipedia. Factor graph [Internet]. Wikimedia Foundation; 2015. Available from: https://en.wikipedia.org/wiki/Factor\_graph [Accessed: February 8, 2017]

[6] Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory. 2001;47(2):498–519

[7] Pearl J. Fusion, propagation, and structuring in belief networks. Artificial Intelligence. 1986;29(3):241–288

[8] Millán E, Loboda T, Pérez-de-la-Cruz JL. Bayesian networks for student model engineering. Computers & Education. 2010;55(4):1663–1683

[9] Nguyen L. Theorem of SIGMA-gate inference in Bayesian network. Wulfenia Journal. 2016;23(3):280–289


**Applications of Bayesian Inference in Life Sciences**

### **Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer Biomarker Modeling**

Hamid El Maroufy, El Houcine Hibbah, Abdelmajid Zyad and Taib Ziad

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70053

### Abstract

In this work, a first-order autoregressive hidden Markov model (AR(1)HMM) is proposed. It is a suitable model to characterize a marker of breast cancer disease progression, essentially progression that follows a reaction to treatment or arises from natural developments. The model supposes we have observations that increase or decrease in relation to a hidden phenomenon. We would like to discover whether the information in those observations can let us learn about the progression of the phenomenon and permit us to evaluate the transitions between its states (supposed discrete here). The hidden states governed by the Markovian process are the disease stages, and the marker observations are the dependent observations. The parameters of the autoregressive model are selected at the first level according to a Markov process, and at the second level, the next observation is generated from a standard first-order autoregressive model (unlike other models, which consider the successive observations to be independent). A Markov Chain Monte Carlo (MCMC) method is used for parameter estimation, where we develop the posterior density for each parameter and use a joint estimation of the hidden states, or block update of the states.

Keywords: autoregressive hidden Markov model, breast cancer progression marker, Gibbs sampler, hidden states joint estimation, Markov Chain Monte Carlo

### 1. Introduction

The main motivation behind this work is to characterize progression in breast cancer. In fact, disease progression cannot be assessed correctly without the use of biomarkers, which would

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


effectively monitor the evolution of the patient health state, and this is the case for breast cancer. The major challenge in this matter for researchers and clinicians is to unravel the stage of the disease, so as to tailor the treatment for each patient and to monitor the response of a patient to a treatment.

Currently, studies have shown that there is a correlation between the levels of certain markers, such as cancer antigen CA15-3, carcinoembryonic antigen (CEA), and serum HER2 Neu, and the stage of the disease [1]. This gives an opportunity to use a hidden Markov model (HMM) to predict the stage of the disease based on biomarker data and to address the effectiveness of treatments through their influence on the transition of the cancer from one state to another. In an HMM, we have two constituents: the hidden Markovian process, suitable to represent the breast cancer stage, and the observation process, given by the biomarker data. In this way, we can learn about the disease transition rates and how it progresses, for example, from primary breast cancer to an advanced cancer stage.

Indeed, HMM is a useful tool for tackling numerous concrete problems in many fields; some possible applications of HMM are in speech processing [2], biology [3], disease progression [4], economics [5, 6], and gene expression [7]. For a complete review of HMM, the reader is referred to Zucchini and MacDonald [8], in which properties and definitions of HMM are presented in a plausible way, with both classical estimation by the maximum likelihood method and the expectation maximization (EM) algorithm and the newer Bayesian inference addressed.

The model we consider here is a variation of the regular hidden Markov model, since we use extensions to incorporate dependence among successive observations, suggesting autoregressive dependence among continuous observations. Consequently, we have relaxed the conditional independence assumption of a standard HMM, because we would like to add some dynamics to the patient's disease progression and because, in reality, the current biomarker observation depends on the past one. In fact, the autoregressive assumption in HMM has shown its advantage over the regular HMM, which cannot capture the strong dependence between successive observations (e.g., Ref. [9]). A model similar to ours can be found in Ref. [10]. This kind of model, first proposed in Ref. [11] to describe econometric time series, is a generalization of both HMMs and autoregressive models; it is effective in representing multiple heterogeneous dynamics, such as disease progression dynamics, and can even be generalized to regime-switching ARMA models as in Ref. [12].

Moreover, our model can also be viewed as an extension of the multivariate double-chain Markov model (DCMM) developed in Ref. [13], where there are two discrete first-order Markov chains: the first Markov chain is observed and the second is hidden. In contrast to this DCMM, our multivariate first-order autoregressive hidden Markov model (MAR(1)HMM) has continuous observations, where each observation, conditional on the hidden process, depends on the previous observation according to a first-order autoregressive process. This dynamic is promising for continuously observed disease biomarkers.

Parameter estimation is very challenging for models of the HMM family, since the likelihood is most of the time not available in closed form. Thus, we call for a Markov Chain Monte Carlo (MCMC) procedure instead of a maximum likelihood-based approach. This choice arises from the fact that Bayesian analysis uses prior knowledge about the process being measured, and it allows direct probability statements and an approximation of the posterior distributions of the parameters. In the maximum likelihood approach, by contrast, we cannot declare priors or obtain exact distributions for the parameters when the likelihood is intractable or when we have missing data (e.g., Refs. [14–16]).

Since the realization of an HMM includes two separate entities, the parameters and the hidden states, the Bayesian computation is carried out after augmenting the likelihood with the missing hidden states [17]. The hidden states are sampled using a Gibbs sampler that adopts a joint estimation of the hidden states, or block update of the states (instead of a single update of each state separately), by means of a forward filtering/backward smoothing algorithm. Given the hidden states, we can compute the autoregressive parameters and the transition probabilities of the Markov chain by Gibbs sampling from their posterior densities, after specifying conjugate priors for the parameters. Hence, the MCMC algorithm alternates between simulating the hidden states and the parameters. Finally, we can obtain posterior statistics such as means, standard deviations, and confidence intervals after assessing the convergence of the MCMC algorithm.
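As a concrete sketch of this block update, the routine below draws x0,…, xT jointly by forward filtering followed by backward sampling, the draw-based counterpart of the forward filtering/backward smoothing pass described above. The two-state Gaussian setting and every numeric value are illustrative assumptions, not estimates from the chapter.

```python
import math
import random

random.seed(0)

def ffbs(y, pi0, Pi, emit):
    """Draw x_0..x_T jointly from P(x | y, theta).
    pi0: initial distribution, Pi: transition matrix,
    emit(t, k) = P(y_t | X_t = k)."""
    a, T = len(pi0), len(y) - 1
    alpha = []                                   # filtered P(X_t = k | y_0..y_t)
    cur = [pi0[k] * emit(0, k) for k in range(a)]
    alpha.append([c / sum(cur) for c in cur])
    for t in range(1, T + 1):
        pred = [sum(alpha[t - 1][j] * Pi[j][k] for j in range(a))
                for k in range(a)]               # one-step-ahead prediction
        cur = [pred[k] * emit(t, k) for k in range(a)]
        alpha.append([c / sum(cur) for c in cur])
    # backward sampling: x_T ~ alpha[T], then x_t | x_{t+1}, y_0..y_t
    x = [0] * (T + 1)
    x[T] = random.choices(range(a), weights=alpha[T])[0]
    for t in range(T - 1, -1, -1):
        w = [alpha[t][k] * Pi[k][x[t + 1]] for k in range(a)]
        x[t] = random.choices(range(a), weights=w)[0]
    return x

# toy run: two well-separated hidden states with Gaussian emissions
y = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(4, 1) for _ in range(30)]
means = [0.0, 4.0]
emit = lambda t, k: math.exp(-0.5 * (y[t] - means[k]) ** 2)
Pi = [[0.95, 0.05], [0.05, 0.95]]
x = ffbs(y, [0.5, 0.5], Pi, emit)
print(sum(x[:30]), sum(x[30:]))
```

Inside a full Gibbs sampler, this state draw would alternate with draws of the transition and emission parameters from their conditional posteriors.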

This chapter is organized as follows: after a preliminary on HMM, a description of the model is given in Section 3. In Section 4, we give the Bayesian estimation of the parameters and the hidden states and provide the details of the MCMC algorithm, before presenting the results of a simulation study in Section 5; we finish with a conclusion.

### 2. Preliminary


Since the model suggested is of the HMM type, we will describe HMM in more detail: an HMM is a stochastic process Xt f g ,Yt <sup>T</sup> <sup>t</sup>¼<sup>0</sup>, where f g Xt <sup>T</sup> <sup>t</sup>¼<sup>0</sup> is a hidden Markov chain (unobservable) and f g Yt <sup>T</sup> <sup>t</sup>¼<sup>0</sup> is a sequence of observable independent random variables such that Yt depends only on Xt for the time <sup>t</sup> = 0,1,…T. Here the process f g Xt <sup>T</sup> <sup>t</sup>¼<sup>0</sup> evolves independently of f g Yt <sup>T</sup> <sup>t</sup>¼<sup>0</sup> and is supposed to be a homogeneous finite Markov chain with probability transition matrix Π of dimension a � a, where a indicates the number of the hidden states and Π<sup>0</sup> = (Π01,…,Π0a) is the initial state distribution.

We denote the probability density function of $Y_t = y_t$ given $X_t = k$, for $k \in \{1, \ldots, a\}$, by $P_{x_t}(y_t, \theta_k)$, where $\theta_k$ refers to the parameters of $P$ when $X_t = k$. We suppose further that the processes $Y_t|X_t$ and $Y_{t'}|X_{t'}$ are independent for $t \neq t'$. Let $\Theta = (\theta_1, \ldots, \theta_a)$ and $\theta = (\Pi_0, \Pi, \Theta)$; then, the HMM can be described as follows. First, the likelihood of the observations and the hidden states can be decomposed as

$$P(y_0, \ldots, y_T, x_0, \ldots, x_T, \theta) = P(y_0, \ldots, y_T \,|\, x_0, \ldots, x_T, \theta)\, P(x_0, \ldots, x_T, \theta).$$

Since $\{X_t\}_{t=0}^{T}$ is a Markov chain, $P(x_0, \ldots, x_T, \theta) = \Pi_0(x_0) \prod_{t=1}^{T} \Pi(x_t | x_{t-1})$. Under the conditional independence of the observations given the hidden states, $P(y_0, \ldots, y_T \,|\, x_0, \ldots, x_T, \theta) = P_{x_0}(y_0 | \theta_{x_0}) \prod_{t=1}^{T} P_{x_t}(y_t | \theta_{x_t})$. Consequently, the likelihood function for the hidden states and the observations is given by

$$P(y_0, y_1, \ldots, y_T, x_0, x_1, \ldots, x_T, \theta) = \Pi_0(x_0)\, P_{x_0}(y_0 | \theta_{x_0}) \prod_{t=1}^{T} \Pi(x_t | x_{t-1})\, P_{x_t}(y_t | \theta_{x_t}).$$
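This factorization can be evaluated directly. The sketch below assumes Gaussian emissions purely for illustration; the function and data names are hypothetical, not from the chapter:

```python
import numpy as np

def norm_logpdf(v, m, s):
    # Log-density of N(m, s^2) at v
    return -0.5 * ((v - m) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

def hmm_joint_loglik(y, x, pi0, Pi, mu, sigma):
    """log P(y_0..y_T, x_0..x_T | theta) for an HMM with Gaussian emissions
    Y_t | X_t = k ~ N(mu[k], sigma^2), following the decomposition
    Pi_0(x_0) P(y_0|x_0) * prod_t Pi(x_t|x_{t-1}) P(y_t|x_t)."""
    ll = np.log(pi0[x[0]]) + norm_logpdf(y[0], mu[x[0]], sigma)
    for t in range(1, len(y)):
        ll += np.log(Pi[x[t - 1], x[t]])          # Markov transition term
        ll += norm_logpdf(y[t], mu[x[t]], sigma)  # emission term
    return ll

pi0 = np.array([0.6, 0.4])
Pi = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.0, 3.0])
y = np.array([0.1, 0.2, 2.9, 3.1])
x = np.array([0, 0, 1, 1])
print(hmm_joint_loglik(y, x, pi0, Pi, mu, 1.0))
```

A state path matching the observations (low values in state 0, high in state 1) yields a larger joint log-likelihood than a mismatched one.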

### 3. Model description and specification

The MAR(1)HMM model we consider in this work is a hidden Markov model in which, conditionally on the latent states, the observations are not independent, as they would be in a regular hidden Markov model. Instead, the current observation is allowed to depend on the previous observation through a first-order autoregressive model. As in an HMM, the latent states evolve according to a discrete first-order time-homogeneous Markov chain. We consider data on $n$ continuous random variables observed over time, each of potentially different length; i.e., for each individual $i = 1, 2, \ldots, n$, we observe a vector $y_{i,\cdot} = (y_{i,u_i}, \ldots, y_{i,m_i})^T$, with $u_i < m_i$.

Define $u_0 = \min_{1 \le i \le n} u_i$ and $M = \max_{1 \le i \le n} m_i$, and note that the times $u_i$ and $m_i$ may vary over the entire observation period from $u_0$ to $M$, with the restriction that $m_i - u_i \ge 1$, for $i = 1, 2, \ldots, n$.

We assume, for $i = 1, 2, \ldots, n$ and integer times $t = u_i, \ldots, m_i$, that the random variable $Y_{i,t}$, taking nonnegative values, depends only on the state $X_t$ and the previous observation $Y_{i,t-1}$; based on the model developed by Farcomeni and Arima [10], we get the following model:

$$\left.Y_{i,t}\right|_{X_t = x_t} = \beta^{(x_t)}\, Y_{i,t-1} + \mu^{(x_t)} + \varepsilon_{i,t}.\tag{1}$$

The choice of the autoregressive part of the model is motivated by the fact that successive biomarker observations are usually correlated in many diseases, unlike the hypothesis of independence between observations in HMMs.

We interpret $x$ as the vector of hidden health states of the patients; in the case of breast cancer, those states would be, for example, localized or advanced metastatic breast cancer, while $y$ is the vector of biomarkers observed and measured for the patients. The $\varepsilon_{i,t}$ are normal variables with mean 0 and variance $\sigma^2$ such that $\varepsilon_{i,t}$ and $\varepsilon_{i',t'}$ are uncorrelated for $(i,t) \neq (i',t')$.

The parameters $\beta^{(x_t)}$ and $\mu^{(x_t)}$ take values in $\mathbb{R}$ for each hidden state, and $\sigma^2 \in \mathbb{R}^{+}$.
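A minimal simulation of Eq. (1) clarifies the interplay of the two dynamics: a hidden Markov chain selecting the regime, and a state-switching AR(1) generating the observations. All parameter values and names below are illustrative, not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mar1hmm(T, r, Pi, mu, beta, sigma):
    """Simulate one individual's trajectory from the MAR(1)HMM of Eq. (1):
    Y_t = beta[x_t] * Y_{t-1} + mu[x_t] + eps_t, with eps_t ~ N(0, sigma^2)
    and x_t a homogeneous Markov chain with transition matrix Pi."""
    a = len(r)
    x = np.empty(T, dtype=int)
    y = np.empty(T)
    x[0] = rng.choice(a, p=r)                   # initial state drawn from r
    y[0] = mu[x[0]] + sigma * rng.standard_normal()
    for t in range(1, T):
        x[t] = rng.choice(a, p=Pi[x[t - 1]])    # Markov step
        y[t] = beta[x[t]] * y[t - 1] + mu[x[t]] + sigma * rng.standard_normal()
    return x, y

r = np.array([0.5, 0.5])
Pi = np.array([[0.95, 0.05], [0.10, 0.90]])
mu = np.array([0.0, 2.0])     # state-specific intercepts
beta = np.array([0.3, 0.7])   # state-specific AR(1) coefficients
x, y = simulate_mar1hmm(200, r, Pi, mu, beta, 0.5)
print(x[:10], y[:5])
```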

Similar to Ref. [13], the transition matrix $\Pi$ of the Markov chain is time homogeneous with dimension $a \times a$, where $a$ is the number of hidden states, and $\Pi = (\Pi_{gh},\ g = 1, \ldots, a;\ h = 1, \ldots, a)$, where $\Pi_{gh} = P(X_t = h \,|\, X_{t-1} = g)$, for $g, h = 1, 2, \ldots, a$ and $t = u_0+1, \ldots, M$. We let the first state $X_{u_0}$ be selected from a discrete distribution with vector of probabilities $r = (r_1, \ldots, r_a)$. We also consider the time of initial observation $u_i$, the initial observation $y_{i,u_i}$, and the number of consecutive observed time points $m_i - u_i + 1$. Let $\mu = (\mu^{(1)}, \ldots, \mu^{(a)})$, $\beta = (\beta^{(1)}, \ldots, \beta^{(a)})$, and $\theta = (\mu, \beta, \sigma^2, r, \Pi)$ be the set of all parameters in the model. We suppose that the individuals, i.e., the $Y_{i,t}$, behave independently conditionally on $X$. Therefore, for $i = 1, \ldots, n$,

$$P(y_{i,\cdot} \,|\, y_{i,u_i}, x, \theta) = \prod_{t=u_i+1}^{m_i} P(y_{i,t} \,|\, y_{i,t-1}, x_t, \theta) \quad\text{and}\quad P(x \,|\, \theta) = P(x_{u_0}) \prod_{t=u_0+1}^{M} P(x_t \,|\, x_{t-1}, \Pi),$$

where $P(x_t \,|\, x_{t-1}, \Pi) = P(X_t = x_t \,|\, X_{t-1} = x_{t-1}, \Pi) = \Pi_{x_{t-1}, x_t}$. Then, the likelihood density for the observations of all individuals $y = (y_1, \ldots, y_n)$ given the vector of first observations $y_0 = (y_{1,u_1}, \ldots, y_{n,u_n})$, $x$, and $\theta$ is

$$P(y \,|\, y_0, x, \theta) = \prod_{i=1}^{n} P(y_{i,\cdot} \,|\, y_{i,u_i}, x, \theta),$$

due to the conditional independence of the $y_i$ given $x$ and $\theta$. The joint mass of each $y_{i,\cdot}$ and $x$ given $y_{i,u_i}$ and $\theta$ can be written as

$$P(y_{i,\cdot}, x \,|\, y_{i,u_i}, \theta) = P(y_{i,\cdot} \,|\, y_{i,u_i}, x, \theta) \times P(x \,|\, y_{i,u_i}, \theta).$$

Using the Markov property of the hidden process, we have after simplification

$$P(x \,|\, y_{i,u_i}, \theta) \propto P(y_{i,u_i} \,|\, x, \theta)\, P(x \,|\, \theta) = P(y_{i,u_i} \,|\, x_{u_i}, \theta)\, r_{x_{u_0}}\, \Pi_{x_{u_0}, x_{u_0+1}} \times \cdots \times \Pi_{x_{M-1}, x_M}.$$

In addition, $P(y_{i,\cdot} \,|\, y_{i,u_i}, x, \theta) = \prod_{t=u_i+1}^{m_i} P(y_{i,t} \,|\, y_{i,t-1}, x, \theta)$, and consequently,


$$P(y_{i,\cdot}, x \,|\, y_{i,u_i}, \theta) \propto r_{x_{u_0}}\, P(y_{i,u_i} \,|\, x_{u_i}, \theta) \prod_{t=u_0+1}^{M} \Pi_{x_{t-1}, x_t} \prod_{t=u_i+1}^{m_i} P(y_{i,t} \,|\, y_{i,t-1}, x, \theta).$$

Finally, under the hypothesis of a normal error distribution for the autoregressive part of the model (Eq. (1)) and the Chapman-Kolmogorov property, the joint distribution of $y_{i,\cdot}$ and $x$ given $y_{i,u_i}$ and $\theta$ can be simplified to:

$$\begin{split} P(y_{i,\cdot}, x \,|\, y_{i,u_i}, \theta) &\propto P(y_{i,u_i} \,|\, x_{u_i}, \theta) \prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)} \prod_{t=u_0+1}^{M} \prod_{g=1}^{a} \prod_{h=1}^{a} \Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g,h)} \\ &\quad \times \prod_{t=u_i+1}^{m_i} \prod_{h=1}^{a} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - \mu^{(h)} - \beta^{(h)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(h)}, \end{split}$$

where $\phi$ denotes the density of the standard normal distribution $\mathcal{N}(0,1)$ and $\chi_A(x)$ is the usual indicator function of a set $A$. Finally, the joint distribution of $y$ and $x$ has the following form:

$$\begin{split} P(y, x \,|\, y_0, \theta) &\propto \prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)} \prod_{t=u_0+1}^{M} \prod_{g=1}^{a} \prod_{h=1}^{a} \Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g,h)} \times \prod_{i=1}^{n} \prod_{l=1}^{a} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,u_i} - \mu^{(l)}}{\sigma}\right)\right]^{\chi_{\{x_{u_i}\}}(l)} \\ &\quad \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \prod_{h=1}^{a} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - \mu^{(h)} - \beta^{(h)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(h)}. \end{split} \tag{2}$$

### 4. Bayesian estimation of the model parameters

We will use a Bayesian approach to estimate the model parameters. Inference in the Bayesian framework is obtained through the posterior density, which is proportional to the prior multiplied by the likelihood. The posterior distribution for our model, as in most cases, cannot be derived analytically, and we will approximate it through MCMC methods specifically designed for working with the likelihood augmented with the hidden states. In fact, MCMC methods start by specifying the prior density $\Pi(\theta)$ for the parameters. Since the data $y$ are available, the general sampling methods work recursively by alternating between simulating from the full conditional distributions of $x$ given $y$ and $\theta$, and of $\theta$ given $x$ and $y$.

#### 4.1. Prior distributions

Under the assumption of independence between the parameters $\theta = (\mu, \beta, \sigma^2, r, \Pi)$, the prior density can be written as $P(\theta) = P(r)\,P(\Pi)\,P(\mu)\,P(\beta)\,P(\sigma^2)$. The vector $r$ contains the parameters of a multinomial distribution; hence, the natural choice of prior is a Dirichlet distribution

$$r \sim \mathcal{D}(\alpha_{01}, \ldots, \alpha_{0a}).$$

Moreover, $\sum_{j=1}^{a} \Pi_{ij} = 1$, and we assume that $\Pi_i \sim \mathcal{D}(\delta_{i1}, \ldots, \delta_{ia})$ for each row $i$

of the transition matrix. This choice of the Dirichlet prior can even be the default $\mathcal{D}(1, \ldots, 1)$, as recently discussed in Ref. [18]. In fact, a Dirichlet prior is justified because the posterior density of each row of the transition matrix is proportional to the density of a Dirichlet distribution; hence, choosing a Dirichlet prior yields a Dirichlet posterior. This can be justified as follows for a given set of parameters $\lambda = (\lambda_1, \ldots, \lambda_a)$ from a discrete or multinomial density:

$$\pi(x_1, \ldots, x_a, \lambda_1, \ldots, \lambda_a) = \frac{n!}{x_1! \cdots x_a!}\, \lambda_1^{x_1} \cdots \lambda_a^{x_a} \quad\text{for the nonnegative integers } x_1, \ldots, x_a, \text{ with } \sum_{i=1}^{a} x_i = n.$$

This probability mass function can be expressed, using the gamma function Γ, as

$$\pi(x_1, \ldots, x_a, \lambda_1, \ldots, \lambda_a) = \frac{\Gamma\!\left(\sum_{i=1}^{a} x_i + 1\right)}{\prod_{i=1}^{a} \Gamma(x_i + 1)} \prod_{i=1}^{a} \lambda_i^{x_i}. \quad\text{This form shows its resemblance to the Dirichlet}$$

distribution, and by starting from the prior $\lambda \sim \mathcal{D}(\alpha_1, \ldots, \alpha_a)$, the posterior is

$$P(\lambda \,|\, x) \propto P(\lambda)\, P(x \,|\, \lambda) \propto \prod_i \lambda_i^{x_i} \prod_i \lambda_i^{\alpha_i - 1} = \prod_i \lambda_i^{x_i + \alpha_i - 1} \propto \mathcal{D}(x_1 + \alpha_1, \ldots, x_a + \alpha_a).$$
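This Dirichlet-multinomial conjugacy is easy to verify numerically. A small sketch with hypothetical counts, using NumPy's Dirichlet sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

# Conjugacy check: with multinomial counts x and prior D(alpha), the
# posterior is D(x + alpha), whose mean is (x + alpha) / sum(x + alpha).
alpha = np.array([1.0, 1.0, 1.0])           # default D(1,...,1) prior
lam_true = np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(10_000, lam_true)  # observed x_1,...,x_a
post_mean = (counts + alpha) / (counts + alpha).sum()
draws = rng.dirichlet(counts + alpha, size=5000)
print(post_mean, draws.mean(axis=0))        # both close to lam_true
```

With many observations the posterior mean is dominated by the counts, so it recovers the data-generating proportions regardless of the (weak) prior.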

Furthermore, concerning the priors for the parameters of the autoregressive model, we suppose, for $h = 1, \ldots, a$: $\mu^{(h)} \sim \mathcal{N}(\alpha_h, \tau_h)$, $\beta^{(h)} \sim \mathcal{N}(b_h, c_h)$, and an inverse gamma (IG) prior $\sigma^2 \sim \mathcal{IG}(\varepsilon, \zeta)$. Here $\alpha_h, \tau_h, b_h, c_h, \varepsilon, \zeta$ are hyperparameters to be specified. For more details on Bayesian inference and prior selection in HMMs, the reader is referred to Ref. [19]. In our case, the prior distributions for the autoregressive parameters were proposed in Ref. [20] for a mixture autoregressive model, which points out that they are conventional prior choices for mixture models.

#### 4.2. Sampling the posterior distribution for the hidden states

Chib [21] developed a method for simulating the hidden states from their full joint distribution in the univariate hidden Markov model case. We will describe his full Bayesian algorithm for the univariate hidden Markov model before generalizing it to our MAR(1)HMM.

#### 4.2.1. Chib's algorithm for the univariate hidden Markov model for estimation of the states

Suppose we have an observed process $Y_n = (y_1, \ldots, y_n)$ and hidden states $X_n = (x_1, \ldots, x_n)$, and let $\theta$ be the parameters of the model. For simplicity we write $X_t = (x_1, \ldots, x_t)$ for the history of the states up to time $t$ and $X^{t+1} = (x_{t+1}, \ldots, x_n)$ for the future from $t+1$ to $n$. We use the same notation for $Y_t$ and $Y^{t+1}$.

For each state $x_t \in \{1, 2, \ldots, a\}$, $t = 1, 2, \ldots, n$, the hidden model can be described by a conditional density given the hidden states, $\pi(y_t \,|\, Y_{t-1}, x_t = k) = \pi(y_t \,|\, Y_{t-1}, \theta_k)$, $k = 1, \ldots, a$, with $x_t$ depending only on $x_{t-1}$ and having transition matrix $\Pi$ and initial distribution $\Pi_0$; the parameters for $\pi(\cdot)$ are $\theta = (\theta_1, \ldots, \theta_a)$.

Chib [21] shows that it is preferable to simulate the full latent vector $X_n = (x_1, \ldots, x_n)$ from the joint distribution of $x_1, \ldots, x_n \,|\, Y_n, \theta$ in order to improve the convergence of the MCMC algorithm: instead of the $n$ additional blocks needed when each state is simulated separately, only one additional block is required. First, we write the joint conditional density as

$$P(\mathbf{X}\_n|Y\_n, \boldsymbol{\theta}, \boldsymbol{\Pi}) = P(\mathbf{x}\_n|Y\_n, \boldsymbol{\theta})P(\mathbf{x}\_{n-1}|Y\_n, \mathbf{x}\_n, \boldsymbol{\theta}, \boldsymbol{\Pi}) \times \cdots \times P(\mathbf{x}\_1|Y\_n, \mathbf{X}^2, \boldsymbol{\theta}, \boldsymbol{\Pi}).$$

For sampling, it is sufficient to consider sampling $x_t$ from $P(x_t \,|\, Y_n, X^{t+1}, \theta, \Pi)$. Moreover, $P(x_t \,|\, Y_n, X^{t+1}, \theta, \Pi) \propto P(x_t \,|\, Y_t, \theta, \Pi)\, P(x_{t+1} \,|\, x_t, \Pi)$. This expression has two ingredients: the first, $P(x_{t+1} \,|\, x_t, \Pi)$, comes from the transition matrix of the Markov chain; the second, $P(x_t \,|\, Y_t, \theta, \Pi)$, is obtained recursively, starting at $t = 1$.

The mass function $P(x_{t-1} \,|\, Y_{t-1}, \theta, \Pi)$ is transformed into $P(x_t \,|\, Y_t, \theta, \Pi)$, which is in turn transformed into $P(x_{t+1} \,|\, Y_{t+1}, \theta, \Pi)$, and so on. The update is as follows: for $k = 1, \ldots, a$, we can write

$$P(x_t = k \,|\, Y_t, \theta, \Pi) = \frac{P(x_t = k \,|\, Y_{t-1}, \theta, \Pi)\,\pi(y_t \,|\, Y_{t-1}, \theta_k)}{\sum_{l=1}^{a} P(x_t = l \,|\, Y_{t-1}, \theta, \Pi)\,\pi(y_t \,|\, Y_{t-1}, \theta_l)}.$$

These calculations are initialized at $t = 0$ by setting $P(x_1 \,|\, Y_0, \theta)$ to the stationary distribution of the Markov chain. Precisely, the simulation proceeds, for $k = 1, \ldots, a$, by first computing $P(x_1 = k \,|\, Y_0, \theta)$ from the initial distribution $\Pi_0(k)$ and $P(x_1 = k \,|\, Y_1, \theta, \Pi) \propto P(x_1 = k \,|\, Y_0, \theta, \Pi)\,\pi(y_1 \,|\, Y_0, \theta_k)$. Then, by forward calculation,

$$P(x_t = k \,|\, Y_{t-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{t-1} = l \,|\, Y_{t-1}, \theta), \quad t = 2, \ldots, n,$$

where $\Pi_{lk}$ is the transition probability, and $P(x_t = k \,|\, Y_t, \theta) \propto P(x_t = k \,|\, Y_{t-1}, \theta, \Pi)\,\pi(y_t \,|\, Y_{t-1}, \theta_k)$. The last term of the forward computation, $P(x_n = k \,|\, Y_n, \theta)$, serves as the start of the backward pass, and we get recursively, for each $t = n-1, \ldots, 1$:

$$P(x_t = k \,|\, Y_n, X^{t+1}, \theta) \propto P(x_t = k \,|\, Y_t, \theta, \Pi)\, P(x_{t+1} \,|\, x_t = k, \Pi),$$

which permits the obtention of $X_n = (x_1, \ldots, x_n)$.
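The forward filtering / backward sampling scheme above can be sketched compactly. The example assumes Gaussian emissions for illustration; `ffbs` and the toy data are hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(2)

def ffbs(logobs, Pi, pi0):
    """Chib-style forward filtering / backward sampling for a univariate HMM.
    logobs[t, k] = log pi(y_t | theta_k) up to an additive constant;
    returns one draw of x_1..x_n from the joint posterior P(x | Y_n, theta)."""
    n, a = logobs.shape
    filt = np.empty((n, a))
    # Forward pass: filt[t, k] = P(x_t = k | Y_t, theta)
    p = pi0 * np.exp(logobs[0])
    filt[0] = p / p.sum()
    for t in range(1, n):
        pred = filt[t - 1] @ Pi               # P(x_t = k | Y_{t-1}, theta)
        p = pred * np.exp(logobs[t])
        filt[t] = p / p.sum()
    # Backward pass: sample x_n, then x_t | x_{t+1}
    x = np.empty(n, dtype=int)
    x[-1] = rng.choice(a, p=filt[-1])
    for t in range(n - 2, -1, -1):
        p = filt[t] * Pi[:, x[t + 1]]         # P(x_t | Y_t) * P(x_{t+1} | x_t)
        x[t] = rng.choice(a, p=p / p.sum())
    return x

# Two well-separated Gaussian emission states for illustration
mu, sigma = np.array([0.0, 5.0]), 1.0
y = np.array([0.1, -0.2, 5.1, 4.8, 0.0])
logobs = -0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])
print(ffbs(logobs, Pi, np.array([0.5, 0.5])))
```

Because the whole path is drawn in one block rather than state by state, successive MCMC draws of $X_n$ mix far better.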

#### 4.2.2. Simulating the hidden states for the MAR(1)HMM


Returning to our model, and adopting the notation and algorithm developed by Fitzpatrick and Marchev, let $f$ denote the observation density for the MAR(1)HMM, and for $u_0 < t < M$ define $x_{-t} = (x_{u_0}, \ldots, x_t)$, $x^t = (x_t, \ldots, x_M)$, $y(t) = (y_{i,t},\ i = 1, 2, \ldots, n)$, $y_{,t} = \bigcup_{i:\, u_i < t} \{y_{i,u_i}, \ldots, y_{i,\min\{t, m_i\}}\}$, and $y^t = \bigcup_{i:\, t < m_i} \{y_{i,\max\{t+1, u_i\}}, \ldots, y_{i,m_i}\}$. The posterior distribution of the hidden states can be written as

$$P(x_{-M} \,|\, y_{,M}, \theta) = P(x_M \,|\, y_{,M}, \theta) \times \cdots \times P(x_{u_0} \,|\, y_{,M}, x_{u_0+1}, \theta),$$

so we can sample the whole sequence of states by sampling from $P(x_t \,|\, y_{,M}, x_{t+1}, \theta)$. Hence, the estimation of the hidden states is performed recursively by first initializing

$$P(x_{u_0} \,|\, y_{,u_0}, \theta) \propto P(y_{,u_0} \,|\, x_{u_0})\, P(x_{u_0} \,|\, r); \quad y_{,u_0} = \left\{y_{i,u_i} : u_i = u_0,\ i = 1, \ldots, n\right\}.$$

$$P(x_{u_0+1} = k \,|\, y_{,u_0}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{u_0} = l \,|\, y_{,u_0}, \theta); \quad k = 1, \ldots, a.$$

$$P(x_{u_0+1} = k \,|\, y_{,u_0+1}, \theta) \propto P(x_{u_0+1} = k \,|\, y_{,u_0}, \theta)\, f(y(u_0+1) \,|\, y_{,u_0}, \theta_k).$$

We perform a similar calculation for every state at time $t$, and we conclude by calculating

$$P(x_M = k \,|\, y_{,M-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{M-1} = l \,|\, y_{,M-1}, \theta)$$

and $P(x_M = k \,|\, y_{,M}, \theta) \propto P(x_M = k \,|\, y_{,M-1}, \theta)\, f(y(M) \,|\, y_{,M-1}, \theta_k)$, which permits the simulation of $P(x_M \,|\, y_{,M}, \theta)$. Finally, by backward calculation, we simulate from the probabilities

$$P(x_t \,|\, y_{,M}, x_{t+1}, \theta) \propto P(x_{t+1} \,|\, x_t, \Pi)\, P(x_t \,|\, y_{,t}, \theta)$$

for each time $t = M-1, \ldots, u_0$. These backward probabilities permit the simulation of the latent states.

#### 4.3. Sampling from P(θ|x, y)

#### 4.3.1. Sampling Π

Under a Dirichlet prior for each row of the transition matrix, $P(\Pi_i) \propto \mathcal{D}(\delta_{i1}, \ldots, \delta_{ia})$, and the independence assumption between those rows, the posterior distribution for $\Pi_i$ can be developed using Eq. (2) as follows. Let $n_{ij}$ denote the number of single transitions from state $i$ to state $j$; then

$$P(\Pi_i \,|\, y, x) \propto P(\Pi_i) \prod_{t=u_0+1}^{M} \prod_{j=1}^{a} \Pi_{ij}^{\chi_{\{x_{t-1}, x_t\}}(i,j)} \propto P(\Pi_i) \prod_{j=1}^{a} \Pi_{ij}^{n_{ij}} \propto \prod_{j=1}^{a} \Pi_{ij}^{\delta_{ij} + n_{ij} - 1} \propto \mathcal{D}(\delta_{i1} + n_{i1}, \ldots, \delta_{ia} + n_{ia}).$$

#### 4.3.2. Sampling posterior distribution for initial distribution

Let $n_{0l} = \chi_{\{x_{u_0}\}}(l)$, for $l = 1, \ldots, a$. Using Eq. (2), under a Dirichlet prior $\mathcal{D}(\delta_{01}, \ldots, \delta_{0a})$ for the parameter $r$, we obtain

$$P(r \,|\, x, y) \propto P(r) \prod_{l=1}^{a} r_l^{\chi_{\{x_{u_0}\}}(l)} \propto \prod_{l=1}^{a} r_l^{\delta_{0l} + n_{0l} - 1} \propto \mathcal{D}(\delta_{01} + n_{01}, \ldots, \delta_{0a} + n_{0a}).$$
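Both Dirichlet updates amount to counting events along the sampled hidden path and drawing from the count-updated Dirichlet. A sketch of the row update for $\Pi$, with hypothetical names (the draw for $r$ is the same pattern applied to the initial state):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_transition_rows(x, a, delta):
    """Draw each row Pi_i | y, x ~ D(delta_i + n_i), where n_ij counts
    the single transitions i -> j along the sampled hidden path x."""
    counts = np.zeros((a, a))
    for g, h in zip(x[:-1], x[1:]):
        counts[g, h] += 1                     # n_ij: transitions i -> j
    return np.vstack([rng.dirichlet(delta[i] + counts[i]) for i in range(a)])

x = np.array([0, 0, 1, 1, 1, 0, 0, 1])        # one sampled hidden path
delta = np.ones((2, 2))                       # default D(1, 1) prior per row
Pi_draw = sample_transition_rows(x, 2, delta)
print(Pi_draw)                                # each row sums to 1
```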

#### 4.3.3. Sampling posterior distribution for the autoregressive parameters μ, β, σ<sup>2</sup>

When a complete conditional distribution has a known form, such as a normal or beta distribution, we use the Gibbs sampler to draw the corresponding random variable; this is the case for our model. Let us define

$$n_{u_i}(l) = \sum_{i=1}^{n} \chi_{\{x_{u_i} = l\}}, \quad n_l = \sum_{i=1}^{n} \sum_{t=u_i+1}^{m_i} \chi_{\{x_t = l\}}, \quad N = \sum_{l=1}^{a} n_l, \quad n_{0l} = \chi_{\{x_{u_0}\}}(l).$$

So for $l = 1, 2, \ldots, a$, by supposing $\mathcal{N}(\alpha_l, \tau_l)$ as prior distribution and using Eq. (2), the conditional posterior distribution of $\mu^{(l)}$ is:

$$P(\mu^{(l)} \,|\, y, x) \propto P(\mu^{(l)}) \prod_{i=1}^{n} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,u_i} - \mu^{(l)}}{\sigma}\right)\right]^{\chi_{\{x_{u_i}\}}(l)} \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(l)}$$

$$\propto \exp\left(-\frac{1}{2}\left\{\frac{(\mu^{(l)} - \alpha_l)^2}{\tau_l} + \sum_{i=1,\, x_{u_i}=l}^{n} \left(\frac{y_{i,u_i} - \mu^{(l)}}{\sigma}\right)^2 + \sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left(\frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma}\right)^2\right\}\right);$$

then $\mu^{(l)} \,|\, y, x \sim \mathcal{N}(\tilde{\alpha}_l, \tilde{\tau}_l)$ with inverse variance $\tilde{\tau}_l^{-1} = \frac{n_{u_i}(l) + n_l}{\sigma^2} + \frac{1}{\tau_l}$ and mean


$$\tilde{\alpha}_l = \tilde{\tau}_l \left(\frac{\sum_{i=1,\, x_{u_i}=l}^{n} y_{i,u_i} + \sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} (y_{i,t} - \beta^{(l)} y_{i,t-1})}{\sigma^2} + \frac{\alpha_l}{\tau_l}\right).$$

For $\beta^{(l)}$, $l = 1, \ldots, a$, and similarly to $\mu^{(l)}$, $\mathcal{N}(b_l, c_l)$ is proposed as the prior choice, to obtain:

$$P(\beta^{(l)} \,|\, y, x) \propto P(\beta^{(l)}) \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(l)},$$

and therefore $\beta^{(l)} \,|\, y, x \sim \mathcal{N}(\tilde{b}_l, \tilde{c}_l)$ with inverse variance $\tilde{c}_l^{-1} = \frac{1}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} y_{i,t-1}^2}{\sigma^2}$ and mean

$$\tilde{b}_l = \tilde{c}_l \left(\frac{b_l}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} (y_{i,t} - \mu^{(l)})\, y_{i,t-1}}{\sigma^2}\right).$$

For the posterior distribution of $\sigma^2$, by supposing $\mathcal{IG}(\varepsilon, \zeta)$ as prior, we deduce from Eq. (2)

$$\begin{split} P(\sigma^2 \,|\, y, x) &\propto (\sigma^2)^{-(\varepsilon+1)} \exp\!\left(-\frac{\zeta}{\sigma^2}\right) \prod_{i=1}^{n} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,u_i} - \mu^{(x_{u_i})}}{\sigma}\right)\right] \\ &\quad \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - \mu^{(x_t)} - \beta^{(x_t)} y_{i,t-1}}{\sigma}\right)\right]; \end{split}$$

consequently <sup>σ</sup><sup>2</sup>=y, x � IGð~ε, <sup>~</sup>ζ<sup>Þ</sup> with parameters <sup>~</sup><sup>ε</sup> <sup>¼</sup> nui þ N <sup>2</sup> þ ε and

$$\tilde{\zeta} = \frac{\sum\_{i=1}^{n} (y\_{i, u\_i} - \mu^{(\mathbf{x}\_{\tilde{u}})})^2 + \sum\_{i=1}^{n} \sum\_{t=u\_i+1}^{m\_i} (y\_{i, t} - \mu^{(\mathbf{x}\_t)} - \beta^{(\mathbf{x}\_t)} y\_{i, t-1})^2}{2} + \zeta.$$
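For instance, a single draw of $\sigma^2$ from this inverse-gamma full conditional can be sketched as follows (the residuals and hyperparameter values are hypothetical; NumPy has no inverse-gamma sampler, so we draw a gamma variate with scale $1/\tilde{\zeta}$ and invert it):

```python
import numpy as np

def draw_sigma2(resid_init, resid_ar, eps, zeta, rng):
    """Draw sigma^2 from its inverse-gamma full conditional.

    resid_init : residuals y_{i,u_i} - mu^{(x_{u_i})} at the first observations
    resid_ar   : residuals y_{i,t} - mu^{(x_t)} - beta^{(x_t)} y_{i,t-1}
    eps, zeta  : IG prior parameters
    """
    eps_post = eps + (len(resid_init) + len(resid_ar)) / 2.0
    zeta_post = zeta + (np.sum(resid_init**2) + np.sum(resid_ar**2)) / 2.0
    # If X ~ Gamma(shape=eps_post, scale=1/zeta_post), then 1/X ~ IG(eps_post, zeta_post).
    return 1.0 / rng.gamma(shape=eps_post, scale=1.0 / zeta_post)
```

With many residuals, the draw concentrates near the empirical residual variance, as the conjugate update suggests.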

Finally, the algorithm is run for d = 1,…,D iterations by alternating between the following steps, where each step draws from the conditional posterior of the given parameter.

#### The MCMC algorithm:

1. For h = 1,…,a, give reference values for the hyperparameters $\alpha_h$, $\tau_h$, $b_h$, $c_h$, $\delta_{0h}$, and $\delta_{ih}$ for i = 1,…,a.

2. Initialization (step d = 1 of the MCMC iterations): initialize $\Pi^{(1)}$, $r^{(1)}$, $\mu^{(1)}$, $\beta^{(1)}$, and $\sigma^{2(1)}$.

3. Simulation of the hidden states:

   a. Initialization of the forward simulation: $P(x_{u_0}^{(d)}|y_{,u_0}, \theta) \propto P(y_{,u_0}|x_{u_0}^{(d)})\, P(x_{u_0}^{(d)}|r^{(d)})$, with $y_{,u_0} = \{y_{i,u_i} : u_i = u_0,\; i = 1,\dots,n\}$.

   b. Forward simulation: for k = 1,…,a and t = u₀+1,…,M:
   $$P(x_t^{(d)} = k \,|\, y_{,t-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}^{(d)}\, P(x_{t-1}^{(d)} = l \,|\, y_{,t-1}, \theta)$$
   and
   $$P(x_t^{(d)} = k \,|\, y_{,t}, \theta) = \frac{P(x_t^{(d)} = k \,|\, y_{,t-1}, \theta)\, f(y(t)\,|\,y_{,t-1}, \theta_k)}{\sum_{l=1}^{a} P(x_t^{(d)} = l \,|\, y_{,t-1}, \theta)\, f(y(t)\,|\,y_{,t-1}, \theta_l)}.$$

   c. Initialization of the backward simulation: for k = 1,…,a, given $P(x_M^{(d)} = k \,|\, y_{,M}, \theta)$ from the forward simulation, we get $P(x_M^{(d)} \,|\, y_{,M}, \theta)$.

   d. Backward simulation: for k = 1,…,a and t = M−1,…,u₀:
   $$P(x_t^{(d)} \,|\, y_{,M}, x_{t+1}^{(d)}, \theta) \propto P(x_{t+1}^{(d)} \,|\, x_t^{(d)}, \Pi)\, P(x_t^{(d)} \,|\, y_{,t}, \theta).$$

4. Estimation of the initial distribution and the transition distribution:

   a. For l = 1,…,a and k = 1,…,a, calculate $n_{0l} = \chi_{\{x_{u_0}^{(d)}\}}(l)$ and $n_{kl} = \sum_{t=u_0+1}^{M} \chi_{\{x_{t-1}^{(d)}, x_t^{(d)}\}}(k, l)$.

   b. Sample $(r_1^{(d+1)},\dots,r_a^{(d+1)}) \sim D(\delta_{01}+n_{01},\dots,\delta_{0a}+n_{0a})$.

   c. For i = 1,…,a, sample $(\Pi_{i1}^{(d+1)},\dots,\Pi_{ia}^{(d+1)}) \sim D(\delta_{i1}+n_{i1},\dots,\delta_{ia}+n_{ia})$.

5. Simulation of μ: for l = 1,…,a,

   a. $\tilde{\tau}_l^{-1} = \frac{n_{u_0}(l) + n_l}{\sigma_{(d)}^2} + \frac{1}{\tau_l}$.

   b. $\tilde{\alpha}_l = \tilde{\tau}_l \left( \frac{\sum_{i=1, x_{u_i}=l}^{n} y_{i,u_i} + \sum_{i=1}^{n} \sum_{t=u_i+1, x_t=l}^{m_i} (y_{i,t} - \beta_{(d)}^{(l)} y_{i,t-1})}{\sigma_{(d)}^2} + \frac{\alpha_l}{\tau_l} \right)$.

   c. Simulate $\mu_{(d+1)}^{(l)}|y,x \sim \mathcal{N}(\tilde{\alpha}_l, \tilde{\tau}_l)$.

6. Simulation of β: for l = 1,…,a,

   a. $\tilde{c}_l^{-1} = \frac{1}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1, x_t=l}^{m_i} y_{i,t-1}^2}{\sigma_{(d)}^2}$.

   b. $\tilde{b}_l = \tilde{c}_l \left( \frac{b_l}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1, x_t=l}^{m_i} (y_{i,t} - \mu_{(d+1)}^{(l)})\, y_{i,t-1}}{\sigma_{(d)}^2} \right)$.

   c. Simulate $\beta_{(d+1)}^{(l)}|y,x \sim \mathcal{N}(\tilde{b}_l, \tilde{c}_l)$.

7. Simulation of σ²:

   a. $\tilde{\epsilon} = \frac{n_{u_0} + N}{2} + \epsilon$.

   b. $\tilde{\zeta} = \frac{\sum_{i=1}^{n} (y_{i,u_i} - \mu_{(d+1)}^{(x_{u_i})})^2 + \sum_{i=1}^{n} \sum_{t=u_i+1}^{m_i} (y_{i,t} - \mu_{(d+1)}^{(x_t)} - \beta_{(d+1)}^{(x_t)} y_{i,t-1})^2}{2} + \zeta$.

   c. Simulate $\sigma_{(d+1)}^2|y,x \sim IG(\tilde{\epsilon}, \tilde{\zeta})$.
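Step 3 is a standard forward-filtering, backward-sampling pass. A minimal sketch for a single hidden chain, assuming the Gaussian AR(1) emissions of the model (the function and variable names are ours, and the first observation is treated as having no AR term):

```python
import numpy as np

def ffbs(y, r, Pi, mu, beta, sigma2, rng):
    """Forward-filter, backward-sample the hidden states of an AR(1) HMM.

    y : observed series; r : initial distribution; Pi : transition matrix;
    mu, beta : per-state AR(1) parameters; sigma2 : emission variance.
    """
    M, a = len(y), len(r)
    filt = np.zeros((M, a))                    # filtered P(x_t = k | y_{1:t})
    like0 = np.exp(-(y[0] - mu) ** 2 / (2 * sigma2))   # t = 0: no AR term
    filt[0] = r * like0 / np.sum(r * like0)
    for t in range(1, M):                      # forward recursion
        pred = filt[t - 1] @ Pi                # P(x_t = k | y_{1:t-1})
        like = np.exp(-(y[t] - mu - beta * y[t - 1]) ** 2 / (2 * sigma2))
        filt[t] = pred * like / np.sum(pred * like)
    x = np.zeros(M, dtype=int)                 # backward sampling
    x[M - 1] = rng.choice(a, p=filt[M - 1])
    for t in range(M - 2, -1, -1):
        w = filt[t] * Pi[:, x[t + 1]]          # P(x_t | y_{1:t}, x_{t+1})
        x[t] = rng.choice(a, p=w / w.sum())
    return x
```

With well-separated state means, a single pass already recovers most of the hidden path.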

### 5. Simulation study

In this section, we apply our results to the breast cancer model discussed earlier. The motivation for this work is that the progression of breast cancer cannot be observed directly; instead, we must use disease-related observations that characterize its progression. These observations are measurable quantities called biomarkers, where the word biomarker designates any objective, measurable indication of a biological process or disease condition, including during treatment. Biomarkers are increasingly used in the management of breast cancer patients; for example, Ref. [22] reports a correlation between elevation of CEA and/or CA15-3 and disease progression in breast cancer patients. We also use autoregressive dependence among the observations to add more dynamics to the model, unlike conventional HMMs in which successive observations are independent given the Markov process. We adopt the classification of breast cancer into three states: the local stage, where the disease is confined within the breast; the regional stage, when the lymph nodes are involved; and the distant stage, where the cancer is found in other parts of the body. We restrict ourselves to these three stages, unlike classifications such as the TNM (tumor, node, and metastasis) system that divide the progression into more stages. For lack of available data on breast cancer biomarkers, we confine ourselves to simulating an MAR(1)HMM model with observation time M = 24, a number of individuals n = 210, and a = 3 Markov states, with the observation length for each individual selected uniformly between 2 and M. The simulation supposes autoregressive means $\mu = (\mu^{(1)}, \mu^{(2)}, \mu^{(3)}) = (12, 24, 36)$, since markers such as CA15-3 increase as the disease advances toward metastatic breast cancer. In addition, CA15-3 increases rapidly between successive observations, so we take the parameters $\beta = (\beta^{(1)}, \beta^{(2)}, \beta^{(3)}) = (0.2, 0.4, 0.8)$.

The simulation algorithm works as follows:

1. For each individual i = 1,…,n, choose $m_i$, the length of observation for that individual.
2. Generate each discrete disease state $x_t$ using the transition matrix $\Pi = (0.7, 0.2, 0.1;\; 0.1, 0.6, 0.3;\; 0.2, 0.3, 0.5)$ for t = u₀+1,…,M.
3. Generate the observations $y_{i,t}$ for all individuals using our model (1).

We choose an $IG(0.001, 0.001)$ prior for $\sigma^2$, a $D(1,\dots,1)$ prior for each row of $\Pi$, and Gaussian noninformative priors for the μs and the βs. Having the hidden states and the observations, we ran our algorithm for 8000 MCMC iterations. Convergence of the MCMC algorithm was assessed by analyzing the mixing plots of the MCMC iterations shown in Figure 1, checking the sample autocorrelation graphs illustrated in Figure 2, and inspecting histograms of the posterior densities of the model parameters in Figure 3. All parameters show good mixing of the chains, autocorrelations that decay after a few lags, and well-fitting posterior densities. The Gelman [23] potential scale reduction factor (PSRF) was also plotted. The PSRF is computed from more than two MCMC chains (three chains are considered in this work) for each parameter of the model; it indicates whether the chains have forgotten their initial values and whether the output from all chains is indistinguishable. It is based on a comparison of within-chain and between-chain variances and is similar to a classical analysis of variance; when the PSRF is high (say, greater than 1.1 or 1.2), the chains should be run longer to improve convergence to the stationary distribution. Each PSRF declines to 1 as the number of iterations approaches infinity, confirming convergence. All parameters showed a PSRF below 1.1 as the number of iterations increased, a good sign of convergence (Figure 4). Moreover, we should point out that the family of Markov switching models suffers from the so-called label switching problem (e.g., Ref. [24]), which raises an identifiability issue, so the parameters cannot be estimated perfectly; in addition, the posterior densities could show evidence of multimodality. Some authors postprocess the output of the MCMC to deal with this issue (e.g., [25]), others use a random permutation of the parameters in each iteration of the MCMC algorithm (e.g., [26]), and one can also call on an invariant loss function method (e.g., [27]). In our case, no identifiability issue was noticed, since we used well-separated prior hyperparameters; even when starting from different initial values for the parameters, our algorithm converges after a few iterations.

Finally, before giving our results, we should report that the simulation of the Dirichlet posterior was carried out following ([28, p. 22], [29, p. 155]), who report that the posterior Dirichlet parameters should be simulated using the beta distribution approach. Table 1 shows that the posterior values estimated by the algorithm are very close to the true ones.
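Under the simplifying assumption that every series starts at the common time origin (the individual start times $u_i$ are suppressed here for brevity), the three data-generating steps can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(2)

a, M, n = 3, 24, 210
Pi = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3],
               [0.2, 0.3, 0.5]])
mu = np.array([12.0, 24.0, 36.0])
beta = np.array([0.2, 0.4, 0.8])
sigma = 1.0                               # illustrative emission noise level

# one common hidden state path x_t shared by all individuals
x = np.zeros(M, dtype=int)
x[0] = rng.choice(a)
for t in range(1, M):
    x[t] = rng.choice(a, p=Pi[x[t - 1]])

series = []
for i in range(n):
    m_i = rng.integers(2, M + 1)          # step 1: length uniform on {2, ..., M}
    y = np.zeros(m_i)
    y[0] = mu[x[0]] + sigma * rng.normal()            # first observation: no AR term
    for t in range(1, m_i):               # step 3: AR(1) emission given the state
        y[t] = mu[x[t]] + beta[x[t]] * y[t - 1] + sigma * rng.normal()
    series.append(y)
```

The resulting `series` list holds 210 trajectories of unequal lengths, matching the simulation design above.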

Figure 1. Markov chain mixing for each parameter through MCMC algorithm simulation.

Figure 2. Autocorrelation sample plots for parameters of the model.

Figure 3. Posterior densities for the parameters of the model (after 8000 iterations).


Figure 4. Potential scale reduction factor convergence to less than 1.02 with more iterations.
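For reference, the PSRF comparison of within-chain and between-chain variances can be sketched as follows (a simplified version of the Gelman diagnostic; the function name is ours):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for one parameter.

    chains : array of shape (m, n) -- m parallel chains of n draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)              # -> 1 as chains mix
```

Chains drawn from the same distribution give a PSRF near 1, while chains stuck around different values give a PSRF well above the 1.1 threshold discussed above.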


Table 1. Posterior inference for the parameters of the MAR(1)HMM model.
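The beta-distribution approach to simulating a Dirichlet vector, mentioned above, can be sketched as sequential beta draws that successively split the remaining probability mass (a stick-breaking construction; the hyperparameter values in the test are illustrative):

```python
import numpy as np

def dirichlet_via_beta(delta, rng):
    """Sample p ~ Dirichlet(delta) using only beta draws.

    p_1 ~ Beta(delta_1, delta_2 + ... + delta_a); each later component
    is a beta draw scaled by the probability mass still unallocated.
    """
    delta = np.asarray(delta, dtype=float)
    a = len(delta)
    p = np.zeros(a)
    remaining = 1.0
    for k in range(a - 1):
        b = rng.beta(delta[k], delta[k + 1:].sum())
        p[k] = remaining * b
        remaining *= 1.0 - b
    p[a - 1] = remaining
    return p
```

Each draw sums to one by construction, and the sample means converge to the Dirichlet means $\delta_k / \sum_j \delta_j$.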

### 6. Conclusion


We have extended the method of Chib [21] for block-update estimation of the states to an MAR(1)HMM model. Furthermore, our model can easily be extended to include missing observations: we need only add an extra step in each MCMC iteration to estimate them. We can also estimate the autoregressive model for different values of the autoregressive order, p ≥ 1, by evaluating the Bayesian information criterion to select the order that best fits the observations. Our model captures the complexity and dynamics of the evolution of breast cancer by introducing latent states: the probabilities of transition between the latent states allow comparison of the effects of treatments in slowing or accelerating the transition of the disease from one health stage to another, and the autoregressive mean values corresponding to different stages of the disease would guide medical doctors and scientists in monitoring patients in different phases of the disease. The model also accommodates individual observation series of different lengths.

Last but not least, we would like to mention the utility of switching diffusion processes in addressing and analyzing many complicated applications, such as finance and risk management. Our future work will apply these processes to explore disease progression, because they are characterized by the coexistence of continuous dynamics and discrete events, as well as their interactions.

### Acknowledgements

We would like to thank the editorial staff for the comments that helped in improving this work. We would also like to thank the supporters of this work: the Lalla Salma Foundation Prevention and Treatment of Cancer, Rabat, Morocco; and the Germano-Moroccan Program for Scientific Research PMARS 2015-060.

### Author details

Hamid El Maroufy<sup>1</sup>\*†, El Houcine Hibbah<sup>1</sup>†, Abdelmajid Zyad<sup>2</sup>† and Taib Ziad<sup>3</sup>

\*Address all correspondence to: h\_elmaroufy@hotmail.com

1 Department of Mathematics, Faculty of Sciences and Technics, Sultan Moulay Slimane University, Béni Mellal, Morocco

2 Biological Engineering Laboratory, Team of Natural Substances, Cell and Molecular Immuno-Pharmacology, Sultan Moulay Slimane University, Morocco

3 Early Clinical Development, Astra Zeneca RD, Gothenburg, Mölndal, Sweden

† The first three authors acknowledge the financial support of the Lalla Salma Foundation of Cancer: Prevention and Treatment, Project 09/2013.

### References


[1] Samy N, Ragab HM, El Maksoud NA, Shaalan M. Prognostic significance of serum Her2/neu, BCL2, CA15-3 and CEA in breast cancer patients: A short follow up. Cancer Biomarkers. 2009;6:63–72

[2] Benmiloud B, Pieczynski W. Estimation des parametres dans les chaines de Markov cachees et segmentation. Traitement du Signal. 1995;12:433–454

[3] Boys R, Henderson D. A Bayesian approach to DNA sequence segmentation (with discussion). Biometrics. 2004;60:573–588

[4] Guihenneuc-Jouyaux C, Richardson S, Longini IM Jr. Modeling disease progression by a hidden Markov process: Application to characterizing CD4 cell decline. Biometrics. 2000;56:733–741

[5] Albert J, Chib S. Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business and Economic Statistics. 1993;11:1–15

[6] Korolkiewicz M, Elliott RJ. A hidden Markov model of credit quality. Journal of Economic Dynamics and Control. 2008;32:3807–3819

[7] Zeng Y, Frias J. A novel HMM-based clustering algorithm for the analysis of gene expression time-course data. Computational Statistics and Data Analysis. 2006;50:2472–2494


[23] Gelman A. Inference and monitoring convergence. In: Gilks W, Richardson S, Spiegelhalter D, editors. Markov Chain Monte Carlo in Practice. London: Chapman and Hall/CRC; 1995. p. 131

[24] Fruhwirth-Schnatter S. Finite Mixture and Markov Switching Models. New York: Springer; 2006

[25] Celeux G. Bayesian inference for mixture: The label switching problem. In: Payne R, Green P, editors. Proceedings in Computational Statistics. Heidelberg: Physica; 1998. pp. 227–232

[26] Fruhwirth-Schnatter S. Markov Chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association. 2001;96:194–209

[27] Hurn M, Justel A, Robert C. Estimating mixtures of regressions. Journal of Computational and Graphical Statistics. 2003;12:55–79

[28] Kim C, Nelson C. State-Space Models with Regime Switching: Classical and Gibbs Sampling Approaches with Applications. Cambridge, MA: MIT Press; 1999

[29] Krolzig H-M. Markov-Switching Vector Autoregressions. Berlin, Heidelberg: Springer-Verlag; 1997

### **Bayesian Model Averaging and Compromising in Dose-Response Studies**

Steven B. Kim

DOI: 10.5772/intechopen.68786

Additional information is available at the end of the chapter

#### Abstract

Dose-response models are applied to animal-based cancer risk assessments and human-based clinical trials, usually with small samples. For sparse data, we rely on a parametric model for efficiency, but posterior inference can be sensitive to the assumed model. In addition, when we utilize prior information, multiple experts may have different prior knowledge about the parameter of interest. When we make sequential decisions to allocate experimental units in an experiment, an outcome may depend on the decision rule, and each decision rule has its own perspective. In this chapter, we address three practical issues in small-sample dose-response studies: (i) model-sensitivity, (ii) disagreement in prior knowledge, and (iii) conflicting perspectives in decision rules.

Keywords: dose-response models, model-sensitivity, model-averaging, prior-sensitivity, consensus prior, Bayesian decision theory, individual-level ethics, population-level ethics, Bayesian adaptive designs, sequential decisions, continual reassessment method, c-optimal design, Phase I clinical trials

### 1. Introduction

Dose-response modeling is often used to learn about the effect of an agent on a particular outcome with respect to dose. It is widely applied to animal-based cancer risk assessments and human-based clinical trials. Sample sizes are typically small, so many statistical issues can arise from a limited amount of data. These issues include the impact of a misspecified model, prior-sensitivity, and conflicting ethical perspectives in clinical trials. In this chapter, we focus on cases where the outcome variable of interest is binary (a predefined event happened or not) when an experimental unit is exposed to a dose. The main ideas carry over to cases where the outcome variable is continuous or discrete.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


There are two different approaches to statistical inference. One approach is called frequentist inference. In this framework, we often rely on the sampling distribution of a statistic and large-sample theories. Another approach is called Bayesian inference. It is founded on Bayes' Theorem, and it allows researchers to express prior knowledge independent of data. In a small-sample study, Bayesian inference can be more useful than frequentist inference because we can incorporate both the researcher's prior knowledge and the observed data to make inference about the parameter of interest. Bayesian ideas are briefly introduced for dose-response modeling with a binary outcome in Section 2.

In a small-sample study, we often rely on a parametric model to gain statistical efficiency (i.e., less variance in parameter estimation), but our inference can be severely biased by the use of a wrong model. To account for model uncertainty, it is reasonable to specify multiple models and make inference based on "averaged inference." In this regard, Bayesian model averaging (BMA) is a useful method to gain robustness [1]. The BMA method has a wide range of applications, and we focus on its application to animal-based cancer risk assessments in Section 3.

In clinical trials, study participants are real patients, and therefore, we need to consider ethics carefully. There are conflicting perspectives of individual- and population-level ethics in early phase clinical trials. Individual-level ethics focuses on the benefit of trial participants, whereas population-level ethics focuses on the benefit of future patients, which may require some level of sacrifice from trial participants. We compare the two conflicting perspectives in clinical trials based on Bayesian decision theory, and we discuss a compromising method in Section 4 [2, 3].

A sample size for an early phase (Phase I) clinical trial is often less than 30 subjects. Dose allocations for the first few patients and statistical inference for future patients depend heavily on the researcher's prior knowledge in sparse data. When multiple researchers have different prior knowledge about a parameter of interest, one compromising approach is to combine their prior elicitations and average them (i.e., a consensus prior) [4, 5]. When we average the prior elicitations, there are two different approaches to determine the weight of each prior elicitation: weights determined before observing data and weights determined after observing data. We discuss the operating characteristics of the two weighting methods in the context of Phase I clinical trials in Section 5.
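For a binary outcome, the after-the-data weighting of a consensus prior can be sketched by updating each expert's weight in proportion to the marginal likelihood of the data under that expert's prior (the priors and counts below are illustrative; the binomial coefficient cancels in the weights):

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(a, b, s, n):
    """P(data | Beta(a, b) prior) for s events in n Bernoulli trials."""
    return exp(log_beta(a + s, b + n - s) - log_beta(a, b))

# two experts' prior elicitations for the same probability parameter
priors = [(2.0, 8.0), (8.0, 2.0)]
w_prior = [0.5, 0.5]               # weights fixed before observing data
s, n = 2, 10                       # observed: 2 events in 10 trials

# weights updated after observing data: w_k m_k / sum_j w_j m_j
m = [marginal_likelihood(a, b, s, n) for a, b in priors]
total = sum(w * mk for w, mk in zip(w_prior, m))
w_post = [w * mk / total for w, mk in zip(w_prior, m)]
```

Since the observed rate 2/10 agrees with the first expert's Beta(2, 8) prior, that expert's weight increases after the data are seen.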

### 2. Bayesian inference

In statistics, we address a research question by a parameter, which is often denoted by θ. We begin Bayesian inference by modeling the prior knowledge about θ. A function which models the prior knowledge about θ is called the prior density function of θ, and we denote it by f(θ). It is a non-negative function which satisfies $\int_\Omega f(\theta)\, d\theta = 1$, where Ω is the set of all possible values of θ (i.e., the parameter space). We then model data $\vec{y} = (y_1, \dots, y_n)$ given θ. The likelihood function, denoted by $f(\vec{y}\,|\,\theta)$, quantifies the likelihood of observing a particular sample $\vec{y} = (y_1, \dots, y_n)$ under an assumed probability model. By Bayes' Theorem, we update our knowledge about θ after observing data $\vec{y}$ as

$$f(\theta|\vec{y}) = \frac{f(\vec{y}|\theta)f(\theta)}{f(\vec{y})} \,. \tag{1}$$

The function $f(\theta\,|\,\vec{y})$ is called the posterior density function of θ given data $\vec{y}$. Since we treat the observed data $\vec{y} = (y_1, \dots, y_n)$ as fixed numbers, we often express Eq. (1) as follows

$$f(\theta|\vec{y}) \propto f(\vec{y}|\theta)\,f(\theta), \qquad f(\theta|\vec{y}) = k \, f(\vec{y}|\theta) \, f(\theta), \tag{2}$$

where k is the normalizing constant which makes $\int_\Omega f(\theta\,|\,\vec{y})\, d\theta = 1$. We can often recognize $f(\theta\,|\,\vec{y})$ from the prior density function f(θ) and the likelihood function $f(\vec{y}\,|\,\theta)$ without considering the denominator $f(\vec{y}) = \int_\Omega f(\vec{y}\,|\,\theta)\, f(\theta)\, d\theta$ in Eq. (1), which is called the marginal likelihood.
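Numerically, the normalizing constant k can be recovered on a grid: evaluate the unnormalized product $f(\vec{y}|\theta)f(\theta)$ and divide by its integral. A toy sketch with a Beta(2, 2) prior and Bernoulli likelihood (the data values are illustrative):

```python
import numpy as np

# grid over the parameter space Omega = (0, 1)
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
dtheta = theta[1] - theta[0]

prior = 6.0 * theta * (1.0 - theta)        # Beta(2, 2) prior density f(theta)
s, n = 7, 20                               # hypothetical data: 7 events in 20 trials
likelihood = theta**s * (1.0 - theta)**(n - s)

unnorm = likelihood * prior                # f(y|theta) f(theta), known up to k
posterior = unnorm / (unnorm.sum() * dtheta)   # divide by the marginal likelihood
```

Here the normalized curve matches the Beta(9, 15) density, whose mean is 9/24 = 0.375.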

#### 2.1. Example


Suppose we observe n = 20 rats for 2 years. Let π be the parameter of interest, which is interpreted as the probability of developing some type of tumor. Suppose a researcher models the prior knowledge about π using the prior density function

$$f(\pi) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)} \,\pi^{a-1} \left(1-\pi\right)^{b-1}, \ 0 < \pi < 1\,\,. \tag{3}$$

It is known as the beta distribution with shape parameters a > 0 and b > 0. We often denote the beta distribution by $\pi \sim \mathrm{Beta}(a, b)$, and the values of a and b must be specified by the researcher independent of the data. Let $\vec{y} = (y_1, \ldots, y_n)$ denote the observed data, where $y_i = 1$ if the i-th rat developed a tumor and $y_i = 0$ otherwise. Assuming $y_1, \ldots, y_n$ are independent observations, the likelihood function is as follows

$$f(\overrightarrow{y} \mid \pi) = \prod\_{i=1}^{n} \pi^{y\_i} \left(1 - \pi\right)^{1 - y\_i} = \pi^s \left(1 - \pi\right)^{n - s},\tag{4}$$

where $s = \sum_{i=1}^{n} y_i$ is the total number of rats that developed a tumor. By Eq. (2), the posterior density function of π is as follows

$$f(\pi|\vec{y}) = k\pi^{a+s-1} \left(1 - \pi\right)^{b+n-s-1},\tag{5}$$

where $k = \frac{\Gamma(a+b+n)}{\Gamma(a+s)\,\Gamma(b+n-s)}$ is the normalizing constant, which makes $\int_0^1 f(\pi \mid \vec{y}\,)\, d\pi = 1$. We can recognize that $\pi \mid \vec{y} \sim \mathrm{Beta}(a+s,\, b+n-s)$.

If the researcher fixed a = 2 and b = 3 and observed s = 9 from a sample of size n = 20, the prior density function is $f(\pi) = k\, \pi\, (1-\pi)^2$ with $k = \frac{\Gamma(5)}{\Gamma(2)\,\Gamma(3)} = 12$, and the posterior density function is $f(\pi \mid \vec{y}\,) = k\, \pi^{10}\, (1-\pi)^{13}$ with $k = \frac{\Gamma(25)}{\Gamma(11)\,\Gamma(14)} = 27457584$. The prior and posterior distributions are shown in Figure 1. The knowledge about π becomes more certain (less variance) after observing the data.
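The conjugate update above can be checked numerically. The following is a minimal sketch using only the standard library (the function name `beta_posterior` is ours, not from the text):

```python
from math import factorial

def beta_posterior(a, b, s, n):
    """Conjugate update: a Beta(a, b) prior with s tumors in n rats
    yields a Beta(a + s, b + n - s) posterior."""
    return a + s, b + n - s

# Prior Beta(2, 3); s = 9 of n = 20 rats developed a tumor.
a_post, b_post = beta_posterior(2, 3, s=9, n=20)

# Normalizing constant k = Gamma(a+b+n) / (Gamma(a+s) Gamma(b+n-s));
# for an integer m, Gamma(m) = (m - 1)!.
k = factorial(a_post + b_post - 1) // (factorial(a_post - 1) * factorial(b_post - 1))

# Posterior mean and variance of Beta(11, 14); the posterior variance is
# smaller than the prior variance of Beta(2, 3), reflecting the gain in
# certainty after observing the data.
post_mean = a_post / (a_post + b_post)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
prior_var = 2 * 3 / ((2 + 3) ** 2 * (2 + 3 + 1))
```

Using exact integer factorials reproduces the normalizing constant 27457584 quoted above without floating-point error.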

#### 2.2. Example

This example is simplified from Shao and Small [6]. In dose-response studies, we model π as a function of dose x. There are many link functions between π and x used in practice. In this example, we focus on a link function

$$
\pi_x = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \tag{6}
$$

which is known as a logistic regression model. It is commonly assumed that a dose-response curve increases with respect to dose, so we assume $\beta_1 > 0$ (and $\beta_0$ can be any real number). There are two regression parameters in Eq. (6), $\beta_0$ and $\beta_1$, and we denote them as $\vec{\beta} = (\beta_0, \beta_1)$. Figure 2 presents two dose-response curves. The solid curve is generated by $\vec{\beta} = (-1, 2)$, and the dotted curve is generated by $\vec{\beta} = (-2, 5)$. As $\beta_0$ increases, the background risk $\pi_0 = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$ increases, where $\pi_0$ is interpreted as the probability of tumor development at dose x = 0. The dose-response curve increases when $\beta_1 > 0$, and it decreases when $\beta_1 < 0$. The rate of change in the dose-response curve is determined by $|\beta_1|$.

To express prior knowledge about $\vec{\beta}$, we need to find an appropriate prior density function $f(\vec{\beta}\,)$. This is not simple because it is difficult to express one's knowledge about the two-dimensional parameter $\vec{\beta} = (\beta_0, \beta_1)$. For mathematical convenience, some practitioners use a flat prior density function $f(\vec{\beta}\,) \propto 1$. Another way of expressing a lack of prior knowledge about $\vec{\beta}$ is as follows

$$f(\overrightarrow{\beta}^{\cdot}) \propto \frac{1}{2\pi\sigma^2} \ e^{-\frac{\beta\_0^2 + \beta\_1^2}{2\sigma^2}} \mathbf{I}\_{\beta\_1 > 0} \tag{7}$$

with an arbitrarily large value of σ [6]. When a reliable source of prior information is available, there is a practical method known as the conditional mean prior [7], which will be discussed in a later section (see Section 4.2). In an experiment, the experimental doses $\vec{x} = (x_1, \ldots, x_n)$ are fixed, and we observe random binary outcomes $\vec{y} = (y_1, \ldots, y_n)$. Given $\vec{y}$ (and fixed $\vec{x}$), the likelihood function is as follows

$$f(\overrightarrow{y} \mid \overrightarrow{\beta}) = \prod_{i=1}^{n} \left( \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{y_i} \left( \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{1 - y_i} = \frac{e^{\beta_0 s_1 + \beta_1 s_2}}{\prod_{i=1}^{n} \left( 1 + e^{\beta_0 + \beta_1 x_i} \right)},\tag{8}$$

Figure 1. The prior $f(\pi)$ (dotted curve) and the posterior $f(\pi \mid \vec{y}\,)$ (solid curve).


where $s_1 = \sum_{i=1}^{n} y_i$ and $s_2 = \sum_{i=1}^{n} x_i y_i$. By incorporating both the prior and the data, the posterior density function is as follows

$$f(\overrightarrow{\beta} \mid \overrightarrow{y}) \propto f(\overrightarrow{\beta}) \frac{e^{\beta_0 s_1 + \beta_1 s_2}}{\prod_{i=1}^n \left(1 + e^{\beta_0 + \beta_1 x_i}\right)} \,. \tag{9}$$

In animal-based studies, one parameter of interest is the median effective dose, which is denoted by ED50. It is the dose which satisfies

$$
\pi_{\mathrm{ED}_{50}} = \frac{e^{\beta_0 + \beta_1 \mathrm{ED}_{50}}}{1 + e^{\beta_0 + \beta_1 \mathrm{ED}_{50}}} = .5 \,, \tag{10}
$$

and it can be shown by algebra that $\mathrm{ED}_{50} = -\frac{\beta_0}{\beta_1}$. In the case of $\beta_0 = -2$ and $\beta_1 = 5$, we have $\mathrm{ED}_{50} = .4$, as described in the figure by the dotted curve. In the case of $\beta_0 = -1$ and $\beta_1 = 2$, we have $\mathrm{ED}_{50} = .5$, as described in the figure by the solid curve.

In 1997, the International Agency for Research on Cancer classified 2,3,7,8-tetrachlorodibenzo-p-dioxin (known as TCDD) as a carcinogen for humans based on various empirical evidence [8].


Figure 2. Two dose-response curves using the logistic link.

In 1978, Kociba et al. presented data on male Sprague-Dawley rats at four experimental doses: 0, 1, 10 and 100 nanograms per kilogram per day (ng/kg/day) [9]. In the control dose group, nine of 86 rats developed a tumor (known as hepatocellular carcinoma); three of 50 rats developed the tumor at dose 1; 18 of 50 rats developed the tumor at dose 10; and 34 of 48 rats developed the tumor at dose 100 [6]. Without loss of generality, we let $x_i = 0$ for $i = 1, \ldots, 86$; $x_i = 1$ for $i = 87, \ldots, 136$; $x_i = 10$ for $i = 137, \ldots, 186$; and $x_i = 100$ for $i = 187, \ldots, 234$. The given information is sufficient to calculate $s_1 = \sum_{i=1}^{n} y_i = 64$ and $s_2 = \sum_{i=1}^{n} x_i y_i = 3583$. By the use of the flat prior $f(\vec{\beta}\,) \propto 1$ with the restriction $\beta_1 > 0$, given the observed sample of size n = 234, we can generate random numbers of $\vec{\beta} = (\beta_0, \beta_1)$ from the posterior density function

$$f(\overrightarrow{\beta} \mid \overrightarrow{y}) \propto \frac{e^{\beta_0 s_1 + \beta_1 s_2}}{\prod_{i=1}^n (1 + e^{\beta_0 + \beta_1 x_i})} \, \mathbf{I}_{\beta_1 > 0}, \tag{11}$$

where $\mathbf{I}_{\beta_1 > 0} = 1$ if $\beta_1 > 0$ and $\mathbf{I}_{\beta_1 > 0} = 0$ otherwise. Using a Markov chain Monte Carlo (MCMC) method, we can approximate the posterior distribution of $\vec{\beta}$ as shown in the left panel of Figure 3. By transforming $(\beta_0, \beta_1)$ to $\mathrm{ED}_{50} = -\frac{\beta_0}{\beta_1}$, we can approximate the posterior

Figure 3. Approximate posterior distributions of $(\beta_0, \beta_1)$ and $\mathrm{ED}_{50} = -\frac{\beta_0}{\beta_1}$.

distribution of the median effective dose ED50 as shown in the right panel. The posterior mean of ED50 is $E(\mathrm{ED}_{50} \mid \vec{y}\,) = 64.9$ with 95% credible interval (50.8, 82.5), given by the 2.5th and 97.5th percentiles of the posterior distribution.
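The MCMC step can be sketched with a plain random-walk Metropolis sampler on the log of Eq. (11). This is an illustration only (the starting values and proposal scales are our assumptions, not the sampler used by the authors), using the grouped Kociba et al. counts:

```python
import math
import random

# Kociba et al. (1978) tumor counts, grouped by dose (ng/kg/day):
# 9/86 at dose 0, 3/50 at dose 1, 18/50 at dose 10, 34/48 at dose 100.
doses  = [0, 1, 10, 100]
tumors = [9, 3, 18, 34]
totals = [86, 50, 50, 48]

s1 = sum(tumors)                                 # sum of y_i
s2 = sum(d * t for d, t in zip(doses, tumors))   # sum of x_i * y_i

def log_post(b0, b1):
    """Log posterior under the flat prior with beta1 > 0, Eq. (11)."""
    if b1 <= 0:
        return -math.inf
    lp = b0 * s1 + b1 * s2
    for d, m in zip(doses, totals):
        lp -= m * math.log1p(math.exp(b0 + b1 * d))
    return lp

def metropolis(iters=20000, seed=1):
    """Random-walk Metropolis; tuning constants are illustrative."""
    rng = random.Random(seed)
    b0, b1 = -2.0, 0.03                          # rough starting values
    lp = log_post(b0, b1)
    draws = []
    for _ in range(iters):
        c0 = b0 + rng.gauss(0, 0.15)
        c1 = b1 + rng.gauss(0, 0.005)
        lpc = log_post(c0, c1)
        if math.log(rng.random()) < lpc - lp:    # accept/reject step
            b0, b1, lp = c0, c1, lpc
        draws.append((b0, b1))
    return draws[iters // 2:]                    # discard burn-in half

draws = metropolis()
ed50 = [-b0 / b1 for b0, b1 in draws]            # transform each draw
```

The restriction $\beta_1 > 0$ is enforced by returning $-\infty$ from the log posterior, so proposals with $\beta_1 \le 0$ are always rejected; the ED50 draws then summarize the right panel of Figure 3.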

### 3. Bayesian model averaging


In a small sample, we borrow the strength of a parametric model to gain efficiency in parameter estimation. However, an assumed model may not describe the true dose-response relationship adequately. The impact of model misspecification is not negligible, particularly in a poor experimental design. In such a limited practical situation, Bayesian model averaging (BMA) can be a useful method to account for model uncertainty. It is widely applied in practice, and in this section, we focus on its application to cancer risk assessment for the estimation of a benchmark dose [1, 6, 10, 11].

Let θ denote a parameter of interest. Suppose we have a set of K candidate models denoted by $\mathcal{M} = \{M_1, \ldots, M_K\}$. Let $\vec{\beta}_k$ denote the vector of regression parameters under model $M_k$ for $k = 1, \ldots, K$. Suppose θ is a function of $\vec{\beta}_k$, and the interpretation of θ must be common across all models. Let $f(\vec{\beta}_k \mid M_k)$ and $f(\vec{y} \mid \vec{\beta}_k, M_k)$ denote the prior density function and the likelihood function, respectively, under $M_k$. By the Law of Total Probability, the posterior density function of θ is as follows

$$f(\theta \mid \overrightarrow{y}) = \sum_{k=1}^{K} f(\theta \mid M_k, \overrightarrow{y})\, P(M_k \mid \overrightarrow{y})\,. \tag{12}$$

In Eq. (12), the posterior density function $f(\theta \mid M_k, \vec{y}\,)$ depends on model $M_k$, and the posterior model probability $P(M_k \mid \vec{y}\,)$ quantifies the plausibility of model $M_k$ after observing the data, which is given by

$$P(M_k \mid \overrightarrow{y}) = \frac{f(\overrightarrow{y} \mid M_k)\, P(M_k)}{\sum_{j=1}^{K} f(\overrightarrow{y} \mid M_j)\, P(M_j)}.\tag{13}$$

In Eq. (13), the prior model probability $P(M_k)$ is determined before observing data such that $P(M_k) > 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K} P(M_k) = 1$. The marginal likelihood under $M_k$ requires the integration

$$f(\overrightarrow{y} \mid M_k) = \int f(\overrightarrow{y} \mid \overrightarrow{\beta}_k, M_k)\, f(\overrightarrow{\beta}_k \mid M_k)\, d\overrightarrow{\beta}_k. \tag{14}$$

In the BMA method, all K models contribute to inference of θ through the averaged posterior density function in Eq. (12), and the weight of contribution is determined by Bayes' Theorem in Eq. (13).
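Eq. (13) is a straightforward normalization once the marginal likelihoods are available. A minimal sketch (the function name is ours; log marginal likelihoods are used with the log-sum-exp trick because raw marginal likelihoods underflow easily):

```python
import math

def posterior_model_probs(log_marg_lik, prior_probs):
    """Posterior model probabilities, Eq. (13), from log marginal
    likelihoods log f(y | M_k) and prior probabilities P(M_k)."""
    logs = [lml + math.log(p) for lml, p in zip(log_marg_lik, prior_probs)]
    m = max(logs)                       # log-sum-exp: subtract the max
    w = [math.exp(l - m) for l in logs]
    total = sum(w)
    return [x / total for x in w]

# Two models with equal prior probabilities; M2's marginal likelihood
# is larger by a factor of e^3, so it dominates the posterior weights.
probs = posterior_model_probs([-103.0, -100.0], [0.5, 0.5])
```

With equal priors, the ratio of the resulting probabilities equals the ratio of the marginal likelihoods, i.e., the Bayes factor.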

#### 3.1. Example

This example is continued from the example in Section 2.2. Recall π<sup>x</sup> is interpreted as the probability of a toxic event (tumor development) at dose x. In many cancer risk assessments, a parameter of interest is θγ at a fixed risk level γ, which is defined as follows

$$\gamma = \frac{\pi_{\theta_\gamma} - \pi_0}{1 - \pi_0}, \tag{15}$$

or equivalently $\pi_{\theta_\gamma} = \pi_0 + (1 - \pi_0)\, \gamma$. In words, $\theta_\gamma$ is the dose corresponding to a fixed increase in the risk level. In the frequentist framework, Crump defined a benchmark dose as a lower confidence limit for $\theta_\gamma$ [12]. In the Bayesian framework, an analogous definition would be a lower credible bound (i.e., a fixed low percentile of the posterior distribution of $\theta_\gamma$). The definition is widely applied to public health protection [13].

In practice, γ is fixed between 0.01 and 0.1. Often, the estimation of $\theta_\gamma$ is highly sensitive to the assumed dose-response model because we lack information at low doses. Shao and Small fixed γ = 0.1 and applied BMA with K = 2 models, a logistic model and a quantal-linear model [6]. In the quantal-linear model, the probability of tumor development is modeled by

$$
\pi_x = \beta_0 + (1 - \beta_0)(1 - e^{-\beta_1 x}), \tag{16}
$$

with the restrictions 0 < β<sup>0</sup> < 1 and β<sup>1</sup> > 0 under the monotonic assumption. The logistic model was given in Eq. (6) of Section 2.2.

Let $M_1$ denote the logistic model, and let $M_2$ denote the quantal-linear model. Assume the uniform prior model probabilities $P(M_1) = P(M_2) = .5$ and flat priors on the regression parameters. By posterior sampling, we can approximate the posterior model probabilities $P(M_1 \mid \vec{y}\,) = .049$ and $P(M_2 \mid \vec{y}\,) = .951$. Under $M_1$, the posterior mean of $\theta_{0.1}$ is 20.95 with 5th percentile 16.74. Under $M_2$, the posterior mean is 8.25 with 5th percentile 5.95. These


Figure 4. Posterior distributions of θ0.1 from the logistic model (left panel), the quantal-linear model (middle panel), and the Bayesian model averaging (right panel).

results are very similar to the results reported by Shao and Small [6]. From these model-specific statistics, we can calculate the model-averaged posterior mean

$$E(\theta\_{0.1}|\vec{y}) = \sum\_{k=1}^{2} E(\theta\_{0.1}|M\_k, \vec{y}) \, P(M\_k|\vec{y}) = 20.95 \, (.049) + 8.25 \, (.951) = 8.87\,. \tag{17}$$

However, we cannot calculate the 5th percentile of the model-averaged posterior distribution from the given statistics alone. In fact, we need to approximate the posterior distribution $f(\theta_{0.1} \mid \vec{y}\,)$, which is a mixture of $f(\theta_{0.1} \mid M_1, \vec{y}\,)$ and $f(\theta_{0.1} \mid M_2, \vec{y}\,)$ weighted by $P(M_1 \mid \vec{y}\,) = .049$ and $P(M_2 \mid \vec{y}\,) = .951$, respectively, as shown in Figure 4. In the figure, the left panel shows an approximation of $f(\theta_{0.1} \mid M_1, \vec{y}\,)$, the middle panel shows an approximation of $f(\theta_{0.1} \mid M_2, \vec{y}\,)$, and the right panel shows an approximation of the averaged posterior $f(\theta_{0.1} \mid \vec{y}\,)$. The averaged posterior density $f(\theta_{0.1} \mid \vec{y}\,)$ is bimodal, but it is very close to $f(\theta_{0.1} \mid M_2, \vec{y}\,)$ because the quantal-linear model $M_2$ fits the data better than the logistic model $M_1$ by a Bayes factor of $\frac{P(M_2 \mid \vec{y}\,)}{P(M_1 \mid \vec{y}\,)} = \frac{.951}{.049} = 19.4$. The 5th percentile of the model-averaged posterior distribution is approximately 5.97, and it is a BMA-BMD based on the BMA method proposed by Raftery et al. [1] and the BMD estimation method suggested by Crump [12].
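The model-averaged mean of Eq. (17) and the Bayes factor are simple weighted computations. A minimal sketch using the statistics reported above:

```python
# Posterior model probabilities and model-specific posterior means of
# theta_0.1 as reported above (M1 = logistic, M2 = quantal-linear).
post_prob = {"M1": 0.049, "M2": 0.951}
post_mean = {"M1": 20.95, "M2": 8.25}

# Model-averaged posterior mean, Eq. (17).
bma_mean = sum(post_mean[m] * post_prob[m] for m in post_prob)

# Bayes factor of M2 against M1 (equal prior model probabilities).
bayes_factor = post_prob["M2"] / post_prob["M1"]
```

Percentiles, by contrast, do not average this way: the 5th percentile of the mixture must be read off the approximated mixture density itself, which is why the draws from both models are needed.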

### 4. Application of Bayesian decision theory to Phase I trials

In a Phase I cancer trial, the main objectives are to study the safety of a new chemotherapy and to determine an appropriate dose for future patients. Since trial participants are cancer patients, dose allocations require ethical considerations. Whitehead and Williams discussed several Bayesian approaches to dose allocation [14]. One decision rule is devised from the perspective of trial participants (individual-level ethics), and another decision rule is devised from the perspective of future patients (population-level ethics). However, a decision rule devised from population-level ethics is not widely accepted in current practice [15]. Instead, there are some proposed decision rules which compromise between the individual- and population-level perspectives [3, 16]. In this section, we discuss the two conflicting perspectives in Phase I clinical trials and a compromising method based on Bayesian decision theory.

Assume a dose-response relationship follows a logistic model

$$
\pi_x = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \tag{18}
$$

where x is a dose on the logarithmic scale (base e) and $\pi_x$ is the probability of observing an adverse event due to the toxicity of a new chemotherapy at dose x. The logarithmic transformation of the dose is to satisfy $\pi_x \to 0$ as the dose tends to 0. Let $\vec{x}_n = (x_1, \ldots, x_n)$ denote a series of decisions for n patients (i.e., allocated doses) and $\vec{y}_n = (y_1, \ldots, y_n)$ denote a series of observed responses, where $y_i = 1$ indicates an adverse event and $y_i = 0$ otherwise. Let $L(\vec{\beta}, x_{n+1})$ denote the loss incurred by allocating the next patient at $x_{n+1}$. Based on Bayesian decision theory, we want to find $x_{n+1}$ which minimizes the posterior mean of $L(\vec{\beta}, x_{n+1})$. If we let $\mathcal{A}$ denote an action space, the set of all possible dose allocations for the next patient, the decision rule can be written as follows:

$$\mathbf{x}\_{n+1}^{\*} = \operatorname{argmin}\_{\mathbf{x}\_{n+1} \in \mathcal{A}} E\left(L(\vec{\boldsymbol{\beta}}, \mathbf{x}\_{n+1}) \mid \vec{\boldsymbol{y}}\_{n}\right). \tag{19}$$

A choice of L has a substantial impact on the operating characteristics of a Phase I trial, including (i) the degree of under- and over-dosing in the trial, (ii) the observed number of adverse events at the end of the trial, and (iii) the quality of estimation at the end of the trial.

#### 4.1. Parameter of interest: maximum tolerable dose

Let N denote an available sample size for a Phase I clinical trial. A typical sample size is N ≤ 30. Let γ denote a target risk level, the probability of an adverse event. In a cancer study, a typical target risk level γ is fixed between .15 and .35 depending on the severity of an adverse event. Then, the dose corresponding to γ is called a maximum tolerable dose (MTD) at level γ, and we denote it by θγ in the logarithmic scale. Under the logistic model in Eq. (18), it is defined as follows

$$
\theta_\gamma = \frac{\log\left(\frac{\gamma}{1-\gamma}\right) - \beta_0}{\beta_1}. \tag{20}
$$

At the end of a trial (after observing N responses), we estimate $\theta_\gamma$ by the posterior mean $\hat{\theta}_{\gamma, N} = E(\theta_\gamma \mid \vec{y}_N)$ for future patients.
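Eq. (20) can be evaluated directly. A minimal sketch (the parameter values below are illustrative, not from a real trial):

```python
import math

def mtd(gamma, b0, b1):
    """MTD at target risk gamma under the logistic model, Eq. (20):
    theta_gamma = (logit(gamma) - beta0) / beta1, on the log-dose scale."""
    return (math.log(gamma / (1.0 - gamma)) - b0) / b1

# Illustrative parameter values: beta0 = -2, beta1 = 5. Raising the
# tolerated risk level gamma raises the MTD, since beta1 > 0.
theta_20 = mtd(0.20, -2.0, 5.0)
theta_35 = mtd(0.35, -2.0, 5.0)
```

At γ = .5 the logit term vanishes, so the MTD reduces to $-\beta_0/\beta_1$, matching the ED50 formula of Section 2.2.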

#### 4.2. Prior density function: conditional mean priors

A consequence of sequential decisions heavily depends on the prior density function $f(\vec{\beta}\,)$. In particular, the first decision $x_1$ must be made based on prior knowledge only, because empirical evidence has not yet been observed. In addition, the later decisions $x_2, x_3, \ldots$ and the final inference of $\theta_\gamma$ are substantially affected by $f(\vec{\beta}\,)$, as a Phase I study is typically based on a small sample. In this regard, we want to carefully utilize researchers' prior knowledge about $\vec{\beta}$, but it may be difficult to express their prior knowledge directly through $f(\vec{\beta}\,)$. In this section, we discuss a method of eliciting prior knowledge which is more tractable than prior elicitation directly on $\vec{\beta}$. Suppose a researcher selects two arbitrary doses, say $x_{-1} < x_0$. Then, the researcher may express their prior knowledge through two independent beta distributions

$$
\pi_{x_i} = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \sim \mathrm{Beta}(a_i, b_i), \ i = -1, 0. \tag{21}
$$

Using the Jacobian transformation from $(\pi_{x_{-1}}, \pi_{x_0})$ to $\vec{\beta} = (\beta_0, \beta_1)$, it can be shown that the prior density function of $\vec{\beta}$ is given by

$$f(\overrightarrow{\beta}) \propto (x_0 - x_{-1}) \prod_{i=-1}^{0} \left( \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{a_i} \left( \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{b_i} . \tag{22}$$

It is known as conditional mean priors under the logistic model [7].
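The prior in Eq. (22) is easy to evaluate on the log scale: each elicited $\mathrm{Beta}(a_i, b_i)$ behaves like $a_i$ pseudo-events and $b_i$ pseudo-non-events observed at dose $x_i$. A minimal sketch, up to an additive constant (the function name and default elicitation values are ours):

```python
import math

def log_cmp(b0, b1, x=(-1.0, 0.0), a=(1.0, 1.0), b=(1.0, 1.0)):
    """Log conditional mean prior, Eq. (22), up to an additive constant.
    The Beta(a_i, b_i) elicited at dose x_i contributes
    a_i * eta_i - (a_i + b_i) * log(1 + exp(eta_i)),
    where eta_i = b0 + b1 * x_i."""
    lp = 0.0
    for xi, ai, bi in zip(x, a, b):
        eta = b0 + b1 * xi
        lp += ai * eta - (ai + bi) * math.log1p(math.exp(eta))
    return lp

# With flat Beta(1, 1) elicitations, the log prior at beta = (0, 0)
# is -4 log 2 (each of the two doses contributes -2 log 2).
val = log_cmp(0.0, 0.0)
```

This pseudo-data view is exactly what makes the conjugacy of Section 4.3 work: the elicited $a_i$ and $b_i$ enter the posterior in the same way as real observed responses.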

#### 4.3. Posterior density function: conjugacy


For notational convenience, we let $y_i = a_i$ and $n_i = a_i + b_i$ for $i = -1, 0$. By conjugacy, the posterior density function of $\vec{\beta}$ can be written concisely as follows

$$f(\overrightarrow{\beta} \mid \overrightarrow{y}_n) \propto \frac{e^{\beta_0 s_1 + \beta_1 s_2}}{\prod_{i=-1}^n (1 + e^{\beta_0 + \beta_1 x_i})}\,, \tag{23}$$

where $s_1 = \sum_{i=-1}^{n} y_i$ and $s_2 = \sum_{i=-1}^{n} x_i y_i$. After observing n responses, the decision rule for the next patient is as follows

$$x_{n+1}^{*} = \operatorname{argmin}_{x_{n+1} \in \mathcal{A}} \int L(\overrightarrow{\beta}, x_{n+1})\, f(\overrightarrow{\beta} \,|\, \overrightarrow{y}_n) \, d\overrightarrow{\beta}. \tag{24}$$

#### 4.4. Loss functions for individual- and population-level ethics

A loss function, which reflects the perspective of individual-level ethics, is as follows:

$$L_I(\overrightarrow{\beta}, x_{n+1}) = (x_{n+1} - \theta_{\gamma})^2. \tag{25}$$

This loss function is analogous to the original continual reassessment method proposed by O'Quigley et al. [17]. The square error loss attempts to treat a trial participant at θγ, and the expected square error loss is minimized by the posterior mean of θγ.
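To make the decision rule concrete, here is a minimal numerical sketch of Eqs. (23)–(25): the posterior is approximated on a crude grid over $(\beta_0, \beta_1)$, and the next dose is chosen to minimize the posterior expected squared-error loss. All data, pseudo-data, grid bounds, and candidate doses below are hypothetical.

```python
import math

# Hypothetical trial history on the log-dose scale. The first two entries are
# conditional-mean-prior pseudo-data: a_i successes out of n_i = a_i + b_i trials.
doses  = [-4.0, 4.0, 1.0, 2.0]   # x_{-1}, x_0, then two treated patients
resp   = [1.0, 3.0, 0.0, 1.0]    # y_{-1} = a_{-1}, y_0 = a_0, then observed y_1, y_2
trials = [4.0, 4.0, 1.0, 1.0]    # n_i for pseudo-data, 1 for real patients

def log_post(b0, b1):
    """Unnormalized log posterior in the spirit of Eq. (23)."""
    lp = 0.0
    for x, y, n in zip(doses, resp, trials):
        eta = b0 + b1 * x
        lp += y * eta - n * math.log1p(math.exp(eta))
    return lp

# Crude grid approximation of the posterior over (beta0, beta1).
grid = [(-3 + 6 * i / 80, 0.05 + 2 * j / 80) for i in range(81) for j in range(81)]
w = [math.exp(log_post(b0, b1)) for b0, b1 in grid]
Z = sum(w)

gamma = 0.2
logit = math.log(gamma / (1 - gamma))

def exp_loss(x_next):
    """Posterior mean of the squared-error loss L_I = (x_{n+1} - theta_gamma)^2."""
    s = 0.0
    for (b0, b1), wi in zip(grid, w):
        theta = (logit - b0) / b1
        s += wi * (x_next - theta) ** 2
    return s / Z

candidates = [0.5 * k for k in range(-8, 17)]   # action space A: doses -4.0, ..., 8.0
best = min(candidates, key=exp_loss)            # decision rule of Eq. (24)
print("next dose:", best)
```

The grid approximation replaces the integral in Eq. (24) by a weighted sum; in practice one would use a finer grid or Monte Carlo sampling.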

From the perspective of population-level ethics, Whitehead and Brunier proposed a loss function, which is equal to the asymptotic variance of the maximum likelihood estimator for θγ [18]. The Fisher expected information matrix with a sample of size n + 1 is given by

$$\mathcal{I}(\overrightarrow{\beta}) = \begin{pmatrix} \sum_{i=1}^{n+1} \tau_i & \sum_{i=1}^{n+1} \tau_i x_i \\ \sum_{i=1}^{n+1} \tau_i x_i & \sum_{i=1}^{n+1} \tau_i x_i^2 \end{pmatrix} \tag{26}$$

where $\tau_i = \pi_{x_i} (1 - \pi_{x_i})$. Then, the loss function (the asymptotic variance) is given by

$$L_P(\overrightarrow{\beta}, x_{n+1}) = \left[\nabla h(\overrightarrow{\beta})\right]^T \left[\mathcal{I}(\overrightarrow{\beta})\right]^{-1} \left[\nabla h(\overrightarrow{\beta})\right], \tag{27}$$

where

$$\nabla h(\overrightarrow{\beta}) = \begin{pmatrix} \frac{\partial \theta_{\gamma}}{\partial \beta_0} \\ \frac{\partial \theta_{\gamma}}{\partial \beta_1} \end{pmatrix} = -\frac{1}{\beta_1} \begin{pmatrix} 1 \\ \theta_{\gamma} \end{pmatrix} \tag{28}$$

is the gradient vector, i.e., the partial derivatives of $\theta_{\gamma}$ with respect to $\beta_0$ and $\beta_1$. Kim and Gillen decomposed the population-level loss function as follows

$$L_P(\overrightarrow{\beta}, x_{n+1}) = \frac{\tau_{n+1} (x_{n+1} - \theta_{\gamma})^2 + s_n^{(0)} \left[ (\theta_{\gamma} - \mu_n)^2 + \sigma_n^2 \right]}{\left[ s_n^{(0)} s_n^{(2)} - \left(s_n^{(1)}\right)^2 \right] + s_n^{(0)} \tau_{n+1} \left[ (x_{n+1} - \mu_n)^2 + \sigma_n^2 \right]}, \tag{29}$$

where

$$\begin{aligned} s\_n^{(m)} &= \sum\_{i=1}^n \tau\_i \mathbf{x}\_i^m, \quad m = 0, 1, 2, \\ \mu\_n &= \sum\_{i=1}^n w\_i \mathbf{x}\_i \\ \sigma\_n^2 &= \sum\_{i=1}^n w\_i \mathbf{x}\_i^2 - \left(\sum\_{i=1}^n w\_i \mathbf{x}\_i\right)^2 \end{aligned} \tag{30}$$

with the weight defined as $w_i = \tau_i / \sum_{j=1}^{n} \tau_j$ [3]. Eq. (29) admits the following important remarks. In fact, $L_P(\overrightarrow{\beta}, x_{n+1})$ considers individual-level ethics by including $L_I(\overrightarrow{\beta}, x_{n+1}) = (x_{n+1} - \theta_\gamma)^2$ in the numerator. By including $(x_{n+1} - \mu_n)^2$ in the denominator, where $\mu_n = \sum_{i=1}^{n} w_i x_i$, the population-level loss function reduces the loss by allocating the next patient farther away from the weighted average of previously allocated doses (i.e., it is devised for information gain). In the long run, $L_P(\overrightarrow{\beta}, x_{n+1})$ yields a compromise between individual- and population-level ethics, but the compromising process is too slow to be implemented in a small-sample Phase I clinical trial [3].
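The algebraic identity behind Eq. (29) can be checked numerically against the direct matrix form of Eqs. (26)–(28). The parameter values and dose history below are hypothetical; the comparison ignores the constant factor $1/\beta_1^2$ coming from the gradient in Eq. (28), which does not depend on the candidate dose $x_{n+1}$ and therefore does not affect the minimization.

```python
import math

# Hypothetical parameter values and dose history (log scale).
b0, b1, gamma = -3.0, 0.8, 0.2
theta = (math.log(gamma / (1 - gamma)) - b0) / b1   # MTD under the logistic model
xs = [0.0, 1.0, 2.5, 3.0]                           # doses given to the first n patients
x_next = 2.0                                        # candidate dose for patient n + 1

def tau(x):
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p * (1.0 - p)

# Direct route via Eqs. (26)-(28): [grad h]^T I^{-1} [grad h] equals
# (S2 - 2*theta*S1 + theta^2*S0) / det(I), up to the factor 1/b1^2.
S0 = sum(tau(x) for x in xs) + tau(x_next)
S1 = sum(tau(x) * x for x in xs) + tau(x_next) * x_next
S2 = sum(tau(x) * x * x for x in xs) + tau(x_next) * x_next ** 2
direct = (S2 - 2 * theta * S1 + theta ** 2 * S0) / (S0 * S2 - S1 * S1)

# Decomposed route via Eqs. (29)-(30).
s0 = sum(tau(x) for x in xs)
s1 = sum(tau(x) * x for x in xs)
s2 = sum(tau(x) * x * x for x in xs)
mu = s1 / s0                       # weighted average of allocated doses
var = s2 / s0 - mu ** 2
t = tau(x_next)
decomposed = (t * (x_next - theta) ** 2 + s0 * ((theta - mu) ** 2 + var)) / \
             ((s0 * s2 - s1 * s1) + s0 * t * ((x_next - mu) ** 2 + var))

print(abs(direct - decomposed))    # the two routes agree up to floating-point error
```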

#### 4.5. Loss function for compromising the two perspectives

Kim and Gillen proposed to accelerate the compromising process by modifying $L_P(\overrightarrow{\beta}, x_{n+1})$ of Eq. (29) as follows

$$L_{B,\lambda}(\overrightarrow{\beta}, x_{n+1}) = \frac{a_n(\lambda)\, \tau_{n+1} (x_{n+1} - \theta_{\gamma})^2 + s_n^{(0)} \left[ (\theta_{\gamma} - \mu_n)^2 + \sigma_n^2 \right]}{\left[ s_n^{(0)} s_n^{(2)} - \left(s_n^{(1)}\right)^2 \right] + s_n^{(0)} \tau_{n+1} \left[ (x_{n+1} - \mu_n)^2 + \sigma_n^2 \right]}, \tag{31}$$

where


$$a_n(\lambda) = \left(1 + \frac{n}{N}\right)^{\lambda \sum_{i=1}^{n} y_i} \tag{32}$$

is an accelerating factor [3]. It has two implications. First, the compromising process is accelerated toward individual-level ethics as the trial proceeds (i.e., as n increases). Second, the compromising process toward individual-level ethics is accelerated at a faster rate when an adverse event is observed (i.e., as $\sum_{i=1}^{n} y_i$ increases). The tuning parameter λ controls the rate of acceleration. It imposes more emphasis on population-level ethics as λ → 0 and more emphasis on individual-level ethics as λ → ∞. The choice of λ should depend on the severity of the adverse event.

#### 4.6. Simulation

To study the operating characteristics of $L_{B,\lambda}$ with respect to λ, we assume the logistic model with β0 = −3 and β1 = .8 as the true dose-response relationship, as shown in the left panel of Figure 5. The target risk level is fixed at γ = .2, so the true MTD is $\theta_{.2} = 2.02$ on the logarithmic scale. We consider three different priors based on the conditional mean priors given in Eq. (22). For simplicity, we set $a_{-1} = 1$, $b_{-1} = 3$, $a_0 = 3$ and $b_0 = 1$ for all three priors. Then, we let $x_{-1} = -4$ and $x_0 = 4$ for Prior 1; $x_{-1} = 0$ and $x_0 = 8$ for Prior 2; and $x_{-1} = 4$ and $x_0 = 12$ for Prior 3. The right panel of Figure 5 shows an approximated $f(\theta_{.2})$ for each prior. Prior 1 significantly underestimates the true $\theta_{.2} = 2.02$ with prior mean $E(\theta_{.2}) = -1.70$, Prior 3 overestimates the truth with $E(\theta_{.2}) = 5.38$, and Prior 2 has a prior estimate relatively close to the truth with $E(\theta_{.2}) = 1.40$.
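The quoted true MTD follows directly from inverting the logistic model at the target risk level, as a quick check:

```python
import math

# True parameter values from the simulation setup.
b0, b1, gamma = -3.0, 0.8, 0.2
theta = (math.log(gamma / (1 - gamma)) - b0) / b1   # solve pi_x = gamma for x
print(round(theta, 2))   # 2.02
```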

Let N = 20 be a fixed sample size. Let $Y_i = 1$ denote an adverse event observed from the ith patient ($Y_i = 0$ otherwise), so $\sum_{i=1}^{N} Y_i$ denotes the total number of adverse events observed at the end of a trial. The sum $\sum_{i=1}^{N} Y_i$ is random from trial to trial, and we want $\sum_{i=1}^{N} Y_i$ to behave like Binomial(20, .2), which is the case when we treat all N = 20 patients at the true MTD $\theta_{.2}$. Figure 6 shows three simulated trials under the loss function $L_{B,\lambda}$ with λ = 0, 1, 5. When λ = 0,

Figure 5. The true dose-response relationship $\pi_x = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ with β0 = −3 and β1 = .8 (where x is the dose on the logarithmic scale) in the simulation (left panel) and the three prior distributions of $\theta_{.2}$ approximated by kernel density (right panel).

Figure 6. Three simulated trials using the loss function $L_{B,\lambda}$ with λ = 0 (left), λ = 1 (middle) and λ = 5 (right), with a sample of size N = 20 and assumed parameter values β0 = −3, β1 = .8 and $\theta_{.2} = 2.02$.

the up-and-down scheme has a high degree of fluctuation in order to maximize information about θ.2. When λ = 1, the up-and-down scheme is stabilized after the first few adverse events, and the stabilization occurs quickly when λ = 5 to treat trial participants near an estimated θ.2.

Let $\hat{\theta}_{.2} = E(\theta_{.2} \mid \overrightarrow{y}_N)$, the posterior estimate of $\theta_{.2}$ at the end of a trial, so $\pi_{\hat{\theta}_{.2}}$ is the true probability of an adverse event at the estimated MTD. We focus on the following criteria: (i) $E(\pi_{\hat{\theta}_{.2}})$, which we desire to be close to γ = .2 for future patients; (ii) $V(\pi_{\hat{\theta}_{.2}})$, which we desire to be as low as possible for future patients; (iii) $E[(\pi_{\hat{\theta}_{.2}} - .2)^2]$, which we desire to be as low as possible for future patients; (iv) $E(\sum_{i=1}^{20} Y_i)$, which we desire to be close to Nγ = 4 for trial participants; and (v) $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$, which we desire to be close to one for trial participants.
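Criterion (v) has a natural benchmark: even if every patient were treated exactly at the true MTD, $\sum_{i=1}^{20} Y_i$ would be Binomial(20, .2), and the benchmark probability can be computed directly:

```python
from math import comb

# Distribution of the adverse-event count when all N patients are dosed at the true MTD.
N, g = 20, 0.2

def pmf(k):
    return comb(N, k) * g ** k * (1 - g) ** (N - k)

p_3_to_5 = sum(pmf(k) for k in range(3, 6))   # P(3 <= sum Y_i <= 5)
print(N * g, round(p_3_to_5, 3))              # mean 4.0; the probability is about 0.6
```

So even under ideal dosing, criterion (v) cannot exceed roughly 0.6 for this design.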

Bayesian Model Averaging and Compromising in Dose-Response Studies http://dx.doi.org/10.5772/intechopen.68786 181


Table 1. Simulation results of 10,000 replicates for λ = 0, .5, 1, 2, 5 and each prior.

Table 1 summarizes the simulation results of 10,000 replicates for each prior. For all three priors, we observe similar tendencies. First, $E(\pi_{\hat{\theta}_{.2}})$ gets closer to γ = .2 as λ increases. Second, $V(\pi_{\hat{\theta}_{.2}})$ decreases as λ decreases to zero. The average square distance between $\pi_{\hat{\theta}_{.2}}$ and γ = .2 measures a balance between $|E(\pi_{\hat{\theta}_{.2}}) - .2|$ and $V(\pi_{\hat{\theta}_{.2}})$, and the superiority depends on the priors. Lastly, as λ → 0, we have a larger $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$ and an $E(\sum_{i=1}^{20} Y_i)$ that is more robust to prior elicitation.

In summary, when we place more emphasis on population-level ethics, we have a smaller variance in the estimation for future patients (with a greater absolute bias, potentially due to Jensen's inequality), and the distribution of $\sum_{i=1}^{N} Y_i$ becomes more robust to prior elicitations. When we place more emphasis on individual-level ethics, we have a larger variance in the estimation, and the distribution of $\sum_{i=1}^{N} Y_i$ becomes more sensitive to prior elicitations.

### 5. Consensus prior


In Bayesian inference, researchers are able to utilize information that is independent of the observed data. This allows researchers to incorporate any form of information, such as one's experience and the existing literature, which may be particularly useful in a small-sample study. On the other hand, subjectivity and prior sensitivity are concerns with sparse data. Furthermore, it is possible to have disagreement among multiple researchers' prior elicitations about a parameter θ.

Suppose there are K researchers with their own prior density functions, say $f(\theta|Q_k)$ for $k = 1, \ldots, K$, and they have the same likelihood function $f(\overrightarrow{y}|\theta)$. Each prior elicitation leads to a unique Bayes estimator

$$
\hat{\theta}_k = E(\theta|\overrightarrow{y}, Q_k) = \int \theta f(\theta|\overrightarrow{y}, Q_k) \, d\theta, \tag{33}
$$

where $f(\theta|\overrightarrow{y}, Q_k) \propto f(\overrightarrow{y}|\theta) f(\theta|Q_k)$ is the posterior density function of θ given data $\overrightarrow{y}$ and the kth prior elicitation $Q_k$. For posterior estimation, one reasonable approach to compromise is a weighted average $\sum_{k=1}^{K} w_k \hat{\theta}_k$, where $w_k > 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K} w_k = 1$. In this section, we discuss two different weighting methods. The first method is to fix $w_k$ before observing data (referred to as the prior weighting scheme). The second method is to determine $w_k(\overrightarrow{y})$ after observing data $\overrightarrow{y}$ so that $w_k(\overrightarrow{y})$ increases when the kth prior elicitation $Q_k$ is better supported by the observed data $\overrightarrow{y}$ (referred to as the posterior weighting scheme) [5].

For a prior weighting scheme, we denote $w_k = P(Q_k)$, which quantifies the credibility of the kth prior elicitation. For a posterior weighting scheme, we consider

$$w\_k(\overrightarrow{y}) = P(\mathbf{Q}\_k|\overrightarrow{y}) = \frac{f(\overrightarrow{y}|\mathbf{Q}\_k)P(\mathbf{Q}\_k)}{\sum\_{j=1}^{K} f(\overrightarrow{y}|\mathbf{Q}\_j)P(\mathbf{Q}\_j)} = \frac{w\_k f(\overrightarrow{y}|\mathbf{Q}\_k)}{\sum\_{j=1}^{K} w\_j f(\overrightarrow{y}|\mathbf{Q}\_j)},\tag{34}$$

where $f(\overrightarrow{y}|Q_k) = \int f(\overrightarrow{y}|\theta)\, f(\theta|Q_k) \, d\theta$ is the marginal likelihood from the kth prior elicitation. This formulation is similar to the BMA method discussed in Section 3. It can be shown that $\sum_{k=1}^{K} w_k(\overrightarrow{y})\, \hat{\theta}_k$ is the Bayes estimator (the posterior mean of θ) when a consensus prior $f(\theta) = \sum_{k=1}^{K} w_k f(\theta|Q_k)$ is used with $w_k = P(Q_k)$ [5].
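This identity can be checked numerically in a Beta–Bernoulli setting: the weighted average of component posterior means, with posterior weights from Eq. (34), matches the posterior mean under the mixture (consensus) prior computed by brute-force integration. The two prior elicitations and the data below are hypothetical.

```python
import math

# Two hypothetical prior elicitations Q1, Q2 for a Bernoulli parameter, equal weights,
# and hypothetical data: s = 7 successes in n = 10 trials.
priors = [(1.0, 3.0), (3.0, 1.0)]          # (a_k, b_k)
w = [0.5, 0.5]
n, s = 10, 7

def marginal(a, b):
    """Beta-Bernoulli marginal likelihood of the observed sequence."""
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + math.lgamma(a + s) + math.lgamma(b + n - s) - math.lgamma(a + b + n))

m = [marginal(a, b) for a, b in priors]
w_post = [wi * mi for wi, mi in zip(w, m)]
w_post = [v / sum(w_post) for v in w_post]                 # posterior weights, Eq. (34)
theta_hat = [(a + s) / (a + b + n) for a, b in priors]     # component posterior means
compromise = sum(wi * ti for wi, ti in zip(w_post, theta_hat))

# Cross-check: posterior mean under the consensus (mixture) prior, via Riemann sum.
num = den = 0.0
K = 20000
for i in range(1, K):
    p = i / K
    prior = sum(wi * p ** (a - 1) * (1 - p) ** (b - 1) /
                math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
                for wi, (a, b) in zip(w, priors))
    like = p ** s * (1 - p) ** (n - s)
    num += p * prior * like
    den += prior * like
print(compromise, num / den)   # the two values agree
```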

Samaniego discussed self-consistency when compromised inference is used through the prior weighting scheme $\sum_{k=1}^{K} w_k \hat{\theta}_k$ [4]. Let θ denote a parameter of interest and

$$E(\theta) = \int \theta f(\theta) \, d\theta = \theta^\* \tag{35}$$

be the prior expectation, the mean of the prior density function f(θ). Let $\tilde{\theta}$ denote a sufficient statistic that serves as an unbiased estimator for θ. When $E(\theta \mid \tilde{\theta} = \theta^*) = \theta^*$ is satisfied, it is called self-consistency [4].

Self-consistency can be achieved under simple models. For example, let $\overrightarrow{Y} = (Y_1, \ldots, Y_n)$ be a random sample, where $Y_i \sim \mathrm{Bernoulli}(\theta)$, and assume the prior $\theta \sim \mathrm{Beta}(a, b)$. It can be shown that the maximum likelihood estimator $\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is a sufficient statistic and an unbiased estimator for θ. The posterior mean is a weighted average of θ* and $\tilde{\theta}$ as follows

$$E(\theta \mid \tilde{\theta}) = c\, \theta^* + (1 - c)\, \tilde{\theta}, \tag{36}$$

where $c = \frac{a+b}{a+b+n}$. If we observe $\tilde{\theta} = \theta^*$, we achieve self-consistency because $E(\theta \mid \tilde{\theta} = \theta^*) = \theta^*$. In words, when the prior estimate and the maximum likelihood estimate are identical, the posterior estimate must be consistent with the prior estimate and the maximum likelihood estimate. Self-consistency can also be achieved in the prior weighting scheme under certain conditions, as illustrated in the following example.
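A minimal check of Eq. (36) and the self-consistency property, with hypothetical hyper-parameters:

```python
# Beta(a, b) prior on a Bernoulli parameter, n observations (values assumed).
a, b, n = 2.0, 3.0, 10
theta_star = a / (a + b)            # prior mean, theta*
c = (a + b) / (a + b + n)

def posterior_mean(theta_tilde):
    # With s = n * theta_tilde observed successes, E(theta | y) = (a + s) / (a + b + n),
    # which is exactly c * theta_star + (1 - c) * theta_tilde, Eq. (36).
    return (a + n * theta_tilde) / (a + b + n)

assert abs(posterior_mean(0.7) - (c * theta_star + (1 - c) * 0.7)) < 1e-12
print(posterior_mean(theta_star), theta_star)   # identical: self-consistency
```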

#### 5.1. Binomial experiment


Let $Y_i \sim \mathrm{Bernoulli}(\pi)$ for $i = 1, \ldots, n$ and assume $Y_1, \ldots, Y_n$ are independent. Suppose the kth researcher specifies the prior distribution $\pi | Q_k \sim \mathrm{Beta}(a_k, b_k)$ for $k = 1, \ldots, K$. For the prior weighting scheme, let $w_k = P(Q_k)$, the prior probability for the kth prior elicitation (fixed before observing data). Since $E(\pi|Q_k) = \frac{a_k}{a_k + b_k}$ and the expectation E(·) is a linear operator, the mean of the consensus prior is

$$E(\pi) = \int\_0^1 \pi f(\pi) \, d\pi = \int\_0^1 \pi \left(\sum\_{k=1}^K f(\pi | \mathbb{Q}\_k) \, P(\mathbb{Q}\_k)\right) \, d\pi = \sum\_{k=1}^K w\_k \left(\int\_0^1 \pi f(\pi | \mathbb{Q}\_k) \, d\pi\right) = \sum\_{k=1}^K w\_k \, E(\pi | \mathbb{Q}\_k) \,. \tag{37}$$

Let $E(\pi) = \pi^*$ and suppose the K researchers observed the consistent result $\tilde{\pi} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \pi^*$. The individual-specific Bayes estimator is as follows

$$
\hat{\pi}\_k = E(\pi | \tilde{\pi} = \pi^\*, Q\_k) = c\_k \, E(\pi | Q\_k) + (1 - c\_k) \, \pi^\*,\tag{38}
$$

for the kth researcher, where $c_k = \frac{a_k + b_k}{a_k + b_k + n}$. The compromised Bayes estimator is as follows

$$E(\boldsymbol{\pi}|\boldsymbol{\tilde{\pi}}=\boldsymbol{\pi}^\*)\quad =\sum\_{k=1}^{K} w\_k \,\boldsymbol{\hat{\pi}}\_k = \sum\_{k=1}^{K} w\_k \left[c\_k E(\boldsymbol{\pi}|\boldsymbol{Q}\_k) + (1-c\_k)\,\boldsymbol{\pi}^\*\right].\tag{39}$$

If we allow individual-specific prior elicitations $a_k$ and $b_k$ with the restriction $a_k + b_k = m$ for all K researchers (i.e., the same strength of prior elicitation), the value $c_k = \frac{m}{m+n}$ is constant over all researchers. By letting the constant $c_k = c$,

$$E(\boldsymbol{\pi}|\tilde{\boldsymbol{\pi}}=\boldsymbol{\pi}^\*) \quad = c\left(\sum\_{k=1}^K \boldsymbol{w}\_k \boldsymbol{E}(\boldsymbol{\pi}|\boldsymbol{Q}\_k)\right) + (1-c)\,\boldsymbol{\pi}^\* \left(\sum\_{k=1}^K \boldsymbol{w}\_k\right) = c\,\boldsymbol{E}(\boldsymbol{\pi}) + (1-c)\,\boldsymbol{\pi}^\* = \boldsymbol{\pi}^\*,\quad(40)$$

so the self-consistency is satisfied.

For the posterior weighting scheme given data $\overrightarrow{y} = (y_1, \ldots, y_n)$, the marginal likelihood from the kth prior elicitation is as follows

$$f(\overrightarrow{y} \mid Q_k) = \int_0^1 f(\overrightarrow{y} \mid \pi) f(\pi \mid Q_k) \, d\pi = \frac{\Gamma(a_k + b_k)}{\Gamma(a_k)\, \Gamma(b_k)} \, \frac{\Gamma(a_k + s)\, \Gamma(b_k + n - s)}{\Gamma(a_k + b_k + n)} \, , \tag{41}$$

where $s = \sum_{i=1}^{n} y_i$ is the observed sufficient statistic. Then, the posterior weighting scheme becomes $\sum_{k=1}^{K} w_k(\overrightarrow{y})\, \hat{\pi}_k$ with

$$\begin{aligned} w\_k(\overrightarrow{y}) &= \frac{w\_k f(\overrightarrow{y}|Q\_k)}{\sum\_{j=1}^K w\_j f(\overrightarrow{y}|Q\_j)} \\ \hat{\pi}\_k &= \frac{a\_k + s}{a\_k + b\_k + n}. \end{aligned} \tag{42}$$

If we desire an equal strength from each researcher's prior elicitation, we may fix $a_k + b_k = m$ and $w_k = \frac{1}{K}$. In the posterior weighting scheme, it is difficult to achieve self-consistency.
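Eqs. (41) and (42) can be sketched as follows. The first assertion is an exact check of Eq. (41) under a uniform prior, where the marginal of the sequence (1, 0) reduces to $\int_0^1 \pi(1-\pi)\,d\pi = 1/6$; the Case 1 priors from the text are then combined with hypothetical data (s = 8 successes out of n = 10).

```python
import math

def marginal(a, b, n, s):
    """Closed form of Eq. (41) for an observed sequence with s successes in n trials."""
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + math.lgamma(a + s) + math.lgamma(b + n - s) - math.lgamma(a + b + n))

# Exact check: uniform prior (a = b = 1), sequence (1, 0) -> integral of pi*(1-pi) = 1/6.
assert abs(marginal(1, 1, 2, 1) - 1 / 6) < 1e-12

# Posterior weights of Eq. (42) for the Case 1 priors, equal prior weights w_k = 1/2,
# with hypothetical data s = 8 out of n = 10 (the equal w_k cancel in the weights).
n, s = 10, 8
cases = [(1.0, 3.0), (3.0, 1.0)]
m = [marginal(a, b, n, s) for a, b in cases]
w_post = [mi / sum(m) for mi in m]
pi_hat = [(a + s) / (a + b + n) for a, b in cases]   # component posterior means
estimate = sum(wi * pi for wi, pi in zip(w_post, pi_hat))
print(w_post, estimate)
```

With 8 successes in 10 trials, the data favor the second elicitation (prior mean .75), so its posterior weight dominates.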

Whether or not self-consistency is satisfied, the practical concern is the quality of estimation, such as bias, variance and mean square error. Assuming K = 2 researchers have disagreeing prior knowledge and a sample of size n = 10, let us consider three cases. Suppose the two researchers express relatively mild disagreement as $(a_1, b_1) = (1, 3)$ and $(a_2, b_2) = (3, 1)$ in Case 1, relatively strong disagreement as $(a_1, b_1) = (2, 6)$ and $(a_2, b_2) = (6, 2)$ in Case 2, and even stronger disagreement as $(a_1, b_1) = (3, 9)$ and $(a_2, b_2) = (9, 3)$ in Case 3. For each case, Figure 7 provides the relative bias, variance and mean square error (MSE) comparing the posterior weighting scheme $\sum_{k=1}^{2} w_k(\overrightarrow{y})\, \hat{\pi}_k$ to the prior weighting scheme $\sum_{k=1}^{2} w_k\, \hat{\pi}_k$. When a relative MSE is smaller than one, it implies a smaller MSE for the posterior weighting scheme. When the true value of π is well between the two prior guesses $E(\pi|Q_1) = .25$ and $E(\pi|Q_2) = .75$, the posterior weighting scheme shows a greater MSE due to greater variance. When the true value of π deviates away from either prior guess, the posterior weighting scheme shows a smaller MSE due to smaller bias. The tendency is stronger when the two disagreeing prior elicitations are stronger (i.e., stronger prior disagreement). The bottom line is a clear bias-variance tradeoff

Figure 7. Comparing prior and posterior weighting schemes for different degrees of disagreements.

when we compare the two weighting schemes. The posterior weighting scheme $\sum_{k=1}^{2} w_k(\overrightarrow{y})\, \hat{\pi}_k$ is able to reduce bias when there is a strong discrepancy between the consensus prior and the data, but it has a larger variance than $\sum_{k=1}^{2} w_k\, \hat{\pi}_k$ because $w_k(\overrightarrow{y})$ depends on the random data.

#### 5.2. Applications to Phase I trials under logistic regression model


In this section, we apply the prior weighting scheme and the posterior weighting scheme to Phase I clinical trials under the logistic regression model. We consider the three priors from Section 4.6 and denote Priors 1, 2 and 3 by $Q_1$, $Q_2$ and $Q_3$, respectively. The three priors share the same hyper-parameters $a_{-1,k} = 1$, $b_{-1,k} = 3$, $a_{0,k} = 3$ and $b_{0,k} = 1$, but they differ by $x_{-1,k} = -4, 0, 4$ and $x_{0,k} = 4, 8, 12$ for $k = 1, 2, 3$, respectively. By the use of the conditional mean prior in Eq. (22), the prior density function of $\overrightarrow{\beta}$ for prior $Q_k$ is given by

$$f(\overrightarrow{\beta} \,|\, Q_k) \propto (x_{0,k} - x_{-1,k}) \prod_{i=-1}^{0} \left(\frac{e^{\beta_0 + \beta_1 x_{i,k}}}{1 + e^{\beta_0 + \beta_1 x_{i,k}}}\right)^{a_{i,k}} \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_{i,k}}}\right)^{b_{i,k}}. \tag{43}$$

The prior means were $E(\theta_{.2}|Q_1) = -1.70$, $E(\theta_{.2}|Q_2) = 1.40$ and $E(\theta_{.2}|Q_3) = 5.38$ for Priors 1, 2 and 3, respectively.

For the simulation study, we consider three simulation scenarios with sample size $N = 20$. In Scenario 1, we assume $\beta_0 = -5$ and $\beta_1 = 0.6$, so the true MTD is $\theta_{.2} = 6.02$, which deviates significantly from all three prior means. In Scenario 2, we assume $\beta_0 = -3$ and $\beta_1 = 0.8$ as in Section 4.6, so $\theta_{.2} = 2.02$ is well surrounded by the three prior means. In Scenario 3, we assume $\beta_0 = -1$ and $\beta_1 = 1.2$, so $\theta_{.2} = -0.32$ is close to the most conservative prior mean $E(\theta_{.2}|Q_1) = -1.70$. We consider the loss function $L_I(\overrightarrow{\beta}, x_{n+1}) = (x_{n+1} - \theta_{.2})^2$ discussed in Section 4.4, which focuses on individual-level ethics. We use the uniform prior probabilities $w_k = P(Q_k) = 1/3$ for $k = 1, 2, 3$ for implementing both the prior and posterior weighting schemes.
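Under the logistic model $\mathrm{logit}\,\Pr(Y=1|x) = \beta_0 + \beta_1 x$, the MTD for target toxicity $\gamma$ is obtained by inverting the logit. A minimal Python sketch (assuming $\gamma = 0.2$, the target used throughout the chapter) reproduces the three scenario values:

```python
from math import log

def mtd(gamma, b0, b1):
    """Dose x at which Pr(Y = 1 | x) = gamma under logit(Pr) = b0 + b1 * x."""
    return (log(gamma / (1 - gamma)) - b0) / b1

# The (beta_0, beta_1) pairs of the three simulation scenarios
scenarios = [(-5, 0.6), (-3, 0.8), (-1, 1.2)]
mtds = [round(mtd(0.2, b0, b1), 2) for b0, b1 in scenarios]  # [6.02, 2.02, -0.32]
```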

Table 2 provides the simulation results of 10,000 replicates for each scenario under the prior weighting scheme and under the posterior weighting scheme. Since the posterior weighting scheme adaptively updates $w_k(\overrightarrow{y})$ based on empirical evidence, it can reduce bias, but it has greater variance in the estimation of $\theta_{.2}$. As a consequence, when the true MTD was close to one extreme prior estimate (Scenarios 1 and 3), the use of the posterior weighting scheme yields a smaller $E[(\pi_{\hat{\theta}_{.2}} - 0.2)^2]$, an $E(\sum_{i=1}^{20} Y_i)$ closer to $N\gamma = 4$, and a $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$ closer to one when compared to the use of the prior weighting scheme. In Scenario 3, the average number of adverse events was 4.6 for the posterior weighting scheme, but it was as high as 7.1 in the prior weighting scheme. On the other hand, when the true MTD was well surrounded by the three prior estimates (Scenario 2), the use of the prior weighting scheme yielded more plausible results.

The simulation results are analogous to the simpler model in Section 5.1. When the true parameter is not well surrounded by prior guesses, the posterior weighting scheme is preferable with respect to mean square error due to smaller bias. When the true parameter is well surrounded by prior guesses, the prior weighting scheme is beneficial with respect to mean square error due to smaller variance.


| Scenario | Method | $E(\pi_{\hat{\theta}_{.2}})$ | $V(\pi_{\hat{\theta}_{.2}})$ | $E[(\pi_{\hat{\theta}_{.2}} - 0.2)^2]$ | $E(\sum_{i=1}^{20} Y_i)$ | $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$ |
|---|---|---|---|---|---|---|
| 1 | Prior weighting | 0.0967 | 0.0014 | 0.0121 | 1.1090 | 0.0398 |
| 1 | Posterior weighting | 0.1853 | 0.0073 | 0.0075 | 2.7304 | 0.5900 |
| 2 | Prior weighting | 0.2018 | 0.0059 | 0.0059 | 3.8432 | 0.9042 |
| 2 | Posterior weighting | 0.2048 | 0.0110 | 0.0110 | 4.2848 | 0.8920 |
| 3 | Prior weighting | 0.2929 | 0.0071 | 0.0157 | 7.1090 | 0.0568 |
| 3 | Posterior weighting | 0.1951 | 0.0133 | 0.0133 | 4.6036 | 0.8646 |

Table 2. Simulation results of 10,000 replicates for the prior and posterior weighting schemes.

As a final comment, we shall be careful about the strength of individual prior elicitations when we implement the posterior weighting scheme in Phase I clinical trials. The strength of an individual prior elicitation depends on (i) the hyper-parameters $a_{i,k}$ and $b_{i,k}$, (ii) the prior weight $w_k = P(Q_k)$, as well as (iii) the distance between the two arbitrarily chosen doses $x_{0,k} - x_{-1,k}$. It can be seen through the expression

$$f(\overrightarrow{\beta}) = \sum_{k=1}^{K} f(\overrightarrow{\beta}\,|\,Q_k)\, P(Q_k) \propto \sum_{k=1}^{K} w_k \left(x_{0,k}-x_{-1,k}\right)\prod_{i=-1}^{0}\left(\frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}\right)^{a_{i,k}}\left(\frac{1}{1+e^{\beta_0+\beta_1 x_i}}\right)^{b_{i,k}}.\tag{44}$$

When researchers determine consensus prior elicitations before initiating a trial, the multiplicative term $w_k\,(x_{0,k} - x_{-1,k})$ shall be carefully considered together with the hyper-parameters $a_{i,k}$ and $b_{i,k}$ [5].

### 6. Concluding remarks

In this chapter, we have discussed Bayesian inference with averaging, balancing, and compromising in sparse data. In cancer risk assessment, we have observed that low-dose inference can be very sensitive to an assumed parametric model (Section 3.1). In this case, Bayesian model averaging can be a useful method: it provides robustness by using multiple models and posterior model probabilities to account for model uncertainty. In the application of Bayesian decision theory to Phase I clinical trials, we have observed that the sequential sampling scheme heavily depends on the loss function. A loss function devised from individual-level ethics focuses on the benefit of trial participants, while a loss function devised from population-level ethics focuses on the benefit of future patients. It is possible to balance the two conflicting perspectives, and we can adjust the focus through a tuning parameter (Sections 4.5 and 4.6). Finally, the use of a weighted posterior estimate can be a compromising method when two or more researchers have prior disagreement. We have compared the prior and posterior weighting schemes in a small-sample binomial problem (Section 5.1) and in a small-sample Phase I clinical trial (Section 5.2). The prior weighting scheme (data-independent weights) outperforms when prior estimates surround the truth, and the posterior weighting scheme (data-dependent weights) outperforms when the truth is not well surrounded by prior estimates. Neither method outperforms the other for all parameter values, so it is important to be aware of their bias-variance tradeoff.

### Author details

Steven B. Kim

Address all correspondence to: stkim@csumb.edu

Department of Mathematics and Statistics, California State University, CA, United States

### References



[14] Whitehead J, Williamson D. Bayesian decision procedures based on logistic regression models for dose-finding studies. Journal of Biopharmaceutical Statistics. 1998;8:445-467

[15] O'Quigley J, Conaway M. Continual reassessment and related dose-finding designs. Statistical Science. 2010;25:202-216

[16] Bartroff J, Lai TL. Incorporating individual and collective ethics into Phase I cancer trial designs. Biometrics. 2011;67:596-603

[17] O'Quigley J, Pepe M, Fisher L. Continual reassessment method: A practical design for Phase 1 clinical trials in cancer. Biometrics. 1990;46:33-48

[18] Whitehead J, Brunier H. Continual reassessment method: Bayesian decision procedures for dose determining experiments. Statistics in Medicine. 1995;14:885-893

### **Two Examples of Bayesian Evidence Synthesis with the Hierarchical Meta-Regression Approach**

Pablo Emilio Verde

DOI: 10.5772/intechopen.70231 (http://dx.doi.org/10.5772/intechopen.70231)

Additional information is available at the end of the chapter

#### Abstract

This is the Information Age. We can expect that for a particular research question that is empirically testable, we should have a collection of evidence which indicates the best way to proceed. Unfortunately, this is not the case in several areas of empirical research and decision making. Instead, when researchers and policy makers ask a specific question, such as "What is the effectiveness of a new treatment?", the structure of the evidence available to answer this question may be complex and fragmented (e.g. published experiments may have different grades of quality, observational data, subjective judgments, etc.).

Meta-analysis is a branch of statistical techniques that helps researchers to combine evidence from a multiplicity of indirect sources. A main hurdle in meta-analysis is that we not only combine results from a diversity of sources but we also combine their multiplicity of biases. Therefore, commonly applied meta-analysis methods, e.g. random-effects models, could be misleading.

In this chapter we present a new method for meta-analysis that we have called: the "Hierarchical Meta-Regression" (HMR). The HMR is an integrated approach for evidence synthesis when a multiplicity of bias, coming from indirect and disparate evidence, has to be incorporated in a meta-analysis.

Keywords: Bayesian hierarchical models, meta-analysis, multi-parameters evidence synthesis, conflict of evidence, randomized control trials, retrospective studies

### 1. Introduction

In today's information age one can expect that the digital revolution can create a knowledge-based society surrounded by global communications that influence our world in an efficient and convenient way. It is recognized that never in human history have we accumulated

© The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

an astronomical amount of data, and we keep on generating data at an alarming rate. A new term, "big data," was coined to indicate the existence of "oceans of data" from which we may expect to extract useful information for any problem of interest.

In this technological society, one could expect that for a particular research question we should have a collection of high quality evidence which indicates the best way to proceed. Paradoxically, this is not the case in several areas of empirical research and decision making. Instead, when researchers and policy makers ask a specific and important question, such as "What is the effectiveness of a new treatment?", the structure of the evidence available to answer this question may be complex and fragmented (e.g., published experiments may have different grades of quality, observational data, subjective judgments, etc.). The way researchers interpret this multiplicity of evidence will be the basis for their understanding of reality, and it will determine their future decisions.

Bayesian meta-analysis, which has its roots in the work of Eddy et al. [1], is a branch of statistical techniques for interpreting and displaying results of different sources of evidence, exploring the effects of biases and assessing the propagation of uncertainty into a coherent statistical model. A gentle introduction of this area can be found in Chap. 8 of Spiegelhalter et al. [2] and a recent review in Verde and Ohmann [3].

In this chapter we present a new method for meta-analysis that we have called: the "Hierarchical Meta-Regression" (HMR). The aim of HMR is to have an integrated approach for bias modeling when disparate pieces of evidence are combined in meta-analysis, for instance randomized and non-randomized studies or studies with different qualities. This is a different application of Bayesian inference than those applications with which we could be familiar, for instance an intricate regression model, where the available data bear directly upon the question of interest.

We are going to discuss two recent meta-analyses in clinical research. The reason for highlighting these two cases is that they illustrate a main problem in evidence synthesis, which is the presence of a multiplicity of bias in systematic reviews.

### 1.1. An example of meta-analysis of therapeutic trials

The first example is a meta-analysis of 31 randomized controlled trials (RCTs) with two treatment groups of heart disease patients, where the treatment group received bone marrow stem cells and the control group a placebo treatment (Nowbar et al. [4]). The data of this meta-analysis appear in the Appendix, see Table 1. Figure 1 presents the forest plot of these 31 trials, where the treatment effect is measured as the difference in ejection fraction between groups, which measures the improvement of left ventricular function in the heart.

At the bottom of Figure 1 we see average summaries represented by two diamonds: the first corresponds to the fixed-effect meta-analysis model. This model is based on the assumption that studies are identical and the between-study variability is zero. The wider diamond represents the results of a random-effects meta-analysis model, which assumes a substantial heterogeneity between studies. In this meta-analysis both models confirmed a positive treatment effect, with a mean difference of 3.95 (95% CI [3.43, 4.47]) and 2.92 (95% CI [1.47, 4.36]), respectively.
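For readers who want to reproduce such pooled summaries, the fixed-effect estimate is simply an inverse-variance weighted mean of the study effects. A minimal Python sketch of the standard formula (a generic illustration with made-up toy numbers, not the Nowbar et al. data) is:

```python
def fixed_effect(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in ses]          # w_i = 1 / SE_i^2
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5          # SE of the pooled estimate
    return pooled, pooled_se

# Toy example: two studies with equal precision
est, se = fixed_effect([1.0, 3.0], [1.0, 1.0])  # est = 2.0
```

A random-effects model adds an estimate of the between-study variance to each study's weight, which widens the pooled interval, as the second diamond in Figure 1 illustrates.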


Figure 1. Meta-analysis results of studies applying treatments based on bone marrow stem cells to improve the left ventricular function.

Could we conclude that we have enough evidence to demonstrate the efficacy of the treatment? Unfortunately, these apparently confirmatory results are completely misleading. The problem is that these 31 studies are very heterogeneous, which resulted in a wide 95% prediction interval [-4.33; 10.16] covering the no-treatment effect, and in a large amount of contradictory evidence displayed in Figure 1.

In order to explain the sources of heterogeneity in this area, Nowbar et al. [4] investigated whether detected discrepancies in published trials might account for the variation in reported effect sizes. They define a discrepancy in a trial as two or more reported facts that cannot both be true because they are logically or mathematically incompatible. In other words, the term discrepancy is a polite way to indicate that a published study suffers from poor reporting, could be implausible, or its results have been manipulated. For example, as we see at the bottom of Table 1 in the appendix, it would be difficult to believe the results of a study with 55 discrepancies. In Section 2 we present an HMR model to analyze a possible link between the risk of reporting bias and the number of discrepancies.

### 1.2. An example of meta-analysis of diagnostic trials

The topic of Section 3 is the meta-analysis of diagnostic trials. These trials play a central role in personalized medicine, policy making, healthcare and health economics. Figure 2 presents our example in this area. The scatter plot shows the diagnostic summaries of a meta-analysis investigating the diagnostic accuracy of computer tomography scans in the diagnosis of appendicitis [5]. Each circle identifies the true positive rate vs. the false positive rate of each study, where the different circle sizes indicate different sample sizes. One characteristic of this meta-analysis is the combination of disparate data: of the 51 studies, 22 were retrospective and 29 were prospective, which is indicated by the different grey scales of the circles.

The main problem in this area is the multiple sources of variability behind those diagnostic results. Diagnostic studies are usually performed under different diagnostic setups and patient populations. For a particular diagnostic technique we may have a small number of studies which may differ in their statistical design, their quality, etc. Therefore, the main question in meta-analysis of diagnostic tests is: How can we combine the multiplicity of diagnostic accuracy rates in a single coherent model? A possible answer to this question is the HMR presented in Section 3. This model has been introduced by Verde [5] and is available in the R package bamdit [6].

Figure 2. Display of the meta-analysis results of studies performing computer tomography scans in the diagnosis of appendicitis. Each circle identifies the true positive rate vs. the false positive rate of each study. Different colors are used for different study designs and different diameters for sample sizes.

### 2. A Hierarchical Meta-Regression model to assess reported bias


Figure 3 shows the reported effect size and the 95% confidence intervals of the 31 trials from [4] against the number of discrepancies (on a logarithmic scale). The authors reported a positive, statistically significant correlation between the effect size and the number of discrepancies detected in the papers. However, a direct correlation analysis of aggregated results is threatened by ecological bias and may lead to misleading conclusions. The amount of variability presented by the 95% confidence intervals is too large to accept a positive correlation at face value. In this section we present an HMR model to link the risk of reporting bias with the number of reported discrepancies. This model assumes that the connection between discrepancies and effect size could be much more subtle.

The starting point of any meta-analytic model is the description of a model for the pieces of evidence at face value; in statistical terms, this means the likelihood of the parameter of interest. Let y1, …, yN and SE1, …, SEN be the reported effect sizes and their corresponding standard errors. We assume a normal likelihood for θi, the treatment effect of study i:

Figure 3. Relationship between effect size and number of discrepancies. The vertical axis corresponds to the effect size, the treatment group received a treatment based on bone marrow stem cells and the control group a placebo treatment. The horizontal axis corresponds to the number of discrepancies (in the logarithmic scale) found in the publication.

$$y_i | \theta_i \sim \mathcal{N}(\theta_i, \text{SE}_i^2), \quad i = 1, \ldots, N. \tag{1}$$

If a prior assumption of exchangeability was considered reasonable, a random effects Bayesian model incorporates all the studies into a single model, where the θ1, …, θ<sup>N</sup> are assumed to be a random sample from a prior distribution with unknown parameters, which is known as a hierarchical model.

In this section we assume that exchangeability is unrealistic and we wish to learn how the unobserved treatment effects θ1, …, θ<sup>N</sup> are linked with some observed covariate xi.

Let xi be the number of observed discrepancies in the logarithmic scale. We propose to model the association between the treatment effect θ<sup>i</sup> and the observed discrepancies xi with the following HMR model:

$$\theta_i | I_i, x_i \sim I_i\, N\left(\mu_{\text{biased}}, \tau^2\right) + (1 - I_i)\, N\left(\mu, \tau^2\right), \tag{2}$$

where the non-observable variable Ii indicates if study i is at risk of bias:

$$I\_i|\mathbf{x}\_i = \begin{cases} 1 & \text{if study } i \text{ is biased} \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

The parameter μ corresponds to the mean treatment effect of studies with low risk of bias. We assume that in our context of application biased studies could report higher effect sizes and the biased mean μbiased can be expressed as:

$$
\mu_{\text{biased}} = \mu + K, \text{ with } K > 0. \tag{4}
$$

In this way, K measures the average amount of bias with respect to the mean effect μ. Eq. (4) also ensures that μ and μbiased are identifiable parameters in this model. The parameter τ measures the between-studies variability in both components of the mixture distributions.

We model the probability that a study is biased as a function of xi as follows:

$$\text{logit}(\text{Pr}(I\_i = 1|\mathbf{x}\_i)) = \alpha\_0 + \alpha\_1 \mathbf{x}\_i. \tag{5}$$

In Eq. (5) positive values of α<sup>1</sup> indicate that an increase in the number of discrepancies is associated with an increased risk of study bias.

In this HMR model the conditional mean is given by

$$\operatorname{E}\left(\theta|\mathbf{x}\_{i}\right) = \operatorname{Pr}\left(I\_{i} = 1|\mathbf{x}\_{i}\right)\mu\_{\text{biased}} + \left(1 - \operatorname{Pr}(I\_{i} = 1|\mathbf{x}\_{i})\right)\mu. \tag{6}$$

Eqs. (5) and (6) can be calculated as functional parameters for a grid of values of x. Their posterior intervals are calculated at each value of x.

This HMR not only quantifies the average bias K and the relationship between bias and discrepancies in Eq. (5), but also allows us to correct the treatment effect θ<sup>i</sup> by its propensity of being biased:


$$\theta\_i^{\text{corrected}} = (\theta\_i - K) \mathbf{Pr}(I\_i = 1|\mathbf{x}\_i) + \theta\_i (1 - \mathbf{Pr}(I\_i = 1|\mathbf{x}\_i)), \tag{7}$$

where the amount (θ<sup>i</sup> � K) measures the bias of study i and Pr(Ii = 1|xi) its propensity of being biased.
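The quantities in Eqs. (5)–(7) are deterministic functions of the model parameters, so their posterior summaries can be obtained by evaluating them at each MCMC draw. A minimal Python sketch (with made-up parameter values chosen only for illustration) is:

```python
from math import exp

def pr_biased(x, a0, a1):
    """Eq. (5): probability that a study with log-discrepancies x is biased."""
    return 1 / (1 + exp(-(a0 + a1 * x)))

def cond_mean(x, mu, k_bias, a0, a1):
    """Eq. (6): conditional mean effect, with mu_biased = mu + K (Eq. 4)."""
    p = pr_biased(x, a0, a1)
    return p * (mu + k_bias) + (1 - p) * mu

def corrected_effect(theta, x, k_bias, a0, a1):
    """Eq. (7): effect corrected by the study's propensity of being biased."""
    p = pr_biased(x, a0, a1)
    return (theta - k_bias) * p + theta * (1 - p)
```

Evaluating these functions over a grid of x values at each posterior draw yields the pointwise posterior intervals described above.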

The HMR model presented above is completed by the following vague hyper-priors: for the regression parameters, $\alpha_0, \alpha_1 \sim \mathrm{N}(0, 100)$; for the mean, $\mu \sim \mathrm{N}(0, 100)$; and for the bias parameter, $K \sim \mathrm{Uniform}(0, 50)$. Finally, for the between-studies variability we use $\tau \sim \mathrm{Uniform}(0, 100)$, which represents a vague prior within the range of possible study deviations.

The model presented in this section is not analytically tractable. We approximated the posterior distributions of the model parameters with Markov chain Monte Carlo (MCMC) techniques implemented in OpenBUGS.

BUGS stands for Bayesian inference Using Gibbs Sampling. The OpenBUGS software constructs a Directed Acyclic Graph (DAG) representation of the posterior distribution of all model parameters. This representation automatically factorizes the joint distribution as a product over each node (parameter or data) conditional on its parents, so that the full conditional of each node depends only on its parents and children. The software scans each node and proposes a sampling method; the kernel of the Gibbs sampler is built upon this algorithm.

Computations were performed with the statistical language R and MCMC computations were linked to R with the package R2OpenBUGS. We used two chains of 20,000 iterations and we discarded the first 5000 for the burn-in period. Convergence was assessed visually by using the R package coda.
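The chapter runs its MCMC in OpenBUGS through R2OpenBUGS. Purely to illustrate the mechanics of burn-in and posterior averaging mentioned above, here is a minimal random-walk Metropolis sketch in Python for a toy common-mean model (not the HMR model itself; the data and tuning are made up):

```python
import math
import random

random.seed(1)

# Toy data: observed effect sizes with a common standard error (hypothetical)
effects = [0.2, -0.1, 0.4, 0.1, 0.0, 0.3, -0.2, 0.25]
se = 0.3

def log_post(mu):
    """Log-posterior of the common mean under a flat prior and N(mu, se^2) likelihood."""
    return sum(-0.5 * ((y - mu) / se) ** 2 for y in effects)

def metropolis(n_iter=20000, burn_in=5000, step=0.2):
    mu, chain = 0.0, []
    for i in range(n_iter):
        prop = mu + random.gauss(0.0, step)      # random-walk proposal
        if math.log(random.random()) < log_post(prop) - log_post(mu):
            mu = prop                            # accept the proposal
        if i >= burn_in:                         # discard the burn-in period
            chain.append(mu)
    return chain

chain = metropolis()
print(sum(chain) / len(chain))   # posterior mean of mu
```

With 20,000 iterations and a 5000-iteration burn-in (the settings used in the chapter), the chain mean approximates the posterior mean of the common effect.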

The diagonal panels of Figure 4 summarize the resulting posterior distributions for $\mu$, $K$, $\tau$, $\alpha_0$ and $\alpha_1$. The posterior of $\mu$ clearly covers zero, indicating that the stem cells treatment is not effective. The bias parameter $K$ indicates a considerable over-estimation of the treatment effects reported in some trials. The posterior of $\alpha_1$ is concentrated on positive values, which indicates that an increase in discrepancies is associated with an increase in the risk of reporting bias. The posteriors of $\alpha_0$ and $\alpha_1$ also present a large variability, which is expected when a hidden effect is modeled.

Further results of the Hierarchical Meta-Regression model appear in Figure 5, where posterior 95% intervals are plotted against the number of discrepancies. The left panel shows the relationship between the number of discrepancies and the probability that a study is biased: the probability increases with the number of discrepancies, but with a large amount of variability. The right panel shows the conditional mean of the effect size as a function of the number of discrepancies, which corresponds to Eq. (6). Our analysis shows that the 95% posterior intervals of the conditional mean cover the zero effect over most of the range of discrepancies. Only for studies with more than 33 (exp(3.5)) discrepancies does the model predict a positive effect. One interesting result of this analysis is that a horizontal line, which would represent a zero correlation, is also compatible with the model. This means that the regression calculated directly from the aggregated data contains an ecological bias and is misleading. We have added this regression line to the plot to highlight this issue.

Figure 4. Posterior distributions for the hyper-parameters of the HMR model. The diagonal displays the posterior distributions, the upper panels the pairwise correlations and the lower panels the pairwise posterior densities.

The results presented so far indicate that an increase in the number of discrepancies increases the propensity of bias. The question is: how can we correct a particular study for its bias? Eq. (7) gives the bias correction of the treatment effect in this HMR model.

In Figure 6 we can see the HMR bias correction in action. We display two studies with 21 and 18 discrepancies, respectively. The solid lines correspond to the likelihood functions of these studies; these likelihoods represent the information about the effect size at face value. The dashed lines correspond to the posterior treatment effects after bias correction. Clearly, we can see a strong bias correction, with the conclusion of no treatment effect.

Two Examples of Bayesian Evidence Synthesis with the Hierarchical Meta-Regression Approach http://dx.doi.org/10.5772/intechopen.70231 197

Figure 5. Results of the Hierarchical Meta-Regression model. The posterior median and 95% intervals are displayed as solid lines. Left panel: relationship between the number of discrepancies and probability that a study is biased. Right panel: conditional mean of effect size as a function of the number of discrepancies.


196 Bayesian Inference


Figure 6. Bias correction for two studies with 21 and 18 discrepancies respectively. The solid lines correspond to the likelihood functions of effect sizes. The dashed lines represent the posteriors for treatment effect after bias correction.

### 3. Hierarchical Meta-Regression analysis for diagnostic test data

In meta-analysis of diagnostic test data, the pieces of evidence that we aim to combine are the results of $N$ diagnostic studies, where the results of the $i$th study ($i = 1, \ldots, N$) are summarized in a 2 × 2 table as follows:

|                | Patient status: with disease | Patient status: without disease |
|----------------|------------------------------|---------------------------------|
| Test outcome + | $tp_i$                       | $fp_i$                          |
| Test outcome − | $fn_i$                       | $tn_i$                          |
| Sum            | $n_{i,1}$                    | $n_{i,2}$                       |

where $tp_i$ and $fn_i$ are the numbers of positive and negative diagnostic results among the $n_{i,1}$ patients with the disease, and $fp_i$ and $tn_i$ are the positive and negative diagnostic results among the $n_{i,2}$ patients without the disease.

Assuming that $n_{i,1}$ and $n_{i,2}$ have been fixed by design, we model the $tp_i$ and $fp_i$ outcomes with two independent Binomial distributions:

$$tp\_i \sim B(\text{TPR}\_i, n\_{i,1}) \quad \text{and} \quad fp\_i \sim B(\text{FPR}\_i, n\_{i,2}), \tag{8}$$

where $\text{TPR}_i$ is the true positive rate or sensitivity, $Se_i$, of study $i$, and $\text{FPR}_i$ is the false positive rate or complementary specificity, i.e., $1 - Sp_i$.

At face value, the diagnostic performance of each study is summarized by the empirical true positive rate and the true negative rate (or specificity),

$$
\widehat{\text{TPR}}\_i = \frac{tp\_i}{n\_{i,1}} \quad \text{and} \quad \widehat{\text{TNR}}\_i = \frac{tn\_i}{n\_{i,2}} \tag{9}
$$

and by the complementary empirical false positive and false negative rates,

$$
\widehat{\text{FPR}}\_i = \frac{fp\_i}{n\_{i,2}} \quad \text{and} \quad \widehat{\text{FNR}}\_i = \frac{fn\_i}{n\_{i,1}}.\tag{10}
$$

In this type of meta-analysis we could model $\text{TPR}_i$ and $\text{FPR}_i$ (or $Sp_i$) separately, but this approach ignores that these rates could be correlated by design. Therefore, it is more sensible to handle $\text{TPR}_i$ and $\text{FPR}_i$ jointly.

We define the random effect $D_i$, which represents the study effect associated with the diagnostic discriminatory power:

$$D\_i = \log\left(\frac{\text{TPR}\_i}{1 - \text{TPR}\_i}\right) - \log\left(\frac{\text{FPR}\_i}{1 - \text{FPR}\_i}\right). \tag{11}$$

However, diagnostic results are sensitive to diagnostic settings (e.g., the use of different thresholds) and to populations where the diagnostic procedure under investigation is applied. These issues are associated with the external validity of diagnostic results. To model external validity bias we introduce the random effect Si:

$$\mathbf{S}\_{i} = \log\left(\frac{\mathbf{TPR}\_{i}}{1 - \mathbf{TPR}\_{i}}\right) + \log\left(\frac{\mathbf{FPR}\_{i}}{1 - \mathbf{FPR}\_{i}}\right). \tag{12}$$

This random effect quantifies the variability produced by patients' characteristics and the diagnostic setup, which may produce a correlation between the observed $\widehat{\text{TPR}}$s and $\widehat{\text{FPR}}$s. In short, we call $S_i$ the threshold effect of study $i$; it represents an adjustment for external validity in the meta-analysis.
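A small sketch (with made-up counts) of how Eqs. (9)-(12) map a single study's 2 × 2 table to its empirical rates and to the random effects $D_i$ and $S_i$; the function name is ours, for illustration:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def study_effects(tp, fn, fp, tn):
    """Empirical rates (Eqs. 9-10) and the effects D and S (Eqs. 11-12)."""
    n1, n2 = tp + fn, fp + tn      # patients with / without the disease
    tpr, fpr = tp / n1, fp / n2    # sensitivity and complementary specificity
    D = logit(tpr) - logit(fpr)    # diagnostic discriminatory power, Eq. (11)
    S = logit(tpr) + logit(fpr)    # threshold effect, Eq. (12)
    return tpr, fpr, D, S

# Hypothetical study: 90 true positives, 10 false negatives,
# 20 false positives, 80 true negatives
tpr, fpr, D, S = study_effects(90, 10, 20, 80)
print(tpr, fpr, D, S)
```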

We could assume exchangeability of the pairs $(D_i, S_i)$, but study quality is known to be an issue in diagnostic studies. For this reason we model the internal validity of a study by introducing random weights $w_1, \ldots, w_N$. Conditional on a study weight $w_i$, the study effects $D_i$ and $S_i$ are modeled as exchangeable between studies, and they follow a scale mixture of bivariate Normal distributions with the following mean and variance:

$$E\left[\begin{pmatrix} D\_i \\ S\_i \end{pmatrix} \Big| w\_i \right] = \begin{pmatrix} \mu\_D \\ \mu\_S \end{pmatrix} \quad \text{and} \quad \text{var}\left[\begin{pmatrix} D\_i \\ S\_i \end{pmatrix} \Big| w\_i \right] = \frac{1}{w\_i} \begin{pmatrix} \sigma\_D^2 & \rho \sigma\_D \sigma\_S \\ \rho \sigma\_D \sigma\_S & \sigma\_S^2 \end{pmatrix} = \Sigma\_i,\tag{13}$$

and scale mixing density


$$w\_i \sim \text{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right). \tag{14}$$

The inclusion of the random weights $w_i$ into the model was proposed by [5]. This approach was generalized in [6] in two ways: firstly, by splitting $w_i$ into two weights $w_{1,i}$ and $w_{2,i}$ corresponding to the components $D_i$ and $S_i$, respectively; secondly, by putting a prior on the degrees-of-freedom parameter $\nu$, which corresponds to an adaptive robust distribution of the random effects.
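The reason the weights of Eqs. (13)-(14) robustify the model: mixing a normal distribution over $w_i \sim \text{Gamma}(\nu/2, \nu/2)$ yields a Student-t marginal with $\nu$ degrees of freedom, whose heavier tails accommodate outlying studies. A univariate Python sketch (simulation only, with an arbitrary $\nu$):

```python
import random

random.seed(42)
nu, n = 6.0, 200_000

# w ~ Gamma(shape = nu/2, rate = nu/2); Python's gammavariate takes a scale
# parameter, so the scale is 2/nu and E(w) = 1 a priori.
w = [random.gammavariate(nu / 2.0, 2.0 / nu) for _ in range(n)]

# Conditional on w, the effect is N(0, 1/w); marginally it is Student-t(nu).
draws = [random.gauss(0.0, 1.0) / x ** 0.5 for x in w]

mean_w = sum(w) / n
var_t = sum(d * d for d in draws) / n   # should approach nu / (nu - 2)
print(mean_w, var_t)
```

Small weights inflate a study's variance, so conflicting studies are automatically down-weighted rather than distorting the pooled estimate.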

The Hierarchical Meta-Regression representation of the model introduced above is based on the conditional distribution of $(D_i | S_i = x)$ and the marginal distribution of $S_i$. This HMR model was introduced by [7], who followed in the footsteps of the classical Summary Receiver Operating Characteristic (SROC) approach [8].

The conditional mean of (Di|Si = x) is given by:

$$E(D\_i|S\_i = \mathbf{x}) = A + B\mathbf{x} \tag{15}$$

where the functional parameters A and B are

$$A = \mu\_D - B\, \mu\_S \quad \text{and} \quad B = \rho \frac{\sigma\_D}{\sigma\_S}. \tag{16}$$

We define the Bayesian SROC Curve (BSROC) by transforming back results from (S, D) to (FPR, TPR) with

$$\text{BSROC}(\text{FPR}) = \text{g}^{-1} \left[ \frac{A}{(1-B)} + \frac{B+1}{(1-B)} \,\text{g}(\text{FPR}) \right],\tag{17}$$

where $g(p)$ is the logit transformation, i.e., $\text{logit}(p) = \log(p/(1 - p))$.

The BSROC curve is obtained by calculating TPR in a grid of values of FPR which gives a posterior conditionally on each value of FPR. Therefore, it is straightforward to give credibility intervals for the BSROC for each value of FPR.

One important aspect of the BSROC is that it incorporates the variability of the model's parameters, which influences the width of its credibility intervals. In addition, given that FPR is modeled as a random variable, the curve is corrected by measurement error bias in FPR.

Finally, we can define a Bayesian Area Under the SROC Curve (BAUC) by numerically integrating the BSROC for a range of values of the FPR:

$$\text{BAUC} = \int\_0^1 \text{BSROC}(x) \, dx. \tag{18}$$

In some applications it is recommended to restrict the limits of integration to the observed range of $\widehat{\text{FPR}}$ values.
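Given posterior draws of A and B, Eqs. (17)-(18) are straightforward to evaluate numerically. A Python sketch with illustrative (made-up) values of A and B, using trapezoidal integration for the BAUC:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def bsroc(fpr, A, B):
    """Eq. (17): TPR as a function of FPR on the Bayesian SROC curve."""
    return inv_logit(A / (1.0 - B) + (B + 1.0) / (1.0 - B) * logit(fpr))

def bauc(A, B, lo=0.01, hi=0.99, n=1000):
    """Eq. (18): area under the BSROC curve by trapezoidal integration."""
    h = (hi - lo) / n
    xs = [lo + i * h for i in range(n + 1)]
    ys = [bsroc(x, A, B) for x in xs]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

A, B = 2.5, 0.2   # hypothetical posterior medians, for illustration only
print(bsroc(0.1, A, B), bauc(A, B))
```

Repeating this over all MCMC draws of (A, B), rather than a single pair of values, yields the pointwise credibility intervals of the BSROC and the posterior distribution of the BAUC.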

In order to make this complex HMR model applicable in practice, we have implemented it in the R package bamdit, which uses the following set of hyper-priors:

$$
\mu\_D \sim \text{Logistic}(m\_1, v\_1), \quad \mu\_S \sim \text{Logistic}(m\_2, v\_2) \tag{19}
$$

and

$$
\sigma\_D \sim \text{Uniform}(0, u\_1), \quad \sigma\_S \sim \text{Uniform}(0, u\_2). \tag{20}
$$

The correlation parameter ρ is transformed by using the Fisher transformation,

$$z = \text{logit}\left(\frac{\rho + 1}{2}\right) \tag{21}$$

and a Normal prior is used for z:

$$z \sim \mathbf{N}(m\_r, v\_r). \tag{22}$$

Modeling the priors in this way guarantees that in each MCMC iteration the variance-covariance matrix of the random effects $\theta_1$ and $\theta_2$ is positive definite. The values of the constants $m_1$, $v_1$, $m_2$, $v_2$, $u_1$, $u_2$, $m_r$ and $v_r$ have to be given. They can be used to include valid prior information, which might be empirically available or could be the result of expert elicitation. If such information is not available, we recommend setting these parameters to values that represent weakly informative priors. In this work, we use $m_1 = m_2 = m_r = 0$, $v_1 = v_2 = 1$, $u_1 = u_2 = 5$ and $v_r = \sqrt{1.7}$ as a weakly informative prior setup.
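The role of the Fisher transformation in Eqs. (19)-(22) can be checked by simulation: sampling z normally and back-transforming keeps $\rho$ inside $(-1, 1)$, so every sampled variance-covariance matrix is positive definite. An illustrative Python sketch using the weakly informative constants from the text (reading $v_r$ as a standard deviation is our assumption):

```python
import math
import random

random.seed(7)

def inv_fisher(z):
    """Invert Eq. (21): z = logit((rho + 1) / 2)  =>  rho = 2 * sigmoid(z) - 1."""
    return 2.0 / (1.0 + math.exp(-z)) - 1.0

m_r, v_r = 0.0, math.sqrt(1.7)   # constants of the weakly informative setup
u1, u2 = 5.0, 5.0

def draw_covariance():
    """One prior draw of (sigma_D, sigma_S, rho) and the implied 2x2 covariance."""
    sd = random.uniform(0.0, u1)              # sigma_D ~ Uniform(0, u1), Eq. (20)
    ss = random.uniform(0.0, u2)              # sigma_S ~ Uniform(0, u2)
    rho = inv_fisher(random.gauss(m_r, v_r))  # Eqs. (21)-(22)
    return [[sd * sd, rho * sd * ss], [rho * sd * ss, ss * ss]], rho

for _ in range(1000):
    cov, rho = draw_covariance()
    det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
    assert -1.0 < rho < 1.0 and det >= 0.0    # positive (semi-)definite

print("all draws valid")
```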


These values are fairly conservative, in the sense that they induce uniform prior distributions for $\text{TPR}_i$ and $\text{FPR}_i$. They give locally uniform distributions for $\mu_D$ and $\mu_S$, uniform distributions for $\sigma_D$ and $\sigma_S$, and a symmetric distribution for $\rho$ centered at 0.

Figure 7 summarizes the results of fitting the bivariate random-effects model to the computer tomography diagnostic data. The Bayesian Predictive Surface is presented by contours at different credibility levels, and these curves can be compared with the observed data, represented by circles with diameters proportional to the sample size of each study. The scattered points are samples from the posterior predictive distribution, and the histograms correspond to the posterior predictive marginals. This result was generated by using the functions metadiag() and plot() in the R package bamdit.

Figure 7. Results of the meta-analysis: Bayesian Predictive Surface by contours at different credibility levels.

Figure 8 displays the posteriors of each component's weights. The left panel shows that prospective studies number 25 and 33 deviate from the prior mean of 1, while on the right panel we see that one prospective study (number 47) and five retrospective studies (numbers 1, 3, 4, 8 and 29) have substantial variability.

Figure 8. Posterior distributions of the component weights: it is expected that the posterior is centered at 1. Studies with retrospective design tend to present deviations in FPR.


Figure 9. Hierarchical Meta-Regression model: the left panel shows the BSROC curve, where the central line corresponds to the posterior median and the upper and lower curves correspond to the 2.5% and 97.5% quantiles, respectively. The right panel displays the posterior distribution of the area under the BSROC curve.

An important aspect of $w_i$ is its interpretation as an estimated bias correction. A priori, all studies included in the review have a mean of $E(w_i) = 1$. We can expect that studies which are unusually heterogeneous will have posteriors substantially greater than 1. Unusual study results could be produced by factors that affect the quality of the study, such as errors in recording diagnostic results, confounding factors, loss to follow-up, etc. For that reason, the study weights $w_i$ can be interpreted as an adjustment for a study's internal validity bias.

The BSROC curve and its area under the curve are presented in Figure 9. The left panel shows this HMR as a meta-analytic summary of the data. On the right panel, the posterior distribution of the BAUC shows quite a high diagnostic ability of computer tomography scans for the diagnosis of appendicitis.

### 4. Conclusions


In this work we have seen the HMR in action. This approach to meta-analysis is based on a simple strategy: two sub-models are defined in the meta-analysis, one which models the problem of interest, for instance the treatment effect, and one which handles the multiplicity of biases. The meta-analysis is summarized by understanding how these components interact with each other.

The examples presented in this work have shown that we could reach misleading conclusions from indirect evidence if it were analyzed as contributing directly to the problem of interest.

For instance, in the first example (Section 2), we saw in Figure 1 that pooling studies gave a wrong conclusion about the effect of the stem cells treatment: the positive correlation between the aggregated effect size and the number of discrepancies exaggerates their relationship.

Actually, in Figure 5 the HMR has shown that it is possible to simultaneously have a zero correlation between effect size and discrepancies and a risk of reporting bias. In addition, the HMR allows us to extract the amount of bias in the meta-analysis and to correct the treatment effect at the level of the individual study (Figure 6).

In the second example, Section 3, biases come from the external validity of diagnostic studies and the internal validity due to their quality. In this example the HMR showed that it was possible to simultaneously model these two types of subtle biases.

To account for internal validity bias, the application of a scale mixture of normal distributions allows us to detect conflicting studies, which can be considered as outliers. The Bayesian Summary Receiver Operating Characteristic curve accounts for the external validity bias due to changes in factors that affected the diagnostic results. In addition, the posterior for its Area Under the Curve (AUC) summarizes the results of the meta-analysis.

### Acknowledgements

This work was supported by the German Research Foundation project DFG VE 896 1/1.


### Appendix: Source Data for Section 1.1 and Section 2

| Trial ID | Effect size | SE (effect size) | Sample size | Number of discrepancies | Author or principal investigator | Year | Country |
|---|---|---|---|---|---|---|---|
| t01 | 1.5 | 3.67 | 21 | 17 | Quyyumi | 2011 | USA |
| t02 | 1.1 | 2.09 | 100 | 7 | Lunde | 2007 | Norway |
| t03 | 1.7 | 2.91 | 23 | 7 | Srimahachota | 2011 | Thailand |
| t05 | 0.8 | 2.78 | 60 | 4 | Meyer | 2006 | Germany |
| t06 | 7 | 0.63 | 40 | 4 | Meluzín | 2006 | Czech Republic |
| t09 | 7.8 | 2.76 | 38 | 21 | Piepoli | 2010 | Italy |
| t11 | 14 | 4.05 | 20 | 13 | Suárez de Lezo | 2007 | Spain |
| t12 | 5.4 | 2.44 | 77 | 18 | Huikuri HV | 2008 | Finland |
| t13 | 2.7 | 1.2 | 82 | 16 | Perin | 2012 | USA |
| t15 | 4.1 | 0.98 | 46 | 0 | Assmus | 2006 | Germany |

Table 1. Results from 31 randomized controlled trials of heart disease patients, where the treatment group received bone marrow stem cells and the control group a placebo treatment. The source of this table is Nowbar et al. [4].

### Author details

Pablo Emilio Verde

Address all correspondence to: pabloemilio.verde@hhu.de

Coordination Center for Clinical Trials, University of Duesseldorf, Duesseldorf, Germany

### References

[1] Eddy DM, Hasselblad V, Shachter R. Meta-Analysis by the Confidence Profile Method: The Statistical Synthesis of Evidence. San Diego, CA: Academic Press; 1992


[2] Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester, West Sussex: John Wiley & Sons, Ltd.; 2004

[3] Verde PE, Ohmann C. Combining randomized and non-randomized evidence in clinical research: A review of methods and applications. Research Synthesis Methods. 2014;6. DOI: 10.1002/jrsm.1122

[4] Nowbar AN, Mielewczik M, Karavassilis M, Dehbi HM, Shun-Shin MJ, Jones S, Howard JP, Cole GD, Francis DP. Discrepancies in autologous bone marrow stem cell trials and enhancement of ejection fraction (DAMASCENE): Weighted regression and meta-analysis. BMJ. 2014;348:1-9

[5] Verde PE. Meta-analysis of diagnostic test data: A bivariate Bayesian modeling approach. Statistics in Medicine. 2010;29(30):3088-3102

[6] Verde PE. bamdit: An R package for Bayesian meta-analysis of diagnostic test data. Journal of Statistical Software. 2017, in press

[7] Verde PE. Meta-analysis of diagnostic test data: Modern statistical approaches. PhD Thesis, University of Düsseldorf. Deutsche Nationalbibliothek. July, 2008. Available from: http://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=8494

[8] Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: Data-analytic approaches and some additional considerations. Statistics in Medicine. 1993;12:1293-1316

Provisional chapter

## **Bayesian Modeling in Genetics and Genomics**

Hafedh Ben Zaabza, Abderrahmen Ben Gara and Boulbaba Rekik

DOI: 10.5772/intechopen.70167

Additional information is available at the end of the chapter

### Abstract

This chapter provides a critical review of statistical methods applied in animal and plant breeding programs, especially Bayesian methods. Classical and Bayesian procedures are presented for pedigree-based and marker-based models. The flexibility of Bayesian approaches and their high accuracy of prediction of breeding values are illustrated. We show a tendency toward superiority of Bayesian methods over best linear unbiased prediction (BLUP) in accuracy of selection, although some difficulties in the elicitation of complex prior distributions are examined. Genetic models including both marker and pedigree information are more accurate than statistical models based on markers or pedigree alone.

Keywords: accuracy of prediction, breeding value, Bayesian methods, BLUP, pedigree, markers

### 1. Introduction

Quantitative genetics results from the combination of statistics and the principles of animal and plant breeding. In quantitative genetics, selection for economically important traits refers to the use of phenotypic values of the individual and pedigree information. Genomic selection is based on the use of dense markers across the whole genome to predict the breeding value of individuals [1]. Linear models (univariate and multivariate) are of fundamental importance in applied and theoretical quantitative genetics [2]. In animal breeding, two major methods are particularly applied: restricted maximum likelihood (REML) and Bayesian methods. REML has emerged as the method of choice in animal breeding for variance component estimation [3]. Bayesian analysis is gaining popularity because of its more comprehensive assumptions than those of classical approaches and its flexibility in resolving a wide range of biological problems [4, 5]. In the Bayesian approach, the idea is to combine what is known about the statistical ensemble before the data are observed (prior probability distributions) with the information coming from the data, to obtain a posterior distribution from which inferences are made using the standard probability calculus techniques [2, 6]. In recent years, Bayesian methods have been broadly used to solve many of the difficulties faced by conventional statistical methods and to extend the applicability of statistics to animal and plant breeding data [7]. Furthermore, Markov chain Monte Carlo (MCMC) has had an important impact in applied statistics, especially from the Bayesian perspective for the estimation of genetic parameters in the linear mixed effect model [2, 5]. The specific objective of this chapter is to illustrate applications of Bayesian inference in quantitative genetics and genomics. First, Bayesian models in quantitative genetics theory are examined. Second, in the context of genomic selection, we present the details of statistical modeling using BLUP and Bayesian analyses. Third, a critical review with a focus on the prior distributions is presented. Finally, genomic predictions from several methods used in many countries are discussed.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### 2. A brief introduction to Bayesian analyses

In Bayesian inference, the idea is to combine what is known about the statistical ensemble before the data are observed (prior probability distributions) with the information coming from the data, to obtain a posterior distribution from which inferences are made using the standard probability calculus techniques.

$$P(\theta|y) \propto P(y|\theta)\, P(\theta) \tag{1}$$

$P(\theta)$ is the prior distribution, which reflects the relative uncertainty about the possible values of $\theta$ before the data are seen. $P(y|\theta)$ is the likelihood function of the data given the parameter, which represents the contribution of $y$ to knowledge about $\theta$. $P(\theta|y)$ is the posterior distribution of the parameter $\theta$ given the data.
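Eq. (1) can be made concrete with a minimal grid approximation in Python (a made-up binomial example, not taken from the chapter): the posterior is the normalized product of likelihood and prior over a grid of parameter values:

```python
# Posterior over theta (success probability) after y successes in n trials
n_grid = 101
grid = [i / (n_grid - 1) for i in range(n_grid)]

prior = [1.0 / n_grid] * n_grid     # uniform prior P(theta)
y, n = 7, 10                        # hypothetical observed data

def likelihood(theta, y, n):
    """Binomial likelihood P(y | theta), up to a constant."""
    return theta ** y * (1.0 - theta) ** (n - y)

unnorm = [likelihood(t, y, n) * p for t, p in zip(grid, prior)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]   # Eq. (1), normalized: P(theta | y)

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(post_mean, 3))   # close to the exact Beta posterior mean (y+1)/(n+2)
```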

### 3. Bayesian analyses of linear models

#### 3.1. The mixed linear model

The mixed linear model is of great importance in genetics and is one of the most used statistical models. Arguably, variance components and genetic parameters are important because they give an indication of the ability of a species to respond to selection and thus its potential to evolve. The mixed linear model is the simplest method for estimating the variance components of quantitative traits in a population. In the frequentist view, the mixed linear model is one that includes the fixed and random effects linearly. In the Bayesian context, there is no distinction between fixed and random effects. Detailed Bayesian analyses of models with two or more variance components are discussed below.

#### 3.1.1. The univariate linear additive genetic model

The mixed linear model is one that includes fixed and random effects.

Consider the linear model:


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \mathbf{e} \tag{2}$$

y is an $n \times 1$ vector of records on a trait; β is the vector of fixed effects affecting records; a is the vector of additive genetic effects; e is a vector of residual effects. X and Z are incidence matrices relating records to fixed effects and additive genetic effects, respectively. Data are assumed to be generated from the following distribution:

$$\mathbf{y}|\boldsymbol{\beta}, \mathbf{a}, \sigma_e^2 \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a}, \mathbf{I}\sigma_e^2), \qquad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$$

where $\mathbf{I}$ is an identity matrix of order $n \times n$ and $\sigma_e^2$ is the residual variance. Independence of the various effects was assumed for the sake of simplicity in implementation. We assume a genetic model in which genes act additively within and between loci, and there are effectively an infinite number of loci. Under this infinitesimal model, and assuming further initial Hardy-Weinberg and linkage equilibrium, the distribution of additive genetic values conditional on the additive genetic variance is multivariate normal:

$$\mathbf{a}|\mathbf{A}, \sigma\_a^2 \sim N(\mathbf{0}, \mathbf{A}\sigma\_a^2)$$

where $\mathbf{A}$ is the numerator relationship matrix of order $q \times q$; $\boldsymbol{\beta}$ is assumed to have a uniform distribution with bounds $\beta_{\min}$ and $\beta_{\max}$. The variance components are assigned scaled inverted chi-square prior distributions:

$$P(\sigma_i^2|\nu_i, S_i^2) \propto (\sigma_i^2)^{-\left(\frac{\nu_i}{2}+1\right)} \exp\left(-\frac{\nu_i S_i^2}{2\sigma_i^2}\right), \quad (i = a, e)$$

where $\nu_e, S_e^2$ and $\nu_a, S_a^2$ are interpreted as degrees of belief and a priori values for the residual and additive genetic variances. The posterior conditional distributions derived from the likelihood and these prior distributions are:

$b_i|b_{-i}, \mathbf{a}, \sigma_a^2, \sigma_e^2, \mathbf{y} \sim N(\hat{b}_i, (x_i'x_i)^{-1}\sigma_e^2)$, where $x_i'x_i$ is the $i$th diagonal element of $\mathbf{X}'\mathbf{X}$.
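The scaled inverted chi-square prior above can be sampled by drawing $\nu_i S_i^2/\chi^2_{\nu_i}$; for $\nu_i > 2$ its mean is $\nu_i S_i^2/(\nu_i - 2)$. A quick numerical check (the values of $\nu$ and $S^2$ below are purely illustrative):

```python
import numpy as np

# A scaled inverted chi-square draw can be obtained as nu*S2 / chi2(nu); this
# matches the prior density (sigma2)^-(nu/2+1) * exp(-nu*S2/(2*sigma2)) above.
rng = np.random.default_rng(0)
nu, S2 = 10.0, 0.5          # illustrative degrees of belief and prior scale
draws = nu * S2 / rng.chisquare(nu, size=200_000)

# For nu > 2 the theoretical mean is nu*S2 / (nu - 2) = 0.625 here.
print(round(draws.mean(), 3))
```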

#### 3.1.2. The univariate linear additive genetic model with permanent and genetic group effects

The model equation [8] used to estimate genetic parameters and genetic breeding value for milk yield was as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a} + \mathbf{Z}\mathbf{Q}\mathbf{g} + \mathbf{W}\mathbf{p} + \mathbf{e} \tag{3}$$

where y is the vector of milk yields, b is the vector of fixed effects, a is the vector of additive genetic effects, g is the vector of genetic group effects, p is the vector of random permanent environmental effects, and e is the vector of residual effects. X, Z, W, and ZQ are incidence matrices relating records to the fixed environmental effects in b, the random animal effects in a, the random permanent environmental effects in p, and the genetic groups in g, respectively. Writing $\hat{\mathbf{a}}^* = \mathbf{Q}\hat{\mathbf{g}} + \hat{\mathbf{a}}$, $\hat{\mathbf{a}}^*$ is the vector of total breeding values (genetic group plus additive effects), and A is the numerator relationship matrix.

The conditional distribution of observed yield is defined by:

$$\mathbf{y}|\mathbf{b}, \mathbf{p}, \mathbf{a}^*, \sigma_e^2 \sim N(\mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a}^* + \mathbf{W}\mathbf{p}, \mathbf{I}\sigma_e^2)$$

with the assumption that P(b) is constant and $\mathbf{a}^*|\mathbf{A}^*, \sigma_a^2 \sim N(\mathbf{Q}\mathbf{g}, \mathbf{A}^*\sigma_a^2)$;

$$\mathbf{p}|\sigma_p^2 \sim N(\mathbf{0}, \mathbf{I}\sigma_p^2); \quad \text{and} \quad P(\sigma_i^2|\nu_i, S_i^2) \propto (\sigma_i^2)^{-\left(\frac{\nu_i}{2}+1\right)} \exp\left(-\frac{\nu_i S_i^2}{2\sigma_i^2}\right)$$

where $S_i^2$ are prior values for the variances, $\chi_{\nu_i}^{-2}$ are inverted chi-square distributions, and $\nu_i$ are the degrees of freedom of the parameters.

#### 3.1.2.1. Management and environmental effects

The distribution of a fixed effect is:

$$b_i|b_{-i}, \mathbf{a}^*, \sigma_a^2, \sigma_p^2, \sigma_e^2, \mathbf{y} \sim N(\hat{b}_i, (x_i'x_i)^{-1}\sigma_e^2)$$

with $(x_i'x_i)\hat{b}_i = x_i'\mathbf{y} - x_i'\mathbf{X}_{-i}\mathbf{b}_{-i} - x_i'\mathbf{W}\mathbf{p} - x_i'\mathbf{Z}\mathbf{a}^*$,

where $x_i'x_i$ is the $i$th diagonal element of $\mathbf{X}'\mathbf{X}$.

#### 3.1.2.2. Permanent environmental effects

The distribution of a permanent effect is:

$$p_i|\mathbf{b}, p_{-i}, \mathbf{a}^*, \sigma_a^2, \sigma_p^2, \sigma_e^2, \mathbf{y} \sim N(\hat{p}_i, (w_i'w_i + \delta)^{-1}\sigma_e^2)$$

with $(w_i'w_i + \delta)\hat{p}_i = w_i'\mathbf{y} - w_i'\mathbf{X}\mathbf{b} - w_i'\mathbf{W}_{-i}\mathbf{p}_{-i} - w_i'\mathbf{Z}\mathbf{a}^*$,

where $w_i'w_i$ is the $i$th diagonal element of $\mathbf{W}'\mathbf{W}$ and $\delta$ is the variance ratio $\sigma_e^2/\sigma_p^2$.

#### 3.1.2.3. Breeding values

The distribution of a breeding value is:

$$a_i^*|\mathbf{b}, \mathbf{p}, a_{-i}^*, \sigma_a^2, \sigma_p^2, \sigma_e^2, \mathbf{y} \sim N(\hat{a}_i^*, (z_i'z_i + A_{i,i}^{*-1}\alpha)^{-1}\sigma_e^2)$$

with $(z_i'z_i + A_{i,i}^{*-1}\alpha)\hat{a}_i^* = z_i'\mathbf{y} - z_i'\mathbf{X}\mathbf{b} - z_i'\mathbf{W}\mathbf{p} - A_{i,-i}^{*-1}\alpha\,\mathbf{a}_{-i}^*$, where $\alpha = \sigma_e^2/\sigma_a^2$ and $z_i'z_i$ is the $i$th diagonal element of $\mathbf{Z}'\mathbf{Z}$.

#### 3.1.2.4. Variance components


The additive genetic variance is defined by:

$$\sigma_a^2|\mathbf{b}, \mathbf{p}, \mathbf{a}^*, \sigma_p^2, \sigma_e^2, \mathbf{y} \sim \tilde{\nu}_a \tilde{S}_a^2 \chi_{\tilde{\nu}_a}^{-2}$$

with $\tilde{\nu}_a = n_a + \nu_a$, $\tilde{S}_a^2 = (\mathbf{a}^{*\prime}\mathbf{A}^{*-1}\mathbf{a}^* + \nu_a S_a^2)/\tilde{\nu}_a$, and $n_a$ the number of animals being evaluated. The variance of permanent environmental effects is given by:

$$\sigma_p^2|\mathbf{b}, \mathbf{p}, \mathbf{a}^*, \sigma_a^2, \sigma_e^2, \mathbf{y} \sim \tilde{\nu}_p \tilde{S}_p^2 \chi_{\tilde{\nu}_p}^{-2}$$

with $\tilde{\nu}_p = n_p + \nu_p$, $\tilde{S}_p^2 = (\mathbf{p}'\mathbf{p} + \nu_p S_p^2)/\tilde{\nu}_p$, and $n_p$ the number of animals being evaluated. The residual variance is given by:

$$\sigma_e^2|\mathbf{b}, \mathbf{p}, \mathbf{a}^*, \sigma_a^2, \sigma_p^2, \mathbf{y} \sim \tilde{\nu}_e \tilde{S}_e^2 \chi_{\tilde{\nu}_e}^{-2}$$

with $\tilde{\nu}_e = n_e + \nu_e$ and

$$\tilde{S}_e^2 = \left[(\mathbf{y} - \mathbf{X}\mathbf{b} - \mathbf{W}\mathbf{p} - \mathbf{Z}\mathbf{a}^*)'(\mathbf{y} - \mathbf{X}\mathbf{b} - \mathbf{W}\mathbf{p} - \mathbf{Z}\mathbf{a}^*) + \nu_e S_e^2\right]/\tilde{\nu}_e$$

and $n_e$ is the total number of records.
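The full conditionals above can be chained into a single-site Gibbs sampler. The sketch below uses a deliberately simplified version of the model (a single overall mean instead of **Xb**, no permanent environmental or group effects, and unrelated animals, i.e. $\mathbf{A} = \mathbf{I}$ so $\alpha = \sigma_e^2/\sigma_a^2$); all dimensions, seeds, and hyperparameters are illustrative assumptions, not values from the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 200, 50                            # records and animals (illustrative)
Z = np.zeros((n, q))
Z[np.arange(n), np.arange(n) % q] = 1.0   # 4 records per animal
true_a = rng.normal(0.0, 1.0, q)          # simulate with sigma2_a = 1
y = 2.0 + Z @ true_a + rng.normal(0.0, np.sqrt(0.5), n)  # sigma2_e = 0.5

nu_a = nu_e = 4.0                         # prior degrees of belief (assumed)
S2_a = S2_e = 0.5                         # prior scale values (assumed)
b, a, s2a, s2e = 0.0, np.zeros(q), 1.0, 1.0
keep = []
for it in range(1500):
    Za = Z @ a
    # overall mean (flat prior): b | rest ~ N(mean(y - Za), sigma2_e / n)
    b = rng.normal((y - Za).mean(), np.sqrt(s2e / n))
    e = y - b - Za                        # current residual
    alpha = s2e / s2a                     # variance ratio in the full conditional
    for i in range(q):
        zi = Z[:, i]
        d = zi @ zi + alpha
        # a_i | rest ~ N(z_i'(residual with a_i added back)/d, sigma2_e/d)
        a_hat = zi @ (e + zi * a[i]) / d
        a_new = rng.normal(a_hat, np.sqrt(s2e / d))
        e += zi * (a[i] - a_new)          # keep the residual up to date
        a[i] = a_new
    # scaled inverted chi-square full conditionals for the variances
    s2a = (a @ a + nu_a * S2_a) / rng.chisquare(q + nu_a)
    s2e = (e @ e + nu_e * S2_e) / rng.chisquare(n + nu_e)
    if it >= 500:                         # discard burn-in
        keep.append((s2a, s2e))

post = np.array(keep).mean(axis=0)        # posterior means of (sigma2_a, sigma2_e)
print(post.round(2))
```

The posterior means should land near the simulated values (1 and 0.5); with a real pedigree, the `alpha` term would be multiplied by the appropriate elements of $\mathbf{A}^{-1}$ as in the conditionals above.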

Comparing genetic value predictions based on a polygenic model in the Tunisian Holstein population using BLUP and Bayesian analyses, Ref. [8] reported that the rankings of animals from the Bayesian methods are similar to those obtained with the BLUP method. Spearman's rank correlations between genetic values estimated by the Bayesian procedures and those estimated by BLUP were high (0.99). Likewise, Bayesian and best linear unbiased estimator (BLUE) solutions for fixed effects (month of calving, herd-year, and age-parity) showed the same patterns. The same result was reported by Ref. [9]. However, Ref. [8] found different correlation estimates between the two methods (Bayesian and BLUP) for cows' and bulls' breeding values.

### 4. Genomic selection

A massive quantity of genomic data is now available in animal and plant breeding thanks to revolutionary developments in sequencing and genotyping, and the cost of genotyping has been dramatically reduced. Consequently, genomic selection is now practicable with the high number of single nucleotide polymorphism (SNP) markers available, and it is feasible to analyze the genome at a level that was not possible before [10–13]. The concept of genomic selection was introduced by Ref. [1], who suggested that a set of markers covering the whole genome explains all the genetic variance: each marker is likely to be associated with a quantitative trait locus (QTL), and each QTL is in linkage disequilibrium with the markers. The number of effects per QTL to be estimated is very small, and the estimated effects of all markers are summed to obtain the genetic value of the individual. Using simulation, Ref. [1] showed that with high-density SNP markers it is possible to predict breeding values with an accuracy of 0.85 (where accuracy is the correlation between the estimated and true breeding values). The challenge in genomic evaluation is to find the prediction method that yields the most accurate genetic values for candidates, and many genomic evaluation methods have been proposed [14, 15]. The main objective of this section is to compare Bayesian methods with other methods used in genomic selection on the basis of their predictive ability. The study reported in Ref. [1] is considered an influential paper for dairy cattle breeding programs. First, the methods suggested correspond well to data structures in which the number of SNPs substantially exceeds the number of observations. Second, the methods of Ref. [1] constitute a logical evolution of the BLUP methodology, the reference method in animal genetics, by considering SNP-specific variances at the different loci. Third, the Bayesian approaches used in Ref. [1], which account for unknown effects (measuring prior uncertainty) in a model and are combined with the power of Markov chain Monte Carlo, can be applied to the majority of parametric statistical models.

#### 4.1. Genomic BLUP (GBLUP)

The GBLUP method assumes that the effects of all SNPs are sampled from the same normal distribution; the effects of all markers are assumed to be small, with equal variance. Genomic BLUP is defined by the model:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{g} + \mathbf{e} \tag{4}$$

where $\mathbf{y}$ is the data vector; $\mu$ is the overall mean; $\mathbf{1}$ is a vector of $n$ ones; $\mathbf{Z}$ is an incidence matrix allocating records to the marker effects; $\mathbf{g}$ is a vector of SNP effects assumed to be normally distributed, $\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)$, where $\sigma_g^2$ is the additive genetic variance and $\mathbf{G}$ is the genomic relationship matrix; $\mathbf{e}$ is the vector of normal errors, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, where $\sigma_e^2$ is the error variance. The genomic relationship matrix is defined as

$$\mathbf{G} = \frac{\mathbf{X}'\mathbf{X}}{\sum_{i=1}^{m} p_i(1-p_i)}$$

where $\mathbf{X}$ is the matrix of specified SNP genotype coefficients at each locus and $p_i$ is the rare (minor) allele frequency of SNP $i$.

#### 4.2. Bayesian approaches

In Bayesian estimation, the information from the data is combined with the information from the prior distribution of the variances of the markers. Several Bayesian statistical analyses have been used in genomic evaluation, which differ in the hypotheses of distributions of marker effects. At the level of the modeling of the variances of the effects of the markers, Meuwissen et al. [1] proposed different distributions a priori between the Bayes A and Bayes B methods.

#### 4.2.1. Bayes A

The Bayes A method assumes that the variance of marker effects differs among loci (i.e., $\sigma_{g_j}^2$ differs across $j$) [16]. The variances are modeled according to the scaled inverted chi-square distribution, so the a priori distribution of the variances of the SNP effects is written $P(\sigma_{g_j}^2) \sim \chi^{-2}(\nu, S)$, where $S$ is the scale parameter and $\nu$ is the number of degrees of freedom. If we assume a normal distribution of the data, this has the advantage of leading to a posterior conditional distribution that is also a $\chi^{-2}$:

$$P(\sigma_{g_j}^2|g_j) \sim \chi^{-2}(\nu + n_j, S + g_j'g_j)$$

where $n_j$ is the number of marker effects at segment $j$. The posterior distribution combines the information provided by the data with the a priori distribution.
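A sketch of the per-locus variance update implied by this posterior: under the convention that the second argument of $\chi^{-2}$ is the sum-of-squares scale, a draw is $(S + g_j'g_j)/\chi^2(\nu + n_j)$, so loci whose current sampled effects are larger receive larger variance draws on average. All hyperparameters and effects below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
nu, S = 4.2, 0.02                      # assumed prior df and scale
g = rng.normal(0, 0.1, 1000)           # current sampled marker effects
n_j = 1                                # one effect per marker segment
# One Bayes A-style draw of sigma2_g per locus:
sigma2_g = (S + g**2) / rng.chisquare(nu + n_j, size=g.size)

# Loci with larger sampled effects get larger variance draws on average,
# which is what lets shrinkage differ across loci.
big, small = np.abs(g) > 0.15, np.abs(g) < 0.05
print(sigma2_g[big].mean() > sigma2_g[small].mean())
```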

#### 4.2.2. Bayes B


In a genomic evaluation context, the Bayes B method [1, 17] assumes different variances of SNP effects, with many SNPs contributing zero effect and a few contributing large effects on the trait. Meuwissen et al. [1] proposed a model in which a proportion π (arbitrarily fixed at 0.95) of the markers has zero effect. The a priori distribution of the variances of the marker effects is then written:

$\sigma_{g_j}^2 = 0$ with probability $\pi$, and $P(\sigma_{g_j}^2) \sim \chi^{-2}(\nu, S)$ with probability $(1-\pi)$. Gibbs sampling cannot be used to estimate the effects and variances of the Bayes B model because of the high probability that some markers have zero variance; a Metropolis-Hastings algorithm is therefore used, which allows the simultaneous estimation of $\sigma_{g_j}^2$ and $g_j$. On the basis of the results of Ref. [1] and many subsequent works, the Bayes B method is often considered the benchmark in terms of genomic prediction efficiency, but it is extremely costly in computational time. Meuwissen [18] proposed an alternative to the Bayes B method that relies on a fast algorithm.
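The spike-and-slab prior on the marker variances can be sketched directly (π, ν, and S below are illustrative; under the convention that the second argument of $\chi^{-2}(\nu, S)$ is the scale, a draw from the slab is $S/\chi^2_\nu$):

```python
import numpy as np

# Bayes B prior on marker variances: exactly zero with probability pi,
# otherwise a scaled inverted chi-square draw. Values are illustrative.
rng = np.random.default_rng(5)
pi, nu, S = 0.95, 4.2, 0.04
m = 100_000                                   # number of markers
nonzero = rng.random(m) >= pi                 # ~5% of markers get an effect
sigma2_g = np.where(nonzero, S / rng.chisquare(nu, m), 0.0)

print(round(nonzero.mean(), 3))               # fraction of nonzero-variance SNPs
```

The point mass at zero is what blocks a plain Gibbs sampler: a marker with $\sigma_{g_j}^2 = 0$ forces $g_j = 0$, so the chain can never leave that state by sampling $\sigma_{g_j}^2 | g_j$, hence the Metropolis-Hastings step.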

#### 4.2.3. Bayesian lasso

Legarra et al. [19] proposed a Bayesian lasso (BL) model with different variances for residual and SNP effects, which they termed BL2Var. It is assumed that a large number of SNPs have practically zero effect and that very few have large effects. Tibshirani [20] showed that the distribution underlying the lasso estimators can be written:

$$P(g_j|\lambda) \sim \frac{\lambda}{2}\exp\left(-\lambda|g_j|\right)$$

He suggested that the lasso estimators can be interpreted as the a posteriori mode of a model in which the regression parameters are independent and identically distributed according to a double exponential prior distribution. Park and Casella [21] proposed a fully Bayesian approach, assuming an a priori distribution of the regression coefficients such as:

$$P(g_j|\sigma^2, \lambda) \sim \frac{\lambda}{2\sqrt{\sigma^2}}\exp\left(-\frac{\lambda}{\sqrt{\sigma^2}}|g_j|\right)$$

where $\sigma^2$ represents both the residual variance of the model and the variance of the SNP effects. Applications of the Bayesian lasso to genomic selection proposed in Refs. [22, 23] use the same variance $\sigma^2$ to model both the distribution of SNP effects and the residuals. De los Campos et al. [22] showed that the Bayesian lasso is close to the Bayes B method in terms of prediction precision but with a significant reduction in computational complexity. In addition, these authors recommended the Bayesian lasso to cope with the large number of markers included in regression models, which is typically larger than the number of records.
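The double exponential (Laplace) prior can be compared with a normal prior of equal variance to see why the lasso shrinks differently: it puts more mass both near zero and in the tails, so small SNP effects are shrunk strongly while large ones are left relatively untouched. A small sketch (choosing λ = √2 so the Laplace variance 2/λ² equals 1):

```python
import numpy as np

lam = np.sqrt(2.0)                       # Laplace variance = 2 / lam^2 = 1

def laplace_pdf(x):
    # double exponential density (lam/2) * exp(-lam*|x|)
    return 0.5 * lam * np.exp(-lam * np.abs(x))

def normal_pdf(x):
    # standard normal density (variance 1, same as the Laplace above)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# More mass near zero AND in the tails under the Laplace prior:
print(laplace_pdf(0.0) > normal_pdf(0.0), laplace_pdf(4.0) > normal_pdf(4.0))
# True True
```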

#### 4.2.4. The Bayes C method

Bayesian methods such as Bayes A and Bayes B [1] have been widely used for genomic evaluation. Similar methods with similar performance have been developed in order to reduce computation time and simplify the statistical modeling. The Bayes C method [24] differs from Bayes B by assuming that the variance associated with the SNPs is common to all markers. In Bayes C, as in Bayes B, the probability $\pi$ that an SNP has a nonzero effect is assumed to be known. The model is similar to the Bayes B model but with a homogeneous variance of effects across all loci: $\sigma_g^2 = 0$ with probability $1-\pi$; $\sigma_g^2 \sim \chi^{-2}(\nu, S)$ otherwise. The main limitation of the Bayes C method is that the proportion of SNPs with a nonzero effect is assumed to be known. With the Bayes A method, the parameter $\pi$ is equal to 1, which implies that all the markers have an effect. For the Bayes B method, $\pi$ is strictly less than 1 in order to accommodate the hypothesis that some SNPs may have zero effect, but it is fixed arbitrarily even though the intensity of variable selection is controlled by this parameter. Habier et al. [25] proposed to modify the Bayes C method by estimating the parameter $\pi$: the parameter $\pi$ is assumed to be unknown, and its a priori distribution is uniform over [0, 1]. SNP modeling is the same as in Bayes C: $P(g_j|\pi, \sigma_g^2) = 0$ with probability $1-\pi$, and $P(g_j|\pi, \sigma_g^2) \sim N(0, \sigma_g^2)$ with probability $\pi$, where $P(\sigma_g^2) \sim \chi^{-2}(\nu, S)$. The various parameters of this model are estimated by Markov chain Monte Carlo (MCMC) methods [6, 26], as proposed in Ref. [25]. The common SNP-effect variance can be written as a function of the additive genetic variance $\sigma_a^2$:

$$\sigma_g^2 = \frac{\sigma_a^2}{(1-\pi)\sum_{j=1}^{p} 2p_j(1-p_j)}$$

where $p_j$ is the allelic frequency of SNP $j$.
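The relation tying $\sigma_g^2$ to $\sigma_a^2$, $\pi$, and the allele frequencies is a simple arithmetic one; a quick check with purely illustrative values (none of these numbers come from the chapter):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_a = 0.3                       # assumed additive genetic variance
pi = 0.95                            # assumed proportion of zero-effect SNPs
p = rng.uniform(0.05, 0.5, 50_000)   # assumed allele frequencies of the SNPs

# sigma2_g = sigma2_a / ((1 - pi) * sum_j 2*p_j*(1 - p_j))
sigma2_g = sigma2_a / ((1 - pi) * (2 * p * (1 - p)).sum())
print(sigma2_g)
```

Note the role of $(1-\pi)$: the fewer the markers allowed a nonzero effect, the larger the variance each of them must carry for the marker effects to sum to the same additive genetic variance.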

#### 4.3. A critique

The extreme speed of these developments hampers the process of linking them to extant theory and of understanding the statistical models suggested so far [27]. These authors criticized the theoretical and statistical concepts followed by Ref. [1] on three levels. The first is the connection between parameters (additive genetic variances in the Bayesian view) of infinitesimal models and those of marker-based models. The second is the relationship between molecular marker genotypes and similarity between relatives. The third is the connection between infinitesimal genetic models and marker-based regression models. Gianola et al. [27] argued that the Bayes A and Bayes B methods proposed by Ref. [1] require specifying parameters. The latter used formulas for obtaining the variance of SNP effects based on some knowledge of the additive genetic variance in the population: their development begins with the assumption that the marker effects are fixed, yet elsewhere treats them as random without a clear demonstration. Meuwissen et al. [1] explained that assigning a priori a value $\sigma_g^2 = 0$ with probability $\pi$ means that the specific SNP has no effect on the trait. By contrast, Ref. [27] showed that a parameter having zero variance does not necessarily imply that the parameter takes the value zero: the parameter could have any value, known with certainty. Gianola et al. [27] suggested the use of nonparametric methods, as developed in Refs. [22, 28], because these methods do not impose hypotheses about the mode of inheritance as the Bayes A and Bayes B methods do.

### 5. Applications in genomics

Campos et al. [22] showed that the Bayesian lasso is close in terms of precision of prediction to the Bayes B method but with a significant reduction in the complexity of the calculations. In addition, these authors suggested using Bayesian lasso against the large number of markers

#### 4.2.4. The Bayes C method

Bayesian methods such as Bayes A and Bayes B [1] have been widely used for genomic evaluation. Similar methods, with comparable performance, have been developed to reduce computation time and to simplify the statistical modeling. The Bayes C method [24] differs from Bayes B by assuming that the variance associated with the SNP effects is common to all markers. In Bayes C, as in Bayes B, the probability π that an SNP has a nonzero effect is assumed to be known. The model is similar to the Bayes B model but with a homogeneous variance of effects across all loci:

$$\sigma_g^2 = 0 \ \text{with probability}\ 1-\pi; \qquad \sigma_g^2 \sim \chi^{-2}(\nu, S) \ \text{with probability}\ \pi.$$

The main problem with the Bayes C method is that the number of SNPs included in the regression model is typically larger than the number of records.

With the Bayes A method, the parameter π is equal to 1, which implies that all the markers have an effect. With the Bayes B method, π is strictly less than 1, to take into account the hypothesis that some SNPs may have a zero effect; it is fixed arbitrarily, and the intensity of the variable selection is controlled by this parameter. Habier et al. [25] proposed to modify the Bayes C method by estimating the parameter π, that is, by treating π as unknown. The a priori distribution of π then becomes uniform over [0, 1]. SNP modeling is the same as with Bayes C:

$$P(g_j \mid \pi, \sigma_g^2) = 0 \ \text{with probability}\ 1-\pi; \qquad P(g_j \mid \pi, \sigma_g^2) \sim N(0, \sigma_g^2),\ \text{where}\ P(\sigma_g^2) \sim \chi^{-2}(\nu, S),\ \text{with probability}\ \pi.$$

The various parameters of this model are estimated by Markov chain Monte Carlo (MCMC) methods [6, 26], as proposed by Ref. [25]. The common SNP-effect variance is written as a function of the additive genetic variance $\sigma_a^2$:

$$\sigma_g^2 = \frac{\sigma_a^2}{(1-\pi)\sum_{j=1}^{p} 2\,p_j\,(1-p_j)},$$

where $p_j$ is the allelic frequency of SNP $j$.

#### 4.3. A critique

The extreme speed with which events are unfolding handicaps the process of linking new developments to extant theory and to the statistical models suggested up until now [27]. The latter authors criticize the theoretical and statistical concepts followed by Ref. [1] on three levels. The first is the connection between parameters (additive genetic variances, from a Bayesian view) of infinitesimal models and those of marker-based models. The second is the relationship between molecular marker genotypes and similarity between relatives. The third is the connection between infinitesimal genetic models and marker-based regression models. Gianola et al. [27] argued that the Bayes A and Bayes B methods proposed by Ref. [18] require specifying parameters. The latter used formulas for obtaining the variance of SNP effects based on some knowledge of the additive genetic variance in the population. Their development begins with the assumption that the marker effects are fixed, while elsewhere they are treated as random, without a clear demonstration. Meuwissen et al. [1] explained that assigning a priori the value $\sigma_g^2 = 0$ with a probability $\pi$ means that the specific SNP does not have an effect on the trait. By contrast, Ref. [27] illustrated that the statement $\sigma_g^2 = 0$ with a given probability concerns the variance of the SNP effect rather than the effect itself.
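The relation in Section 4.2.4 between the common SNP-effect variance $\sigma_g^2$ and the additive genetic variance $\sigma_a^2$ can be sketched numerically. The allele frequencies, $\sigma_a^2$, and $\pi$ below are made-up illustrative values, not estimates from the chapter.

```python
import numpy as np

# Hypothetical allele frequencies for p = 5 SNPs (illustrative values only)
p_j = np.array([0.10, 0.25, 0.40, 0.50, 0.05])

sigma2_a = 1.0   # assumed additive genetic variance
pi = 0.95        # assumed mixture proportion, as in the chapter's formula

# sigma2_g = sigma2_a / ((1 - pi) * sum_j 2 * p_j * (1 - p_j))
het = 2.0 * p_j * (1.0 - p_j)                  # 2 p_j (1 - p_j) per SNP
sigma2_g = sigma2_a / ((1.0 - pi) * het.sum())
print(round(sigma2_g, 4))
```

Note that the larger the proportion of SNPs assumed to have no effect, the larger the variance attributed to each contributing SNP.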

Major dairy breeding countries are now using genomic evaluation [27]. Several results have been reported around the world. Several authors reported that the reliabilities of genomic estimated breeding values (GEBV) were substantially greater than those of estimated breeding values (EBV) based on pedigree information [29]. The accuracy of selection differed between countries [12]. The accuracy depended on the size of the reference population, the heritability of the trait studied, the statistical models and approaches used for prediction of genetic values for quantitative traits, and the method used to estimate the accuracy [12, 27, 29]. Ref. [14] assessed the reliability of GEBVs for bulls of the Canadian and American Holstein populations. Genotypes of 3576 Holstein bulls at 39,416 molecular markers were used to establish the prediction equations.

The prediction methods comprised a linear model, in which marker effects are assumed to be normal, and a nonlinear model with a heavier-tailed prior distribution to account for major genes, as described by Ref. [1]. VanRaden et al. [14] reported that combining the polygenic effects based on pedigree information with the genomic predictions can improve the reliability to 23% above the reliability of polygenic effects alone. The same study showed that the nonlinear model had a small advantage in reliability over the linear model for all traits except fat and protein percentages. Genomic breeding values of 25 traits in New Zealand dairy cattle were estimated by Ref. [30]. The reference population consisted of 4500 bulls genotyped using the BovineSNP50 BeadChip, containing 44,146 SNPs. Harris and Johnson [31] reported an increase in accuracy when using Bayesian approaches compared to BLUP methods. In Ref. [31], genomic breeding values (GBVs) for young bulls with no daughter information had accuracies ranging from 50 to 67% for milk traits, live weight, fertility, somatic cell, and longevity, versus an average of 34% for the progeny test. Meuwissen et al. [1] compared the least squares method with BLUP and two Bayesian methods (Bayes A and Bayes B). The latter authors estimated the effects of 50,000 marker haplotypes from a limited number of observations (2200). Using the least squares method, it is not possible to estimate all effects simultaneously. For this reason, several steps were adopted to incorporate the effects of markers. First, they performed regression on markers for every segment of 1 cM. Second, they calculated a log-likelihood, assumed to be normal, for every chromosome segment. Third, they entered all segments corresponding to a likelihood peak into a multiple regression model. In the BLUP analyses, Ref. [1] considered all SNP effects to be independent and identically distributed with a known variance.
The Bayes A method was the same as BLUP at the level of the data but differs in the variances of the chromosome segments, which are assumed to follow an inverted chi-square distribution. A mixture prior distribution of genetic variances was used in the Bayes B method. Table 1 shows the accuracy of selection obtained by Ref. [1] with the GBLUP method, the least squares regression, and the

Bayes A and Bayes B approaches. The predictive abilities of the different methods are estimated by calculating the correlation (ρ) between true and estimated breeding values and the regression (b) of true on estimated breeding values.

| Method | ρ | b |
|---|---|---|
| Least squares | 0.318 | 0.285 |
| GBLUP | 0.732 | 0.896 |
| Bayes A | 0.798 | 0.827 |
| Bayes B | 0.848 | 0.946 |

Table 1. Comparing estimated versus true breeding values [1].

The least squares method is the least efficient because it overestimates the QTL effects [32]. The Bayes B approach is the most accurate in terms of both correlation and regression. However, the regression coefficient obtained by the Bayesian methods was still less than 1, probably because the a priori χ² distribution assumed by Bayes A and Bayes B differs from the simulated distribution of the variances. Goddard and Hayes [11] compared the correlation of 0.85 reported by Ref. [1] to results obtained on real data by Refs. [14, 33, 34]. VanRaden et al. [14] obtained a mean correlation over several traits of 0.71 from a reference population of more than 3500 bulls. Studies have shown the superiority of genomic evaluation [35] and of marker-assisted selection in France [36] over the classical infinitesimal model of quantitative genetics. Several authors have applied the first genomic evaluation methods described by Ref. [1], or methods derived from them, to real data. The Bayes A and Bayes B approaches have yielded results that are often similar or slightly superior to GBLUP in terms of accuracy of genetic value prediction, for example for the Australian Holstein-Friesian cattle breed (+0.02 to +0.07 correlation gain between predicted and observed values) [12] and in New Zealand (+2% correlation gain [31]). However, the GBLUP method required less computing time than the Bayes A method [32, 37]. Gredler et al. [38] demonstrated the superiority of the Bayes B method, in terms of the accuracy of genomic estimates, over a Bayes A method modified to integrate a polygenic effect [39]. Thus, although the Bayes B method seems slightly more efficient than the Bayes A method, numerous studies showed that the Bayes B method is not substantially better, in terms of accuracy of the genomic estimates, than a GBLUP model [40].

Overall, the studies indicate that the Bayesian approaches, which assume an a priori distribution of SNP effects, increase the reliability of breeding values over traditional BLUP methods [1, 12, 14]. A common conclusion is that for most quantitative traits, the hypothesis of the traditional BLUP method, that all markers are associated with equal variances, is far from reality. Comparing the results obtained in various populations around the world clearly shows that the accuracies of GEBVs were greater than those of breeding values estimated from progeny tests based on pedigree information. Several studies suggested combining the progeny test based on pedigree information with the genomic breeding value to calculate the final GEBV [5, 25]. Accuracy based on modeling molecular marker and pedigree information jointly was generally superior to that of a model including only genomic or only pedigree information. Hayes et al. [12] reported that a main advantage of using both sources of information, polygenic breeding values and genomic information, is that any QTL not detected through the marker effects may be captured by the progeny test based on pedigree information. A significant reduction in the posterior mean of the residual variance was reported by Ref. [22] when pedigree and markers were considered jointly, compared to a pedigree-based model. In the same study, the Spearman rank correlation of estimated breeding values between the model including marker information and the pedigree-based model was close to 1.
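The two predictive-ability measures in Table 1, the correlation ρ between true and estimated breeding values and the regression b of true on estimated values, can be computed as follows. The breeding-value vectors below are synthetic illustrations, not the simulated data of Ref. [1].

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic true breeding values and noisy estimates (illustration only)
tbv = rng.normal(0.0, 1.0, size=500)
ebv = 0.8 * tbv + rng.normal(0.0, 0.5, size=500)

# Accuracy: correlation between true and estimated breeding values
rho = np.corrcoef(tbv, ebv)[0, 1]

# Bias check: regression of TRUE on ESTIMATED breeding values
# (b = 1 indicates unbiased estimates; b < 1 indicates overdispersed EBVs)
b = np.cov(tbv, ebv)[0, 1] / np.var(ebv, ddof=1)

print(f"rho = {rho:.3f}, b = {b:.3f}")
```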

### 6. Conclusion


Standard quantitative genetic model based on phenotypic and pedigree information has been very successful in term of genetic value prediction. Also, the availability of genome-wide dense markers leads researchers to be able to perform advanced genetic evaluation of quantitative traits with a high accuracy of prediction of genetic value. However, a main problem is how this information should be included into statistical genetic models. Bayesian MCMC methods appear to be convenient for genetic value prediction with a focus on the precision of the choice of prior distribution for the different parameters.

### Author details

Hafedh Ben Zaabza<sup>1</sup> \*, Abderrahmen Ben Gara<sup>2</sup> and Boulbaba Rekik<sup>2</sup>

\*Address all correspondence to: hafedhbenzaabza@gmail.com

1 Institut National Agronomique, Tunis-Mahrajène, Tunisie

2 Département des productions animales, Ecole supérieure d'Agriculture de Mateur, Mateur, Tunisie

### References

[5] Hallander J, Waldmann P, Chunkao W, Sillanpaa MJ. Bayesian inference of genetic parameters based on conditional decompositions of multivariate normal distributions. Genetics. 2010;185:645-654. DOI: 10.1534/genetics.110.114249

[6] Robert CP. Le choix bayésien: Principes et pratique. 1st ed. Paris: Springer-Verlag France; 2006. p. 638

[7] Ben Zaabza H, Ben Gara A, Hammami H, Ferchichi MA, Rekik B. Estimation of variance components of milk, fat, and protein yields of Tunisian Holstein dairy cattle using Bayesian and REML methods. Archives Animal Breeding. 2016;59:243-248. DOI: 10.5194/aab-59-243-2016

[8] Ben Gara A, Rekik B, Bouallègue M. Genetic parameters and evaluation of the Tunisian dairy cattle population for milk yield by Bayesian and BLUP analyses. Livestock Science. 2006;100:142-149. DOI: 10.1016/j.livsci.2005.08.012

[9] Schenkel FS, Schaeffer LR, Boettcher PJ. Comparison between estimation of breeding values and fixed effects using Bayesian and empirical BLUP estimation under selection on parents and missing pedigree information. Genetics Selection Evolution. 2002;34:41-59. DOI: 10.1051/gse:2001003

[10] Gianola D, Fernando RL, Stella A. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006;173(3):1761-1776. DOI: 10.1534/genetics.105.049510

[11] Goddard ME, Hayes BJ. Genomic selection. Journal of Animal Breeding and Genetics. 2007;124:323-330. DOI: 10.1111/j.1439-0388.2007

[12] Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. Genomic selection in dairy cattle: Progress and challenges. Journal of Dairy Science. 2009;92:433-443. DOI: 10.3168/jds.2008-1646

[13] Wittenburg D, Melzer N, Reinsch N. Including non-additive genetic effects in Bayesian methods for the prediction of genetic values based on genome-wide markers. BMC Genetics. 2011;12(74):14

[14] VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, et al. Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy Science. 2009;92:16-24. DOI: 10.3168/jds.2008-1514

[15] Colombani C, Croiseau P, Fritz S, Guillaume F, Legarra A, Ducrocq V, Robert-Granié C. A comparison of partial least squares (PLS) and sparse PLS regressions in genomic selection in French dairy cattle. Journal of Dairy Science. 2012;95:2120-2131. DOI: 10.3168/jds.2011-4647

[16] Su G, Guldbrandtsen B, Gregersen VR, Lund MS. Preliminary investigation on reliability of genomic estimated breeding values in the Danish Holstein population. Journal of Dairy Science. 2010;93(3):1175-1183. DOI: 10.3168/jds.2009-2192

[17] Villumsen TM, Janss L, Lund MS. The importance of haplotype length and heritability using genomic selection in dairy cattle. Journal of Animal Breeding and Genetics. 2009;126(1):3-13. DOI: 10.1111/j.1439-0388.2008

[18] Meuwissen THE. Accuracy of breeding values of "unrelated" individuals predicted by dense SNP genotyping. Genetics Selection Evolution. 2009;41:35. DOI: 10.1186/1297-9686-41-35


[32] Moser G, Tier B, Crump RE, Khatkar MS, Raadsma HM. A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers. Genetics Selection Evolution. 2009;41(56). DOI: 10.1186/1297-9686-41-56

[33] Legarra A, Misztal I. Technical note: Computing strategies in genome-wide selection. Journal of Dairy Science. 2008;91(1):360-366. DOI: 10.3168/jds.2007-0403

[34] González-Recio O, Gianola G, Rosa GJM, Weigel KA, Kranis A. Genome-assisted prediction of a quantitative trait measured in parents and progeny: Application to food conversion rate in chickens. Genetics Selection Evolution. 2009;41(3):10. DOI: 10.1186/1297-9686-41-3

[35] VanRaden P. Efficient methods to compute genomic predictions. Journal of Dairy Science. 2008;91(11):4414-4423. DOI: 10.3168/jds.2007-0980

[36] Boichard D, Fritz S, Rossignol MN, Bosher MY, Malafosse A, Colleau JJ. Implementation of marker-assisted selection in French dairy cattle. In: 7th World Congress on Genetics Applied to Livestock Production; 19-23 August 2002; Montpellier, France. 2002. Session 22. Exploitation of molecular information in animal breeding. Electronic communication 22-03. p. 4

[37] Solberg TR, Sonesson AK, Woolliams JA, Meuwissen THE. Reducing dimensionality for prediction of genome-wide breeding values. Genetics Selection Evolution. 2009;41(29):8. DOI: 10.1186/1297-9686-41-29

[38] Gredler B, Nirea KG, Solberg TR, Egger-Danner C, Meuwissen THE, Solkner J. Genomic selection in Fleckvieh/Simmental—First results. In: Proceedings of the Interbull Meeting; 21-24 August 2009; Barcelona, Spain. Interbull Bulletin. 2009;40:209-213

[39] Hayes BJ. Genomic selection in the era of the \$1000 genome sequence. In: Symposium Statistical Genetics of Livestock for the Post-Genomic Era; Wisconsin-Madison, USA; 2009

[40] Habier DJ, Tetens J, Seefried FR, Lichtner P, Thaller G. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genetics Selection Evolution. 2010;42(5). DOI: 10.1186/1297-9686-42-5

### **Bayesian Two-Stage Robust Causal Modeling with Instrumental Variables using Student's t Distributions**

Dingjing Shi and Xin Tong

DOI: 10.5772/intechopen.70393

Additional information is available at the end of the chapter

### Abstract

In causal inference research, the issue of treatment endogeneity is commonly addressed using two-stage least squares (2SLS) modeling with instrumental variables (IVs), where the local average treatment effect (LATE) is the causal effect of interest. Because practical data are often heavy tailed or contain outliers, traditional 2SLS modeling based on normality assumptions may yield inefficient or even biased LATE estimates. This study proposes four types of Bayesian two-stage robust causal models with IVs to model normal and nonnormal data and evaluates their performance. Monte Carlo simulation results show that Bayesian two-stage robust causal modeling produces reliable parameter estimates and model fits. In particular, among the four types of two-stage robust models with IVs, the models that take outliers into consideration and use Student's t distributions in the second stage to model heavy-tailed data or data containing outliers provide more accurate and efficient LATE estimates and better model fits than the other models when data are contaminated. These models are recommended for general use in two-stage causal modeling with IVs.

Keywords: Bayesian methods, two-stage causal modeling with instrumental variables, nonnormal data, robust method using Student's t distributions

### 1. Introduction

Causal inference and experimental researchers are often interested in the average treatment effect (ATE), measured by the outcome difference between participants assigned to the treatment and those assigned to the control. The estimation of the ATE for the whole population is neither reliable nor feasible when certain conditions are not met or assumptions are violated [6, 9]. Instead, the treatment effect for only a subset of participants is estimated, which is called the local average treatment effect (LATE) [2, 13]. Different studies may have different

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


LATEs, depending on the subgroup of interest. Often the subgroup of interest comprises those who have been assigned to the treatment and have actually received it [3]. When the linearity assumption of the traditional linear model is violated and the endogenous regressors are correlated with the errors, one way to estimate the LATE is to incorporate instrumental variables (IVs), which are correlated with the endogenous regressors but uncorrelated with the error terms. Instrumental variables are incorporated in the analysis to estimate the LATE, the part of the treatment effect whose estimation is not contaminated by the violation of this assumption.

Two-stage least squares (2SLS) modeling [1] is widely used to estimate the LATE with IVs. In the first stage, IVs are used to predict the partial treatment effect that can be explained by the variations of IVs, and in the second stage, the fitted treatment values are used to predict the experimental outcome, and to estimate the LATE. In estimating the LATE in traditional 2SLS modeling with IVs, it is typically assumed that the measurement errors at both stages are normally distributed. However, practical data in social and behavioral research usually violate the normality assumption and often have heavy tails or contain outliers [25]. Failure to take the nonnormal data into consideration but instead treating the heavy-tailed data or data containing outliers as if they were normally distributed may result in unreliable parameter estimates and inflated type I error rates [35, 38–40], which will eventually lead to misleading statistical inference.
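The two-stage procedure just described can be sketched with ordinary least squares at each stage. This is a minimal illustration with an arbitrary data-generating process (true effect 1.5, a binary instrument, a confounder), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

z = rng.binomial(1, 0.5, size=n)           # instrument: randomized assignment
u = rng.normal(size=n)                     # unobserved confounder
# Treatment actually received depends on assignment and the confounder
d = (0.9 * z + 0.5 * u + rng.normal(size=n) > 0).astype(float)
y = 1.5 * d + u + rng.normal(size=n)       # outcome; true effect is 1.5

# Stage 1: regress treatment on the instrument, keep fitted values
X1 = np.column_stack([np.ones(n), z])
d_hat = X1 @ np.linalg.lstsq(X1, d, rcond=None)[0]

# Stage 2: regress outcome on the fitted treatment values -> LATE estimate
X2 = np.column_stack([np.ones(n), d_hat])
late = np.linalg.lstsq(X2, y, rcond=None)[0][1]
print(f"2SLS LATE estimate: {late:.2f}")   # consistent for 1.5 as n grows
```

Regressing y on d directly would be biased by the confounder u; using the fitted values from the first stage removes that contamination.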

Routine methods to accommodate heavy-tailed data or data with outliers include data transformation and data truncation. However, transformed data are often difficult to interpret especially when the raw scores have meaningful scales [17], and the exclusion of outliers may lead to underestimated standard errors and reduced efficiency [14, 32]. Alternatively, different robust procedures have been developed to provide reliable parameter estimates, the associated standard errors, and statistical tests. The rationale of most robust procedures is to weigh each observation according to its distance from the center of the majority of the data, so that outliers that are far from the center of the data are downweighted [10, 11, 37]. In recent research, more and more robust methods have been used to estimate complex models, such as linear and generalized linear mixed-effects models [19, 26], structural equation models [15, 31], and hierarchical linear and nonlinear models [20, 29].

Over the past decades, robust procedures based on Student's t distributions have been developed and advanced to model heavy-tailed data or data containing outliers [14, 33]. For example, Student's t distributions have been applied under the structural equation modeling framework and were found to produce reliable parameter estimates and inferences [15, 16]; in robust mixture models, Wang et al. [30] used the multivariate t distribution to fit heavy-tailed data and data with missing information, Shoham [24] implemented a robust clustering algorithm in mixture models by modeling data that are contaminated by outliers using multivariate t distributions, Seltzer et al. [21] and Seltzer and Choi [22] conducted sensitivity analysis employing Student's t distributions in robust multilevel models and downweighted outliers in level two (the between-subject level), and Tong and Zhang [28] and Zhang et al. [36] advanced the Student's t distributions to robust growth curve models and provided online software to carry out the analysis. Although robust methods based on Student's t distributions have been used in different modeling frameworks, few have been adopted in the causal modeling, where heavy-tailed data or data containing outliers are not uncommon [18].

Recently, Shi and Tong [23] implemented a robust Bayesian estimation method using Student's t distributions in two-stage causal modeling with IVs, to fit data that contain outliers or are normally distributed concurrently at both stages. However, in two-stage causal models with IVs, the data at either stage are equally likely to have outliers or to be nonnormally distributed. Previous studies have noted such situations. For example, Pinheiro et al. [19] applied a robust estimation to the linear mixed-effects model, using the multivariate t distribution for both the random effects and the intraindividual errors simultaneously. Tong and Zhang [28] conducted a robust estimation of growth curve models and modeled the measurement errors and random effects separately, with t distributions or normal distributions rather than the same distribution for both components. Therefore, this article extends the study of Shi and Tong [23] and proposes four possible types of two-stage causal models with IVs. The study evaluates the performance of the robust method in the four types of models. In the following section, the robust method based on Student's t distributions is reviewed. Then, the two-stage causal models with IVs, the associated LATE, and the corresponding four types of models are introduced. Next, a Monte Carlo simulation study is conducted to evaluate the performance of the robust method in the four possible types of two-stage causal models with IVs. In the end, conclusions are summarized and discussions are provided.

### 2. Robust methods based on Student's t distributions


As a robust procedure, the fundamental idea of using Student's t distributions to model heavy-tailed data or data containing outliers is to assign a weight to each case and properly downweight cases that are far from the center of the majority of the data [10, 11, 37]. Suppose a population of k random variables, y, follows a multivariate t distribution with mean vector μ, scale matrix Ψ, and degrees of freedom ν, denoted by t(μ, Ψ, ν). The probability density function of y can be expressed as:

$$p(\boldsymbol{y}|\boldsymbol{\mu},\,\boldsymbol{\Psi},\,\nu) = \frac{|\boldsymbol{\Psi}|^{-\frac{1}{2}}\,\Gamma\left(\frac{\nu+k}{2}\right)}{\Gamma\left(\frac{1}{2}\right)^{k}\,\Gamma\left(\frac{\nu}{2}\right)\,\nu^{\frac{k}{2}}} \times \left(1 + \frac{(\boldsymbol{y}-\boldsymbol{\mu})^{T}\boldsymbol{\Psi}^{-1}(\boldsymbol{y}-\boldsymbol{\mu})}{\nu}\right)^{-\frac{\nu+k}{2}}.\tag{1}$$
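Equation (1) can be checked numerically: for $k = 1$, $\mu = 0$, and unit scale it must reduce to the univariate Student's t density. The sketch below implements the formula directly and compares it against scipy's univariate t density (a verification aid, not code from the chapter).

```python
import math
import numpy as np
from scipy import stats

def mvt_pdf(y, mu, Psi, nu):
    """Multivariate t density of Eq. (1); y, mu are 1-D arrays, Psi is k x k."""
    y, mu = np.atleast_1d(y).astype(float), np.atleast_1d(mu).astype(float)
    Psi = np.atleast_2d(Psi).astype(float)
    k = y.size
    diff = y - mu
    quad = diff @ np.linalg.solve(Psi, diff)          # (y-mu)^T Psi^{-1} (y-mu)
    norm_const = (np.linalg.det(Psi) ** -0.5 * math.gamma((nu + k) / 2)
                  / (math.gamma(0.5) ** k * math.gamma(nu / 2) * nu ** (k / 2)))
    return norm_const * (1.0 + quad / nu) ** (-(nu + k) / 2)

# For k = 1, mu = 0, Psi = 1 this must equal scipy's univariate t density
# (recall Gamma(1/2)^k * nu^(k/2) = (nu * pi)^(k/2) for k = 1)
for x in (-2.0, 0.0, 1.3):
    assert math.isclose(mvt_pdf([x], [0.0], [[1.0]], 5.0),
                        stats.t.pdf(x, df=5.0), rel_tol=1e-9)
```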

The maximum likelihood estimates of model parameters under the model with t distribution assumptions satisfy

$$
\sum\_{i=1}^{n} w\_i \mathbf{A}\_i \boldsymbol{\Psi}\_i^{-1} \left( \boldsymbol{y}\_i - \boldsymbol{\mu} \right) = 0,\tag{2}
$$

where $n$ is the total sample size, $\boldsymbol{y}_i$ is a sample from $\boldsymbol{y}$, $\mathbf{A}_i$ is the partial derivative of $\boldsymbol{\mu}$, and

$$w\_i = \frac{\nu + \tau\_i}{\nu + \sigma\_i^2} \tag{3}$$

is the weight assigned to case $i$. In the equation for $w_i$, $\tau_i$ is the dimension of the parameter for each $i$ and $\sigma_i^2$ is the squared Mahalanobis distance $\sigma_i^2 = (\boldsymbol{y}_i - \boldsymbol{\mu})^T \boldsymbol{\Psi}^{-1} (\boldsymbol{y}_i - \boldsymbol{\mu})$. Note that $(\boldsymbol{y}_i - \boldsymbol{\mu})$ is the distance between each observation and the population mean, and a large $(\boldsymbol{y}_i - \boldsymbol{\mu})$ indicates a potential outlier as well as a large squared Mahalanobis distance $\sigma_i^2$. Outliers are downweighted in the analysis because the weight $w_i$ decreases with increasing squared Mahalanobis distance $\sigma_i^2$, given fixed degrees of freedom $\nu$ and dimension $\tau_i$ [14].
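The downweighting behavior of the weight $w_i$ can be illustrated directly: a point far from the center of the data receives a much smaller weight. Here $\nu$, the center, and the example points are arbitrary choices, and the data dimension $k$ plays the role of $\tau_i$.

```python
import numpy as np

nu = 4.0                      # degrees of freedom (chosen arbitrarily)
mu = np.array([0.0, 0.0])     # center of the data
Psi = np.eye(2)               # scale matrix
k = mu.size                   # dimension, playing the role of tau_i here

def weight(y):
    """w_i = (nu + k) / (nu + d2), d2 = squared Mahalanobis distance of y."""
    diff = y - mu
    d2 = diff @ np.linalg.solve(Psi, diff)
    return (nu + k) / (nu + d2)

w_typical = weight(np.array([0.5, -0.5]))   # near the center
w_outlier = weight(np.array([6.0, 6.0]))    # far from the center
print(w_typical, w_outlier)                  # the outlier is downweighted
```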

The shape of a t distribution is controlled by its degrees of freedom ν, which can be set a priori or estimated in the analysis. Under certain conditions, setting the degrees of freedom a priori has been recommended. Lange et al. [14] and Zhang et al. [36] suggested fixing the degrees of freedom of Student's t distributions when the sample size is small, as small sample sizes can lead to a biased degrees-of-freedom estimate. Moreover, Tong and Zhang [28] argued that by fixing the degrees of freedom, more accurate parameter estimates and credible intervals can be obtained when the model specification is built on solid substantive theories. In contrast, estimating the degrees of freedom makes the model more flexible. When the degrees of freedom ν are freely estimated, Student's t distributions have one additional parameter, ν, compared with normal distributions. As the degrees of freedom ν increase, the Student's t distribution approaches a normal distribution.
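The limiting behavior noted above can be illustrated with standard density functions (scipy used for convenience; an illustration, not the chapter's code):

```python
from scipy import stats

# As the degrees of freedom grow, the t density at a fixed point
# approaches the normal density at that point
x = 1.0
for df in (3, 30, 300, 3000):
    print(df, abs(stats.t.pdf(x, df) - stats.norm.pdf(x)))
```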

There are several advantages to using Student's $t$ distributions for robust data analysis [28]. First, unlike nonparametric robust analysis, Student's $t$ distributions have parametric forms, and inferences based on them can be carried out relatively easily through maximum likelihood estimation or Bayesian estimation methods. Second, the degrees of freedom of Student's $t$ distributions control the weight given to outliers and can be flexibly set a priori or estimated. Third, when data have heavy tails or contain outliers, viewing the Student's $t$ distribution as a natural extension of the normal distribution is rather intuitive.

### 3. Bayesian two-stage robust causal modeling with IVs

In causal ordinary least squares (OLS) regression, when the error terms are related to some regressors, the estimated ATE is biased due to the violation of the exogeneity assumption. Variables that are related to the endogenous regressors but not to the errors can be used as instruments to separate the endogenous regressors from the errors, leaving only the part of the treatment effect that has not been contaminated by endogeneity to be estimated; such variables are called instrumental variables (IVs). The ATE of interest then becomes the LATE of interest. For example, Currie and Yelowitz [8] studied the effect of a public housing voucher program, which entitles participants to a larger housing unit, on housing quality and educational attainment. Because some families in the voucher program trade off physical housing amenities for reductions in rental payments, which harms housing quality and their children's outcomes, some regressors are correlated with the errors and become endogenous. Prior theory holds that a household with an extra child is entitled to a larger housing unit, so whether there are extra children in the household and the sex composition of those children were chosen as the IVs, to study the program effect for participants who have one girl and one boy (i.e., a mixed sex composition) in the household. It was found that voucher program participants with a mixed sex composition in the household are more likely to have better housing quality and educational attainment. The example shows that when IVs are introduced, external validity is traded for improved internal validity, and the ATE (over all voucher program participants) becomes the LATE (over participants who have extra children with a mixed sex composition).


One commonly used framework for estimating the LATE is 2SLS modeling with IVs. Let $d_i$ and $y_i$ be the treatment and the outcome for individual $i$, respectively, and $\mathbf{Z}_i = (z_{i1}, \dots, z_{iJ})'$ be a vector of instrumental variables for individual $i$ ($i = 1, \dots, N$). Here, $N$ is the sample size and $J$ is the total number of instrumental variables. In the first stage of the 2SLS model, the IVs $\mathbf{Z}$ are used to predict the treatment $d$; in other words, the portion of variation in the treatment explained by the IVs is identified and estimated. The second stage then relies on this estimated exogenous portion of treatment variation, in the form of the predicted treatment values, to estimate the treatment effect on the outcome $y$. A typical 2SLS model with IVs can be expressed as:

$$d\_i = \pi\_{10} + \pi\_{11} \mathbf{Z}\_i + e\_{1i},\tag{4}$$

$$y\_i = \pi\_{20} + \pi\_{21}\widehat{d}\_i + e\_{2i},\tag{5}$$

where $\pi_{10}$ and $\boldsymbol{\pi}_{11} = (\pi_{11}, \dots, \pi_{1J})'$ are the intercept and regression coefficients of the linear model in which the treatment $d$ is regressed on the IVs $\mathbf{Z}$, respectively; and $\pi_{20}$ and $\pi_{21}$ are the intercept and slope of the linear model in which the outcome $y$ is regressed on the predicted treatment values $\widehat{d}$, respectively. The IVs help estimate the treatment effect: the causal effect of the IVs on the treatment is first estimated in Eq. (4), and the causal effect of this estimated partial treatment on the outcome is then estimated in Eq. (5). In the model, $\boldsymbol{\pi}_{11}$ is the causal effect of the IVs $\mathbf{Z}$ on the treatment $d$, and $\pi_{21}$ is the treatment effect on the outcome $y$ for the subset of participants whose treatment variation has been partialled out and explained by the IVs $\mathbf{Z}$; $\pi_{21}$ is the causal effect of interest and is called the LATE. There are several advantages to using 2SLS modeling to estimate the LATE. First, unlike point-estimate methods such as the Wald estimator [4], 2SLS modeling also provides a standard error estimate and confidence intervals for the LATE, making statistical inference more efficient. Second, when 2SLS models are used, covariates can be controlled simultaneously at both stages, when the effect of $\mathbf{Z}$ on $d$ and the effect of $\widehat{d}$ on $y$ are estimated. Mathematically, the estimated LATE $\widehat{\pi}_{21}$ in 2SLS can be derived as:

$$
\widehat{\pi}\_{21} = \frac{\text{cov}\left(y\_i, \widehat{d}\_i\right)}{\text{var}\left(\widehat{d}\_i\right)} = \frac{\text{cov}\left(y\_i, \widehat{\pi}\_{10} + \widehat{\pi}\_{11}z\_{i1} + \dots + \widehat{\pi}\_{1J}z\_{iJ}\right)}{\text{var}\left(\widehat{\pi}\_{10} + \widehat{\pi}\_{11}z\_{i1} + \dots + \widehat{\pi}\_{1J}z\_{iJ}\right)} = \frac{\widehat{\pi}\_{11}\text{cov}\left(y\_i, z\_{i1}\right) + \dots + \widehat{\pi}\_{1J}\text{cov}\left(y\_i, z\_{iJ}\right)}{\widehat{\pi}^2\_{11}\text{var}\left(z\_{i1}\right) + \dots + \widehat{\pi}^2\_{1J}\text{var}\left(z\_{iJ}\right)}.\tag{6}
$$
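A minimal numeric sketch of the two-stage procedure in Eqs. (4)–(6) may help. It uses simulated data with a hypothetical instrument and error structure (not the chapter's simulation design) and shows how the second-stage slope recovers a causal effect that naive OLS misses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: the instrument z affects the treatment d but is
# independent of the confounding error u shared by d and y.
z = rng.normal(size=n)
u = rng.normal(size=n)                         # confounding error
d = 1.0 + 0.8 * z + u + rng.normal(size=n)     # endogenous treatment
y = 2.0 + 0.5 * d + u + rng.normal(size=n)     # true effect of d on y is 0.5

# Stage 1 (Eq. 4): regress d on z, form predicted treatment d_hat
X1 = np.column_stack([np.ones(n), z])
pi1 = np.linalg.lstsq(X1, d, rcond=None)[0]
d_hat = X1 @ pi1

# Stage 2 (Eq. 5): regress y on d_hat; the slope is the LATE estimate
X2 = np.column_stack([np.ones(n), d_hat])
pi2 = np.linalg.lstsq(X2, y, rcond=None)[0]

# Naive OLS of y on d is biased upward by the shared error u
ols = np.linalg.lstsq(np.column_stack([np.ones(n), d]), y, rcond=None)[0]
print("naive OLS slope:", ols[1])
print("2SLS LATE estimate:", pi2[1])
```

With this error structure the naive OLS slope is pulled well above 0.5 by the shared error, while the two-stage estimate stays close to the true effect, illustrating why the exogenous portion of treatment variation is extracted first.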

Traditional causal 2SLS models with IVs are commonly estimated using OLS methods or maximum likelihood estimation from the frequentist approach. The measurement errors at the two stages, $e_{1i}$ and $e_{2i}$, are assumed to be normally distributed as $e_{1i} \sim N(0, \sigma_{e1}^2)$ and $e_{2i} \sim N(0, \sigma_{e2}^2)$. Because practical data usually violate the normality assumption, it was proposed from a Bayesian approach that the normal distributions be replaced by Student's $t$ distributions for heavy-tailed data or data containing outliers [23, 28, 36]. In the two-stage causal model with IVs, data at either stage are equally likely to be nonnormal or to contain outliers. Therefore, we propose four possible types of Bayesian two-stage causal models: (a) normal measurement errors at both stages, denoted the Bayesian normal model; (b) $t$ measurement errors in the first stage and normal measurement errors in the second stage, denoted the Bayesian nonnormal-s1 model; (c) normal measurement errors in the first stage and $t$ measurement errors in the second stage, denoted the Bayesian nonnormal-s2 model; and (d) $t$ measurement errors at both stages, denoted the Bayesian nonnormal-both model. The four types of Bayesian two-stage causal models have the same mathematical expressions as those from the frequentist approach. Namely, for the Bayesian normal model, the measurement errors are assumed to be distributed as $e_{1i} \sim N(0, \sigma_{e1}^2)$ and $e_{2i} \sim N(0, \sigma_{e2}^2)$; for the Bayesian nonnormal-s1 model, $e_{1i} \sim t(0, \sigma_{e1}^2, \nu_1)$ and $e_{2i} \sim N(0, \sigma_{e2}^2)$; for the Bayesian nonnormal-s2 model, $e_{1i} \sim N(0, \sigma_{e1}^2)$ and $e_{2i} \sim t(0, \sigma_{e2}^2, \nu_2)$; finally, for the Bayesian nonnormal-both model, $e_{1i} \sim t(0, \sigma_{e1}^2, \nu_1)$ and $e_{2i} \sim t(0, \sigma_{e2}^2, \nu_2)$. All four types of models are estimated using Bayesian methods.

In the Bayesian approach, we obtain the joint posterior distribution of the parameters from the prior distributions of the parameters and the likelihood of the data. Making statistical inferences directly from the joint posterior distribution is usually difficult. Gibbs sampling, a Markov chain Monte Carlo (MCMC) method, is a widely used algorithm for drawing a sequence of samples from the joint posterior distribution of two or more random variables, provided that the conditional posterior distributions of the model parameters can be obtained [7]. Specifically, Gibbs sampling samples the parameters one at a time, each from its conditional posterior distribution given the current values of the other parameters, which are treated as known. After a sufficient number of iterations, the sequence of samples constitutes a Markov chain that converges to a stationary distribution, which is the sought-after joint posterior distribution of the parameters [12].
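The alternation that Gibbs sampling performs can be illustrated with a toy example unrelated to the chapter's model: sampling from a bivariate normal with correlation $\rho$ by drawing each coordinate in turn from its conditional distribution given the other.

```python
import numpy as np

# Toy Gibbs sampler for a standard bivariate normal with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2)   and   y | x ~ N(rho * x, 1 - rho^2)
rho = 0.8
rng = np.random.default_rng(1)
x, y = 0.0, 0.0            # arbitrary initial values
draws = []
for it in range(20000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # sample x given current y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # sample y given current x
    if it >= 5000:                                 # discard burn-in iterations
        draws.append((x, y))

draws = np.array(draws)
print(draws.mean(axis=0))           # near (0, 0)
print(np.corrcoef(draws.T)[0, 1])   # near rho = 0.8
```

After burn-in, the retained draws behave like samples from the joint distribution, even though each step only ever used a one-dimensional conditional, which is exactly the property the chapter's sampler relies on.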

The Gibbs sampling algorithm is used to obtain the LATE estimate for the two-stage causal model with IVs. Because the $t$ distribution can be viewed as a normal distribution whose variance is weighted by a Gamma-distributed variable, the data augmentation method is used here to simplify the posterior distribution. Specifically, a Gamma random variable $\omega$ is augmented with a normal random variable because if $\omega_i \sim \mathrm{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$ and $y_i \,|\, \omega_i \sim N(\mu, \Psi/\omega_i)$, then $y_i \sim t(\mu, \Psi, \nu)$. The detailed steps of the Gibbs sampling algorithm for the Bayesian nonnormal-s2 model are given below. The Gibbs sampling procedures for the other models are similar.
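This normal/Gamma mixture representation can be verified by simulation. The sketch below draws from the augmented representation and from a $t$ distribution directly and compares their variances; for $\nu = 5$ the theoretical variance is $\nu/(\nu-2) = 5/3$.

```python
import numpy as np

rng = np.random.default_rng(2)
nu, mu, psi = 5.0, 0.0, 1.0
n = 200000

# Augmentation identity: omega ~ Gamma(nu/2, rate = nu/2) and
# y | omega ~ N(mu, psi/omega)  implies  y ~ t(mu, psi, nu).
# NumPy's gamma uses a scale parameterization, so scale = 2/nu.
omega = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
y_mix = rng.normal(mu, np.sqrt(psi / omega))

# Direct draws from a standard t with nu degrees of freedom
y_t = mu + np.sqrt(psi) * rng.standard_t(nu, size=n)

# Both samples should have variance close to nu/(nu-2) = 5/3
print(y_mix.var(), y_t.var())
```

The two empirical variances agree with each other and with $5/3$ up to Monte Carlo error, confirming that conditioning on the Gamma weight reduces every update to an ordinary normal draw.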

1. Start with initial values $\boldsymbol{\pi}_1^{(0)}$, $\boldsymbol{\pi}_2^{(0)}$, $\sigma_{e1}^{2(0)}$, $\sigma_{e2}^{2(0)}$, $\nu^{(0)}$, $\omega_i^{(0)}$, where $\boldsymbol{\pi}_1^{(0)} = \left(\pi_{10}^{(0)}, \pi_{11}^{(0)}\right)'$ and $\boldsymbol{\pi}_2^{(0)} = \left(\pi_{20}^{(0)}, \pi_{21}^{(0)}\right)'$.




### 4. Evaluation of four types of distributional 2SLS models

In this section, the performance of the four types of two-stage robust causal models is evaluated through a Monte Carlo simulation study. Data are generated from the general causal inference model presented in Eq. (7). Full Bayesian methods are used for the estimation of all four types of two-stage causal models. Specifically, noninformative priors are applied to all model parameters, conditional posterior distributions of all model parameters are obtained, Markov chains are generated through the Gibbs sampling algorithm, convergence tests are conducted, and finally statistical inferences for the model parameters are made. The free software R [41] and OpenBUGS [42] were used to implement the MCMC algorithms and estimate the models. A total of 20,000 iterations was conducted for each simulation condition, with the first 10,000 iterations discarded as the burn-in period.

#### 4.1. Study design

Data are generated from a general causal inference model

$$y\_i = 3 + 0.5x\_i + e\_i,\tag{7}$$

where $y_i$ is the causal outcome, $x_i$ is the causal treatment, and $e_i$ is the measurement error. Three potentially influential factors are considered. First, the sample size (N) is either 200 or 600. Second, the correlation between $x$ and $e$ ($\Phi$) is manipulated to be either 0.3 or 0.7, reflecting a relatively weak or strong linear relationship between the treatment and the measurement error. Third, the proportion of observations that are outliers is manipulated. The proportion of outliers (OP) is 0, 5, or 10%. When the OP is 0%, the data contain no outliers and the measurement errors $e_i$ are normally distributed. When the OP is above zero, the data contain outliers. For outliers, the measurement errors are generated from a different normal distribution with the same standard deviation but a larger mean (eight times the standard deviation). An IV is also generated from a normal distribution and correlated with $x$, with a correlation coefficient of 0.6.
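A sketch of one replication of this design might look as follows. It assumes a joint-normal construction for the treatment $x$, the IV $z$, and the error $e$ (with the IV uncorrelated with $e$), and takes the intercept in Eq. (7) to be 3:

```python
import numpy as np

def generate_data(n, phi, op, rng):
    """One replication of the study design (a sketch): treatment x,
    instrument z with cor(x, z) = 0.6, error e with cor(x, e) = phi,
    a proportion `op` of outliers with the error mean shifted by 8 SDs,
    and outcome y = 3 + 0.5 * x + e as in Eq. (7)."""
    # joint normal for (x, z, e) with the target correlations
    cov = np.array([[1.0, 0.6, phi],
                    [0.6, 1.0, 0.0],
                    [phi, 0.0, 1.0]])
    x, z, e = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    k = int(op * n)          # number of outlying cases
    e[:k] += 8.0             # error mean shifted by 8 standard deviations
    y = 3.0 + 0.5 * x + e
    return y, x, z

rng = np.random.default_rng(3)
y, x, z = generate_data(600, phi=0.3, op=0.10, rng=rng)
print(np.corrcoef(x, z)[0, 1])   # near 0.6 by construction
```

Because $x$ and $e$ are correlated by $\Phi$, the treatment is endogenous in every replication, which is what motivates the two-stage models in the next paragraph.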

If we fit a linear regression to the generated data, we will immediately notice that the residuals and the regressors are not independent. Therefore, we adopt the two-stage causal model with IVs. The four types of two-stage models (normal model, nonnormal-s1 model, nonnormal-s2 model, and nonnormal-both model) are used to fit the data. In the first stage, the IV is used to predict the endogenous treatment, and the estimated treatment is then used in the second stage to estimate the LATE. Based on Eq. (6), the theoretical LATE is 5/6.

As discussed previously, Bayesian methods using the Gibbs sampling algorithm are used to obtain the LATE estimates for the four types of two-stage causal models. The bias and standard error (SE) of the LATE estimate for each of the four distributional models are assessed. In addition, the deviance information criterion (DIC) [27] is examined for each condition to study model fit. A lower DIC value indicates a better model fit.
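The bias and empirical SE criteria can be computed across replications as follows; the four estimate values here are purely hypothetical:

```python
import numpy as np

def bias_and_se(estimates, true_value):
    """Monte Carlo bias and empirical standard error of an estimator
    across simulation replications (a sketch of the evaluation criteria)."""
    estimates = np.asarray(estimates)
    bias = estimates.mean() - true_value   # average deviation from the truth
    se = estimates.std(ddof=1)             # spread across replications
    return bias, se

# hypothetical LATE estimates from four replications, true LATE = 5/6
b, s = bias_and_se([0.80, 0.86, 0.84, 0.82], 5 / 6)
print(b, s)
```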

#### 4.2. Results

The bias and SEs of the LATE estimates from the four types of models when Φ = 0.3 are presented in Table 1.

| N | Data | OP | Normal (Bias) | Normal (SE) | Nonnormal-s1 (Bias) | Nonnormal-s1 (SE) | Nonnormal-s2 (Bias) | Nonnormal-s2 (SE) | Nonnormal-both (Bias) | Nonnormal-both (SE) |
|---|------|----|---------------|-------------|---------------------|---------------------|---------------------|---------------------|-----------------------|-----------------------|
| 200 | Normal | 0% | 0.001 | 0.154 | 0.004 | 0.154 | 0.004 | 0.155 | 0.003 | 0.155 |
| 200 | Nonnormal | 5% | 0.210 | 0.283 | 0.200 | 0.281 | 0.022 | 0.177 | 0.020 | 0.171 |
| 200 | Nonnormal | 10% | 0.342 | 0.379 | 0.341 | 0.378 | 0.060 | 0.157 | 0.050 | 0.155 |
| 600 | Normal | 0% | 0.021 | 0.076 | 0.023 | 0.076 | 0.023 | 0.077 | 0.023 | 0.076 |
| 600 | Nonnormal | 5% | 0.180 | 0.168 | 0.170 | 0.167 | 0.064 | 0.099 | 0.060 | 0.096 |
| 600 | Nonnormal | 10% | 0.390 | 0.230 | 0.380 | 0.210 | 0.077 | 0.099 | 0.070 | 0.098 |

Table 1. Bias and SEs of the LATE estimates for all the conditions when Φ = 0.3.

In almost all cases, models that use normal distributions to model normal data, and Student's $t$ distributions to model data with outliers, provide the best estimates, with smaller bias and SEs than the other types of two-stage causal models. For example, when N = 200, the normal model provides smaller bias and SE for normal data; similarly, the nonnormal-s2 and nonnormal-both models lead to smaller bias and SEs when they are used to fit data containing outliers. This shows that using Student's $t$ distributions to model data containing outliers is an effective way to accommodate heavy-tailed data or data containing outliers, a finding consistent with previous research [34, 36]. In causal inference studies, because practical data at either stage of the two-stage causal model with IVs are equally likely to contain outliers or to be normally distributed, we fit all four types of distributional models and try to decide which one fits best. From the results, modeling heavy-tailed data or data containing outliers with the nonnormal-both model provides more reliable parameter estimates than traditional methods that ignore the data distributions and model all data exclusively with normal distributions.

Although it is always a good choice to model normal data with normal distributions and heavy-tailed data or data containing outliers with Student's $t$ distributions, in practice, researchers may not know whether the first stage or the second stage of the model should account for the nonnormality. The simulation results show that when data contain outliers, the nonnormal-s2 and nonnormal-both models, which use $t$ distributions in the second stage, produce the smallest bias and SEs of the LATE estimates. This is probably because the causal effect of interest, the LATE, is housed in the second stage, and using a Student's $t$ distribution to model outliers in that stage is effective in capturing the LATE. By contrast, in the normal and nonnormal-s1 models, a normal distribution is used to model second-stage data that are heavy tailed or contain outliers. For all the nonnormal data conditions (i.e., OP = 5 or 10%), the nonnormal-s2 and nonnormal-both models, both of which use $t$ distributions in the second stage, outperform the other models, providing smaller bias and SEs of the LATE estimates regardless of sample size (N) and proportion of outliers (OP). Comparing the nonnormal-s2 and nonnormal-both models, the nonnormal-both model performs slightly better. Taking N = 600 and OP = 10% as an example, the bias and SE for the nonnormal-s2 model are 0.077 and 0.099, whereas those for the nonnormal-both model are slightly smaller at 0.070 and 0.098, showing that fitting nonnormal data with Student's $t$ distributions at both stages gives the best performance in terms of accuracy and efficiency of the LATE estimate.

Table 2 presents the results for DICs for the four types of two-stage causal models when Φ = 0.3.


Table 2. DICs of all the distributional models when Φ = 0.3.

In practice, the DIC can be used as a model selection criterion. To select the best-fitting parsimonious model, we first fit all four types of models to the data and then select the model with the smallest DIC. Notice that for normal data, all four types of models have similar DIC values. When data contain outliers, the nonnormal-s2 and nonnormal-both models provide the smallest DICs, indicating that these models fit the data better. In all data conditions in the study, the DICs of the nonnormal-s2 model and the nonnormal-both model are very similar, and either model can be adopted.

The proportion of outliers in the data affects the performance of the nonnormal-s2 and nonnormal-both models. Specifically, the larger the proportion of outliers, the more salient the advantage of the nonnormal-s2 and nonnormal-both models. For example, for the nonnormal data with N = 200 and OP = 5%, the bias from the normal model, the nonnormal-s2 model, and the nonnormal-both model is 0.210, 0.022, and 0.020, respectively; when OP becomes 10%, the bias from the normal model jumps to 0.342, whereas the bias from the nonnormal-s2 model changes only slightly to 0.060 and that from the nonnormal-both model to 0.050. Similarly, the preferred models provide less biased LATE estimates when the sample size is small, and their advantage is more apparent under small-sample conditions (e.g., [23]).

When Φ = 0.7, consistent with the results for Φ = 0.3, if the data contain outliers, using Student's $t$ distributions to model them provides more accurate and efficient LATE estimates and better model fit than using normal distributions. The advantage of using $t$ distributions is more obvious when the sample size is small and the proportion of outliers is large.

### 5. Discussion

In causal inference research, the issue of treatment endogeneity is commonly addressed with the 2SLS model with IVs, where the LATE is the causal effect of interest. Because practical data usually violate the normality assumption, using normal distributions to model heavy-tailed data or data containing outliers may result in inefficient or even biased LATE estimates. In the 2SLS model with IVs, data at either stage are equally likely to contain outliers or to be normally distributed. To address this problem, this study proposes four possible types of Bayesian two-stage robust causal models with IVs and evaluates the performance of the robust method using Student's $t$ distributions in causal modeling. The Monte Carlo simulation results show that modeling normal data with normal distributions, and heavy-tailed data or data containing outliers with Student's $t$ distributions, gives good performance in terms of accuracy, efficiency, and model fit. When data are normally distributed, the methods using normal distributions and those using Student's $t$ distributions perform equally well, providing similar bias, SEs, and DICs. In the presence of outliers, the nonnormal-s2 and nonnormal-both models, which take outliers into consideration and use Student's $t$ distributions in the second stage to model heavy-tailed data or data containing outliers, outperform the models that use normal distributions for either all the data or the second-stage data, yielding smaller bias and higher efficiency. In addition, the nonnormal-s2 model and the nonnormal-both model have smaller DICs than the other two models, suggesting better model fit. The nonnormal-s2 and nonnormal-both models are especially preferred when the sample size is small and the proportion of outliers is large, as they produce more accurate and efficient LATE estimates.

Note that fitting the nonnormal-both model to data may require longer Markov chains, as the degrees of freedom for the $t$ distributions at both stages need to be estimated. We also caution against simply using Student's $t$ distributions to model all data, as this approach is not always numerically optimal and can be computationally time consuming [28]. Additionally, Student's $t$ distributions are sensitive to skewness, so some nonnormally distributed data may not be well modeled by them. If data are highly skewed, alternative robust methods, such as those based on skew-$t$ distributions, may be considered [5].

### Author details

smallest DIC. Notice that for normal data, all four types of models have similar DIC values. When data contain outliers, nonnormal-s2 and nonnormal-both models provides the smallest DIC, indicating that these types of models fit the data better. In all data conditions in the study, the DICs of the nonnormal-s2 model and the nonnormal-both model are very similar, and

The proportions of outliers contained in the data have effect on the performance of the nonnormal-s2 model and the nonnormal-both model. Specifically, the larger the proportions of outliers, the more salient the advantages of the nonnormal-s2 and nonnormal-both models. For example, for the nonnormal data with N = 200 and OP = 5%, the bias from the normal model, the nonnormal-s2 model and the nonnormal-both model is 0.210, 0.022, and 0.020, respectively; when OP becomes 10%, the bias from the normal model jumps to 0.342, whereas the bias from the nonnormal-s2 model changes slightly to 0.060 and that from the nonnormal-both model is 0.050. Similarly, the preferred models provide less biased LATE estimates when sample size is small, and the advantage of the preferred models is more

When Φ = 0.7, consistent with the results from previous conditions when Φ = 0.3, when data have outliers, using Student's t distributions to model the data provides more accurate and efficient LATE estimates and better model fits than using normal distribution to model the data. The advantage of using t distributions is more obvious when sample size is small and the

In causal inference research, the issue of the treatment endogeneity is commonly addressed in the 2SLS model with IVs, where the LATE is the causal effect of interest. Because practical data usually violate the normality assumption, using normal distributions to model heavytailed data or data containing outliers may result in inefficient or even biased LATE estimate. In the 2SLS model with IVs, data at either stage are equally likely having outliers or are normally distributed. To address this problem, this study proposes four possible types of Bayesian two-stage robust causal models with IVs to the data, and evaluates the performance of the robust method using Student's t distributions in the causal modeling. The Monte Carlo simulation results show that modeling normal data with normal distributions and normal or heavy-tailed data or data containing outliers with Student's t distributions gives good performance in terms of accuracy, efficiency, and model fit. When data are normally distributed, the methods that either use normal distributions or the Student's t distributions perform equally well as they provide similar bias, SEs and DICs. In the presence of outliers, the nonnormal-s2 and the nonnormal-both models that take outliers into consideration and use Student's t distributions in the second stage to model heavy-tailed data or data containing outliers outperform other distribution models that use normal distributions to model either exclusively all the data or the second stage data in two-stage causal models with IVs with smaller bias and higher efficiency. In addition, the nonnormals2 model and the nonnormal-both model have smaller DICs than the other two models,

either model can be adopted.

230 Bayesian Inference

proportion of outliers is large.

5. Discussion

apparent under small sample conditions (e.g., [23]).

Dingjing Shi and Xin Tong\*

\*Address all correspondence to: xtong@virginia.edu

Department of Psychology, University of Virginia, Charlottesville, Virginia, USA

### References


[9] Gerber AS, Green DP. Field Experiments: Design, Analysis and Interpretation. New York, NY: W.W. Norton & Company; 2011

[10] Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach Based on Influence Functions. New York: John Wiley & Sons, Inc.; 1986

[11] Huber PJ. Robust Statistics. New York: John Wiley & Sons, Inc.; 1981

[12] Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;6:721-741

[13] Imbens G, Angrist JD. Identification and estimation of local average treatment effects. Econometrica. 1994;62:467-475

[14] Lange KL, Little RJ, Taylor JM. Robust statistical modeling using the t distribution. Journal of the American Statistical Association. 1989;84:881-896

[15] Lee SY, Xia YM. Maximum likelihood methods in treating outliers and symmetrically heavy-tailed distributions for nonlinear structural equations. Psychometrika. 2006;71:565-585

[16] Lee SY, Xia YM. A robust Bayesian approach for structural equation models with missing data. Psychometrika. 2008;73:343-364

[17] Osborne JW. Notes on the use of data transformation. Practical Assessment, Research & Evaluation. 2002;8(6)

[18] Osborne JW, Overbay A. The power of outliers (and why researchers should always check for them). Practical Assessment, Research & Evaluation. 2004;9(6):1-12

[19] Pinheiro JC, Liu C, Wu Y. Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics. 2001;10:249-276

[20] Rachman-Moore D, Wolfe RG. Robust analysis of a nonlinear model for multilevel educational survey data. Journal of Educational Statistics. 1984;9:277-293

[21] Seltzer M, Novak J, Choi K, Lim N. Sensitivity analysis for hierarchical models employing t level-1 assumptions. Journal of Educational and Behavioral Statistics. 2002;27:181-222

[22] Seltzer M, Choi K. Sensitivity analysis for hierarchical models: Downweighting and identifying extreme cases using the t distribution. Multilevel Modeling: Methodological Advances, Issues, and Applications. 2003;1:25-52

[23] Shi D, Tong X. Robust Bayesian estimation in causal two-stage least squares modeling with instrumental variables. In: van der Ark LA, Culpepper S, Douglas JA, Wang W-C, Wiberg M, editors. Quantitative Psychology Research. New York: Springer; 2017

[24] Shoham S. Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recognition. 2002;35:1127-1142

[25] Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011;22:1359-1366

[26] Song P, Zhang P, Qu A. Maximum likelihood inference in robust linear mixed-effects models using multivariate t distribution. Statistica Sinica. 2007;17:929-943



### **Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology and Social Sciences**

DOI: 10.5772/intechopen.70230

Alonso Ortega and Gorka Navarrete

Additional information is available at the end of the chapter


### Abstract

Since the mid-1950s, there has been a clear predominance of the Frequentist approach to hypothesis testing, both in psychology and in the social sciences. Despite its popularity in the field of statistics, Bayesian inference is barely known and used in psychology. Frequentist inference, and its null hypothesis significance testing (NHST), has been hegemonic through most of the history of scientific psychology. However, NHST has not been exempt from criticism. Therefore, the aim of this chapter is to introduce a Bayesian approach to hypothesis testing that may represent a useful complement, or even an alternative, to the current NHST. The advantages of this Bayesian approach over Frequentist NHST are presented, with examples that support its use in psychology and the social sciences. Conclusions are outlined.

Keywords: Bayesian inference, Bayes factor, NHST, quantitative research

### 1. Introduction

"Scientific honesty then requires less than had been thought: it consists in uttering only highly probable theories: or even in merely specifying, for each scientific theory, the evidence, and the probability of the theory in the light of this evidence". Lakatos [1, p. 208].

The nature and role of experimentation in science found its origins in the rise of the natural sciences during the sixteenth and seventeenth centuries [2]. Since then, knowledge has meant that theories must be corroborated either by the power of the intellect or by the evidence of the senses [1]. However, until the mid-to-late 1800s, "psychological experiments had been performed, but the science was not yet experimental" [3, p. 158]. It was not until 1875 that—either at

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Wundt laboratory in Leipzig or at James' laboratory in Harvard—experimental procedures were introduced and contributed to the development of psychology as an independent science [3]. For almost one and a half centuries, scientific research has relied mostly on empirical findings to support its hypotheses, models, and theories. From this point of view, psychology and the social sciences must take distance from rhetorical speculation, desist from unproven statements, and build their knowledge on the basis of empirical evidence [1, 4]. Almost a decade ago, Curran reemphasized that the aim of any empirical science is to pursue the construction of a cumulative base of knowledge [5]. However, it has also been emphasized that such cumulative knowledge—for a true psychological science—is not possible through the current and widespread paradigm of hypothesis testing [5–9]. Over approximately the last two decades, explicit claims have appeared in peer-reviewed articles, such as "Psychology will be a much better science when we change the way we analyze data" [7], "We need statistical thinking, not statistical rituals" [10], "Why most research findings are false" [11], or "Yes, psychologists must change the way they analyze their data…" [12]. Most critiques have been directed toward the current—and still predominant—approach to hypothesis testing (i.e., NHST) and its overreliance on p-values and significance levels [6, 11, 13], emphasizing its pervasive consequences for the construction of a cumulative base of knowledge in psychological science [8]. Despite all these warnings, they seem not to have generated a noteworthy echo in the scientific community, even though "it is evident that the current practice of focusing exclusively on a … decision strategy of null hypothesis testing can actually impede scientific progress" [14, p. 100].
Therefore, it seems reasonable to suggest that there is a need to make considerable changes to how we usually carry out research, especially if the goal is to ensure research integrity [6]. Regarding this matter, a frequently proposed alternative has been to move from the exclusive focus on p-values to incorporate other existing techniques such as "power analysis" [15] and "meta-analysis" [16], or to report and interpret "effect sizes" and "confidence intervals" [7]. However, in our view, a sounder alternative would be to move from the Frequentist paradigm to a Bayesian approach, which allows us not only to provide evidence against the null hypothesis but also in favor of it [17]. Furthermore, Bayesian analysis allows us to compare two (or more) competing models in light of the existing data, and not only on the basis of "theoretical probability distributions," as in the Frequentist approach to hypothesis testing [18].

A Bayesian approach would offer some interesting possibilities for both individual psychology researchers and the research endeavor in general. First, Bayesian analysis allows us to move from a dichotomous way of reasoning about results (e.g., either an effect exists or it does not) to a less artificial view that interprets results in terms of the magnitude of evidence (e.g., the data are more likely under H<sup>0</sup> than under Ha), and therefore allows us to better depict to what extent a phenomenon may occur. Second, a Bayesian approach naturally allows us to directly test the plausibility of both the null and the alternative hypothesis, whereas the current NHST paradigm does not. In fact, when a researcher does not reach a desired p-value, it is oftentimes falsely assumed that the effect "does not exist." As a consequence, the researcher's chances of getting his or her results published decrease dramatically, which brings us to our third argument. As is widely known, most scientific peer-reviewed journals do not show much interest in results that are not statistically significant. This common practice—or scientific standard—sadly reinforces the habit of thinking in terms of relevant or irrelevant findings. In our view, such standards do not promote scientific advance and quickly lead us to ignore promising but "non-significant" findings that could be further explored, fed into meta-analyses, or simply considered by other researchers in the field. Of course, systematically ignoring a portion of the research undermines the primary goal of scientific inquiry, which is to collect evidence and not merely to reject hypotheses. The facts and ideas exposed in this introductory section set forth the necessity of reanalyzing the way in which scientific evidence has been conceived during the NHST era.

The following sections will: (a) concisely address the NHST procedure, (b) introduce a Bayesian framework to hypothesis testing, (c) provide an example that highlights the advantages of a Bayesian approach over the current NHST in terms of the way in which scientific evidence is quantified, and (d) briefly summarize and discuss the benefits of a Bayesian approach to hypothesis testing.

### 2. Null hypothesis significance testing (NHST)


"Never use the unfortunate expression: accept the null hypothesis." Wilkinson and the Task Force on Statistical Inference APA Board of Scientific Affairs [19, p. 602].

The most influential methods in modern null hypothesis significance testing (NHST) were developed by Fisher, and by Neyman and Pearson, in the early and mid-1900s [20]. Since then, NHST has been broadly used to provide an association between empirical evidence and models or theories [21]. In the traditional NHST procedure, two hypotheses are postulated: a null hypothesis (i.e., H0) and a research hypothesis, also called the alternative (i.e., Ha), which describe two contrasting conceptions of some phenomenon [22]. When conducting a NHST, researchers usually seek to reject the null hypothesis (H0) on the basis of a p-value. When the observed p-value is lower than a predetermined significance level (i.e., alpha, usually α = 0.05), the conclusion is that the p-value constitutes supporting evidence favoring the plausibility of the alternative hypothesis [23]. However, a more important feature of this procedure, one that remains unknown to most scientists, including psychology researchers, is that NHST constitutes an amalgamation of two irreconcilable schools of thought in modern statistics: the Fisher test of significance, and the Neyman and Pearson hypothesis test [24, 25]. In this respect, Goodman stated that "it is not generally appreciated that the p-value, as conceived by Fisher, is not compatible with the Neyman and Pearson hypothesis in which it has become embedded" [25, p. 485]. In this synthesized NHST, the Fisherian approach contributes a test of significance based on p-values obtained from the data, whereas the Neyman and Pearson method incorporates the notion of error probabilities of the test (i.e., Type I and Type II).
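The decision rule just described is mechanical enough to sketch in a few lines. The example below is our own illustration (the group sizes, means, and random seed are arbitrary, and the p-value uses a large-sample normal approximation rather than the exact t distribution): it computes Welch's t statistic for two simulated groups and applies the conventional α = 0.05 cutoff.

```python
import math
import numpy as np

def welch_t(x, y):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    nx, ny = len(x), len(y)
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    return (x.mean() - y.mean()) / math.sqrt(vx / nx + vy / ny)

def two_sided_p(t):
    """Two-sided p-value using the large-sample normal approximation."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, size=100)   # baseline group
treated = rng.normal(0.5, 1.0, size=100)   # true mean shift of 0.5

t = welch_t(treated, control)
p = two_sided_p(t)
alpha = 0.05
print(f"t = {t:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

Note that the code answers only the Frequentist question, namely how surprising the data are under H0; it says nothing about the probability of H0 itself.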

### 2.1. Origins and rationale of NHST

First, in the early 1900s, Fisher [26, 27] developed a method that tested a single hypothesis (i.e., the null, or H0), which has mainly been understood as a hypothesis of "no effect" between variables (e.g., no relationship, no difference). Under the null hypothesis, as conceived by Fisher, the distribution of the test statistic t is known. Thus, as the test statistic moves away from its expected value, the null hypothesis becomes progressively less plausible; in other words, the observed result appears less likely to have occurred by chance. If the observed result has a sufficiently low probability of occurrence under H0 (i.e., a small p-value), then H0 should be rejected; otherwise, no conclusion can be reached. The question that logically arises is: what p-value is sufficiently small to reject H0? Fisher addressed this question clearly when he stated that the threshold should be determined by the context of the problem, and it was not until the 1950s that Fisher presented the first significance tables for establishing rejection thresholds [22]. However, Fisher [28] rejected the idea of establishing a conventional significance level and, in its place, recommended reporting the exact p-value (e.g., p = 0.019, not p < 0.05; see [10]). Similarly, May et al. indicated that the choice of a significance level should depend on the consequences of rejecting or failing to reject the null hypothesis [29]. Despite these recommendations about threshold determination, most scientists across research fields adopted standard significance levels (i.e., α = 0.05 or α = 0.01), which have been used—or misused—regardless of the hypotheses being tested.

Later, in 1933, Neyman and Pearson proposed a procedure in which two explicitly stated rival hypotheses are contrasted, one of them still considered the "null" hypothesis, as in the Fisher test [30]. Neyman and Pearson rejected Fisher's idea of testing only the null hypothesis. In this scenario, there are two hypotheses (i.e., the null and the alternative), and based on the observed p-value, the researcher has to decide whether or not to reject the null hypothesis. This decision rule confronts the researcher with the probability of committing two kinds of errors: Type I and Type II. As defined by Neyman and Pearson, the Type I error is the probability of falsely rejecting H<sup>0</sup> when H<sup>0</sup> is true [30]. Conversely, the probability of failing to reject H<sup>0</sup> when H<sup>0</sup> is false is the Type II error. For the sake of simplicity, an analogy for both kinds of errors can be found in the classic fairy tale "The boy who cried wolf!" When the young shepherd, Peter, shouted "Help! The wolf is coming!", the villagers believed the young boy's warning and quickly came to help him. However, when they found out that it was all a joke, they got angry. Believing the boy's false alarm can be considered a Type I error. Peter repeated the same joke a couple of times and, when the wolf actually appeared, the villagers did not believe the young shepherd's desperate calls. This situation is analogous to committing a Type II error [31].
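The long-run meaning of the Type I error rate can be checked by simulation. In this sketch (our own illustration; the sample size, replication count, and large-sample 1.96 cutoff are arbitrary choices), both groups are drawn from the same distribution, so H0 is true by construction and every rejection is a false alarm; with α = 0.05 the empirical rejection rate should settle near 5%.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000
false_alarms = 0

for _ in range(reps):
    # Both samples come from N(0, 1): the null hypothesis is true.
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    t = (x.mean() - y.mean()) / math.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / n)
    # Large-sample two-sided cutoff corresponding to alpha = 0.05.
    if abs(t) > 1.96:
        false_alarms += 1

print(f"Empirical Type I error rate: {false_alarms / reps:.3f}")
```

The printed rate hovers around 0.05, which is exactly what the Neyman and Pearson framework promises: control of the long-run false-alarm frequency, not the probability that any single H0 is true.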

Within this NHST framework, Fisher's p-value is then used to dichotomize effects into two categories: significant and non-significant results [21]. Consequently, on the one hand, obtaining significant results leads us to assume that the phenomenon under investigation can be considered "existing" and, therefore, can be used as supporting evidence for a particular model or theory. On the other hand, non-significant results are usually (and erroneously) considered "noise," implying the nonexistence of an effect [21]. In this last case, there are no findings to report. From this view, the evidence in favor of a research finding is judged solely on the ability to reject H<sup>0</sup> when a sufficiently low p-value is observed. This simple and appealing decision rule may constitute a very seductive way of thinking about results, that is: a phenomenon either exists or it does not. However, thinking in this fashion is fallacious, leads to misinterpretations of results and findings, and, more importantly, "it can distract us from a higher goal of scientific inquiry. That is, to determine if the results of a test have any practical value or not" [32, p. 7].

### 2.2. NHST: Common misconceptions and criticisms


As previously stated, most problems with, and criticisms of, the current NHST paradigm arise from the mismatch between these essentially incompatible statistical approaches [10, 33, 34]. In this line, Nickerson stated that "A major concern expressed by critics is that such testing is misunderstood by many of those who use it" [35, p. 241]. Some of these misconceptions are common among researchers and are interpretative in nature. As a matter of fact, Badenes-Ribera et al. recently reported the results of a survey of 164 academic psychologists who were questioned about the meaning of p-values [36]. The results confirmed previous findings on the prevalence of wrongful interpretations of p-values. For instance, the false belief that the p-value indicates the conditional probability of the null hypothesis given the data (i.e., p(H0|D)), instead of the probability of observing a result at least as extreme as the one obtained, assuming that the null hypothesis is true [37]. This wrong interpretation of the p-value is known as the "inverse probability" fallacy. Another common misconception is that p-values provide direct information about the magnitude of an effect, that is, that a p-value of 0.00001 represents evidence of a bigger effect than a p-value of 0.01. This conclusion is wrong because the only way to estimate the magnitude of an effect is to calculate the value of the effect size with the appropriate statistic and its confidence interval (e.g., Cohen's d; see [38]). This erroneous interpretation of the p-value is known as the "effect size" fallacy. A comprehensive review of these and other common misconceptions is beyond the scope of this chapter, but several resources on these topics are available for interested readers (see [14, 35, 37–40]).
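The "effect size" fallacy is easy to demonstrate numerically, because the p-value conflates magnitude with sample size. In the toy calculation below (our own numbers, using a known-variance z approximation for a two-sample comparison), two hypothetical studies observe exactly the same standardized mean difference, Cohen's d = 0.3, yet only the larger study reaches "significance."

```python
import math

def p_from_d(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d,
    treating the test statistic as z = d * sqrt(n / 2)
    (known-variance approximation for two equal groups)."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

d = 0.3  # identical effect size in both hypothetical studies
for n in (30, 800):
    # Same d, very different p: significance tracks n, not magnitude.
    print(f"n = {n:4d} per group, d = {d}: p = {p_from_d(d, n):.2e}")
```

The small study is "non-significant" and the large one is "highly significant" for an identical effect, which is precisely why the p-value cannot be read as a measure of effect magnitude.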

Likewise, the rationale underlying NHST has been widely criticized. Most criticisms against NHST focus on the way in which data are (unsoundly) analyzed and interpreted.


However, an issue of particular interest for this chapter is the use of p-values as a way to quantify statistical evidence [13, 41]. As previously stated, rejecting H<sup>0</sup> does not provide evidence in favor of the plausibility of Ha; all that can be concluded is that H<sup>0</sup> is unlikely [9]. Conversely, failing to reject H<sup>0</sup> simply allows us to state that—given the evidence at hand—one cannot make an assertion about the existence of some effect or phenomenon [42]. Hence, rejecting H<sup>0</sup> is not a valid indicator of the magnitude of evidence of a result [43]. In Schmidt's words: "… reliance on statistical significance testing in psychology and the other social sciences has led to frequent serious errors in interpreting the meaning of data, errors that have systematically retarded the growth of cumulative knowledge" [16, p. 120]. Despite the existence of scientific literature highlighting the weaknesses of NHST [9, 16, 21, 22, 39, 43–46], it is still considered the "sine qua non of the scientific method" [10, p. 199]. Moreover, NHST is arguably the most widely used method of data analysis in psychology since the mid-1950s and still governs the interpretation of quantitative data in social science research [35, 47]. In Krueger's words: "NHST is the researcher's workhorse for making inductive inferences" [45, p. 16]. An immediate matter of concern is that most scientific discoveries, in a wide range of research fields, are based on a procedure that still generates controversy (see [12, 48–50]). Since the focus of research should be on what data tell us about the magnitude of effects, it seems necessary to shift from our reliance on NHST to more robust alternatives [14]. Some recommended practices include estimates based on effect sizes, confidence intervals, and meta-analysis [6].
However, a sounder alternative comes from the Bayesian paradigm through the use of a simple estimate of the magnitude of evidence called the Bayes factor (BF) [17]. This approach to hypothesis testing has several benefits. First, it is not oriented toward pursuing the rejection of H<sup>0</sup>; on the contrary, it provides a way to obtain evidence both for and against H<sup>0</sup>. Second, it does not use arbitrary thresholds (i.e., significance levels) to reach dichotomous decisions about the plausibility or implausibility of H<sup>0</sup>; instead, it directly contrasts the magnitude of evidence for and against both H<sup>0</sup> and Ha. Third, it permits the continuous updating of evidence as new data become available, which is in line with the nature of scientific inquiry. Bayesian methods have long been suggested as a practical alternative to NHST [9, 17, 23, 51], but—until now—they have not received enough attention from researchers in psychology and social sciences.

### 3. Bayesian hypothesis testing: An alternative to NHST

"(…) prior and posterior are relative terms, referring to the data. Today's posterior is tomorrow's prior." Lindley [52, p. 301].

In the field of statistics, probabilities can be interpreted under two predominant paradigms: Frequentist inference and Bayesian inference. The former makes predictions about experiments whose outcomes depend basically upon random processes [53]. The latter assigns probabilities to any statement, even when a random process is not involved [54]. In a Bayesian framework, a probability is a way to embody an individual's degree of belief in a statement. Since the mid-1950s, there has been a clear predominance of the Frequentist approach to hypothesis testing, both in psychology and social sciences. The hegemony of Frequentist inference and its null hypothesis significance testing (NHST) might be partially attributed to the massive incorporation of such approaches in psychology undergraduate programs [9] and also to the fact that the Neyman and Pearson approach had the most well-developed computational software to conduct statistical inference [18]. However, the current scenario has drastically changed, and the development of sampling techniques like Markov-Chain Monte Carlo (MCMC; see [55, 56]) along with the availability and improvement of specifically developed software (e.g., WinBUGS, see [57, 58]; JAGS, see [59, 60]; JASP, see [61]) makes exact Bayesian inferences possible even in very complex models. As a result, "Bayesian applications have found their way into most social science fields" [22, p. 665], and psychologists can now easily implement Bayesian analysis for many common experimental situations (see for example JASP Statistics: https://jasp-stats.org/).

### 3.1. Bayes in a nutshell


In Bayesian inference, our degrees of belief about a set of hypotheses are quantified by probability distributions over those hypotheses [47, 62], which makes the Bayesian approach fundamentally different from the Frequentist approach, which relies on sampling distributions of data [47]. A Bayesian analysis usually involves the updating of prior knowledge or information in light of newly available experimental data [63]. The latter clearly reflects the aim of any empirical science, which is to strive for the elaboration of a cumulative base of knowledge. Any Bayesian analysis combines three sources of information: the prior distribution, the likelihood of the observed data, and the resulting posterior distribution.


The prior distribution, p(θ), represents our degree of uncertainty about the parameters included in the model; equivalently, it may be read as our degree of knowledge about those same parameters. The more informative the prior distribution, the lower our degree of uncertainty about the parameters. The likelihood is the conditional probability of observing the data under some latent parameter (i.e., p(D|θ)). Following Bayes' theorem [64], combining the prior with the likelihood produces updated knowledge about the model parameters after the data have been observed, which is known as the posterior distribution. The change from the prior to the posterior distribution reflects what has been learned from the data (see Figure 1). Thus, within a Bayesian framework, a researcher can invest more effort in the specification of prior distributions by translating existing knowledge about the phenomenon under study into prior distributions [65]. As suggested by Lee and Wagenmakers, "such knowledge may be obtained by eliciting prior beliefs from experts, or by consulting the literature for earlier work on similar problems" [65, p. 110].

As shown in Figure 1, the strength of each source of information is indicated by the narrowness of its curve. A narrower curve is more informative about the value of parameters, whereas a wider one is less informative.

Bayes' rule specifies how the prior information p(θ) and the likelihood p(D|θ) are combined to arrive at the posterior distribution denoted by p(θ |D), in Eq. (1):

$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)}\tag{1}$$

Eq. (1) is usually paraphrased as:

$$p(\theta|D) \propto p(D|\theta)p(\theta) \tag{2}$$

which means "the posterior is proportional (i.e., ∝) to the likelihood times the prior." In other words, the observed data (i.e., likelihood) update our previous degree of knowledge (i.e.,

Figure 1. Prior, likelihood and posterior probability distributions.

prior) in proportion to their informative strength, producing a new state of knowledge about the parameters of the model (i.e., the posterior). One of the benefits of the Bayesian approach is that the prior (i.e., p(θ)), our present knowledge about the model parameters, moderates the influence provided by the data (i.e., p(D|θ)). This compromise leads to less pessimism when data are unexpectedly bad and less optimism when they are unexpectedly good [66]. Both influences are beneficial and help us make more realistic inferences and better decisions. For more detailed information on Bayesian inference see, for instance, O'Hagan and Forster [54], Kruschke [59], and Jackman [67].
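To make Eq. (2) concrete, here is a minimal sketch of our own (not from the chapter) using a conjugate beta-binomial model, where the prior-to-posterior update of Eq. (1) has a closed form and the moderating influence of the prior is visible in the posterior mean:

```python
# A minimal sketch of Bayes' rule with a conjugate beta-binomial model
# (illustrative values chosen by us). A prior Beta(a, b) combined with
# k successes in n trials yields the posterior Beta(a + k, b + n - k).

a, b = 8.0, 8.0   # informative prior: belief centered on 0.5
k, n = 9, 10      # unexpectedly "good" data: 9 successes in 10 trials

prior_mean = a / (a + b)              # 0.5
mle = k / n                           # 0.9, the data-only estimate
post_mean = (a + k) / (a + b + n)     # posterior mean

# The posterior mean lies between the prior mean and the data estimate:
# the prior moderates an unexpectedly good result.
print(prior_mean, mle, round(post_mean, 3))   # 0.5 0.9 0.654
```

A stronger prior (larger a + b) pulls the posterior mean closer to 0.5; a weaker one lets the data dominate.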

### 3.2. Bayes factor

Bayesian approaches for hypothesis testing are comparative in nature. Different models often represent competing theories or hypotheses, and the focus of interest is on which one is more plausible and better supported by the data [65]. Therefore, the Bayesian approach allows us to quantify the plausibility of a given model or hypothesis (i.e., H0) against that of an alternative model (i.e., Ha). For any comparison of two competing models or hypotheses (e.g., Ha vs. H0), we can rely on an estimate of evidence known as the Bayes factor [52]. One of the attractive features of the Bayes factor is that it follows the principle of parsimony: When two models fit the data equally well, the Bayes factor prefers the simple model over the more complex one [68]. Nonetheless, in contrast to the NHST approach, "Bayesian statistics assigns no special status to the null hypothesis, which means that Bayes factors can be used to quantify evidence for the null hypothesis just as for any other hypothesis" [65, p. 108].

Before observing the data, the prior odds of Ha over H0 are p(Ha)/p(H0), and after having observed the data we have the posterior odds p(Ha|D)/p(H0|D). Therefore, the ratio of the posterior odds to the prior odds is defined as the Bayes factor:

$$BF_{H_aH_0} = \frac{p(D|H_a)}{p(D|H_0)} = \frac{p(H_a|D)\,/\,p(H_0|D)}{p(H_a)\,/\,p(H_0)} = \frac{\text{posterior odds}}{\text{prior odds}}\tag{3}$$

Eq. (3) shows the Bayes factor for given data D and two competing hypotheses (i.e., H<sup>0</sup> vs. Ha), which is a measure of the evidence for H<sup>a</sup> against H<sup>0</sup> provided by the data. In other words, the Bayes factor is the probability of the data under one hypothesis relative to the other. For instance, a BF<sub>HaH0</sub> = 3 indicates that H<sup>a</sup> is three times more plausible relative to H<sup>0</sup> than it was a priori. From this view, the Bayes factor may be considered analogous to the Frequentist likelihood ratio. Nevertheless, in the Bayesian context there is no reference at all to theoretical probability distributions, as is customary in a Frequentist approach. In a Bayesian framework, all inferences are made conditional on the observed data, and therefore, the Bayes factor has to be interpreted as a summary measure of the information provided by the data about the relative plausibility of two models or hypotheses (e.g., Ha vs. H0). Jeffreys [52] suggests the following scale for interpreting the Bayes factor (Table 1), although some people argue against the use of thresholds, lest we fall into a different version of the old p < 0.05 ritual (see, for instance, [69]).


Adapted from Jeffreys [52, p. 433], and Lee and Wagenmakers [65, p. 105].

Table 1. Evidence categories for the Bayes factor.<sup>1</sup>
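For readers who want this scale in executable form, the evidence categories of the Lee and Wagenmakers [65] adaptation of Jeffreys' scale can be written as a small lookup. This is a sketch of our own; the category names follow the standard published scale:

```python
# Evidence labels for a Bayes factor, following the standard Lee and
# Wagenmakers adaptation of Jeffreys' scale (a sketch, not the chapter's code).

def bf_label(bf10):
    """Return an evidence label for a Bayes factor BF10 (Ha over H0)."""
    target = "Ha" if bf10 >= 1 else "H0"
    strength = bf10 if bf10 >= 1 else 1.0 / bf10   # strength is symmetric
    if strength == 1:
        return "no evidence"
    for bound, name in [(3, "anecdotal"), (10, "moderate"),
                        (30, "strong"), (100, "very strong")]:
        if strength < bound:
            return f"{name} evidence for {target}"
    return f"extreme evidence for {target}"

print(bf_label(4.656))   # moderate evidence for Ha
print(bf_label(0.5))     # anecdotal evidence for H0
```

Note that such labels are a communication aid, not decision thresholds; as discussed later, treating them as cutoffs would recreate the p < 0.05 ritual.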


### 4. Bayesian vs. Frequentist approaches to hypothesis testing: An example

Bayes factors to evaluate the amount of evidence for or against H<sup>0</sup> and H<sup>a</sup> are one of the big selling points of the Bayesian framework.<sup>1</sup> As stated in the previous section, the core idea is that the magnitude of evidence in favor of the null hypothesis compared to that of the alternative hypothesis can be estimated (or vice versa). As we have seen, this approach has multiple advantages, such as departing from a hit-or-miss approach to results reporting, or being able to show evidence in favor of the null. The possibility of providing evidence in favor of both the null and the alternative hypotheses has some important advantages. One of them is that it helps to overcome one of the most common issues behind the well-known file-drawer effect, in that results do not suddenly become meaningless when the p-value is over a certain threshold. Another advantage is that it gives us more freedom when establishing hypotheses, particularly in topics where hypothesizing the absence of differences may be necessary for theoretical advance.

In this section, an example from a field known as Bayesian reasoning will be presented, which deals with how people update their beliefs when new evidence is available (e.g., when receiving a positive result in a medical test, how likely is it that I have the disease?). There is a long-standing debate in the field about why people are unable to solve medical screening problems such as the one shown in Table 2 when the information is shown in a standard probability format (i.e., single-event probabilities; for instance, 1% have cancer), but perform comparatively better when the same information is shown in a standard frequency format (i.e., natural frequencies; for instance, 10 in 1000 have cancer). As is often the case, the debate about these issues is very complex (for a review, see [71]), and the present example will focus on a single unnuanced aspect with the goal of showing the usefulness of the Bayesian statistics paradigm.

#### Standard probability format

The probability of breast cancer is 1% for women at age 40 who participate in routine screening. If a woman has breast cancer, the probability is 80% that she will get a positive mammography. If a woman does not have breast cancer, the probability is 9.6% that she will also get a positive mammography.

A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer? \_\_\_\_\_%

#### Standard frequency format

Ten out of every 1000 women at age 40 who participate in routine screening have breast cancer. Eight of every 10 women with breast cancer will get a positive mammography. Ninety-five out of every 990 women without breast cancer will also get a positive mammography.

Here is a new representative sample of women at age 40 who got a positive mammography in routine screening. How many of these women do you expect to actually have breast cancer? \_\_\_\_out of\_\_\_\_

Table 2. Standard probability and standard frequency format problems, as shown by Gigerenzer and Hoffrage [72].

<sup>1</sup> However, we recommend the interested reader to consult a recent paper by Lakens [70], which describes an approach to test for equivalence within a Frequentist framework.

Some authors [73, 74] argue that the crucial factor explaining the differences between the two versions is not the representation format itself (i.e., probabilities or natural frequencies), but the computational complexity induced by the reference class of the problems [75]. In brief, the probability version has a relative reference class: all the numbers refer to the group above them (e.g., 80% of the 1% who have breast cancer will get a positive mammography). To solve the problem, we need to use the base rates (in this example, the percentages of women with and without breast cancer: 1 and 99%) and the percentage of women who got a positive mammography within each of those two groups (80 and 9.6%; see Eq. (4)). In the frequency version, the reference class is absolute, and all numbers can be read as referring to the 1000 women, so we can ignore the base rates and directly use the positive mammographies for women with and without cancer (8 and 95; see Eq. (5)). The aforementioned authors hypothesized that when reference class and computational complexity are taken into account, there is no difference between probabilities and natural frequencies. In other words, they expect the null hypothesis to be true (Figure 2).


$$p(H|D) = \frac{1\% \times 80\%}{1\% \times 80\% + 99\% \times 9.6\%} = 0.077\tag{4}$$

$$p(H|D) = \frac{8}{8+95} = 0.077\tag{5}$$
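Eqs. (4) and (5) can be checked with a few lines of arithmetic (an illustrative sketch of our own; the variable names are ours):

```python
# The two calculations in Eqs. (4) and (5) carried out explicitly.

# Standard probability format (Eq. (4)): base rates and conditional
# probabilities must all be combined via Bayes' rule.
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096
p_eq4 = (p_cancer * p_pos_given_cancer) / (
    p_cancer * p_pos_given_cancer
    + (1 - p_cancer) * p_pos_given_healthy)

# Standard frequency format (Eq. (5)): with an absolute reference class,
# the base rates can be ignored and the counts used directly.
p_eq5 = 8 / (8 + 95)

print(round(p_eq4, 3), round(p_eq5, 3))   # 0.078 0.078
```

Both formats describe (essentially) the same posterior probability; only the arithmetic route differs, which is precisely the point of the reference-class argument.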

Now, imagine two PhD students, a Frequentist (i.e., Student 1) and a Bayesian (i.e., Student 2). After reading a critical but often ignored paper by Fiedler [73], they had the idea that computational complexity (and not representation format) is the key issue when trying to understand how people solve Bayesian reasoning problems. They devise a very simple experiment where two different groups of people will be asked to solve one Bayesian reasoning problem shown either in single-event probabilities or in natural frequencies. In both cases, the arithmetic complexity (i.e., the number of arithmetic steps required to solve the problem) will be exactly 2. That is, to solve the problems, participants would need to do two arithmetic operations, a sum and a division. They used a test with 100% sensitivity and 0% specificity, which would have no clinical application but is useful for removing a few arithmetic steps from the probability format and checking whether computational complexity underlies Bayesian reasoning. With this manipulation, the algorithms to solve the probability and frequency versions become

Figure 2. Relative and absolute reference classes, represented by the reference of the last row (test results). In the relative reference class, the information about the test, for example, 80% positive (+) and 20% negative (−) results, refers to the 1% of women with BC, but not to 100% of the women (it is not 80% of the 100%!). However, in the absolute reference class, the same information, 8+ and 2−, refers to the women with BC, but also directly to the 1000 women. This translates into the need to use Eq. (4) for relative probabilities and Eq. (5) for absolute frequencies.

Eqs. (6) and (7), respectively. It is easy to see how both have become roughly equivalent now in terms of arithmetic complexity.

$$p(H|D) = \frac{10\% \times 100\%}{10\% \times 100\% + 90\% \times 100\%} = \frac{10\%}{10\% + 90\%} = 0.1\tag{6}$$

$$p(H|D) = \frac{10}{10 + 90} = 0.1\tag{7}$$

As can be deduced, Student 1 takes a Fisherian approach to statistics and Student 2 a Bayesian approach. Both run an experiment with a total of 62 participants (31 per group),<sup>2</sup> and obtain the following results:


#### 4.1. PhD Student 1—Frequentist

Student 1, as most good NHST practitioners would do, conducts a chi-square test and reports that he did not obtain a significant effect of representation format when arithmetic steps were equal (χ<sup>2</sup> = 0.088, p = 0.767). He is happy because this is congruent with his hypothesis. He then writes a brief report detailing his idea and experimental results and sends the manuscript draft to his advisor. A few days later, he receives his advisor's feedback, telling him that the non-significant results could be caused by a number of reasons and, as a consequence, are hard to interpret.
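Student 1's chi-square test can be reproduced from the observed counts (23 and 24 incorrect vs. 8 and 7 correct answers per group). The chapter does not say which software he used, so the following is our own stdlib-only sketch:

```python
from math import erfc, sqrt

# Reproducing Student 1's chi-square test by hand. Rows: accuracy
# (0 = incorrect, 1 = correct); columns: natural frequencies, probabilities.
observed = [[23, 24],
            [8, 7]]

row_totals = [sum(row) for row in observed]         # [47, 15]
col_totals = [sum(col) for col in zip(*observed)]   # [31, 31]
n = sum(row_totals)                                 # 62

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# with expected counts row_total * col_total / n.
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# For df = 1 the chi-square survival function equals erfc(sqrt(x / 2)).
p = erfc(sqrt(chi2 / 2))
print(round(chi2, 3), round(p, 3))   # 0.088 0.767
```

This matches the values reported in the text (without Yates' continuity correction).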


<sup>2</sup> Of course, the sample size and manipulation of this experiment are more congruent with a pilot experiment than with a real one that could be sent to a journal on its own. As a side note, take into account that one of the advantages of the Bayesian framework some authors propose is a sequential sampling rule, where sampling stops when the evidence (BF) crosses a predetermined threshold (e.g., BF<sub>10</sub> > 10 or < 0.1); see Lindley [76].

His advisor suggests carrying out a few more experiments using variations of the task and decent sample-sizes, to be able to perform a meta-analysis that could convince the editorial board of a journal that their endeavor is noteworthy, as they would probably have a hard time publishing those non-significant results by themselves.

### 4.2. PhD Student 2—Bayesian

**Contingency tables**

| Accuracy | Natural frequencies | Probabilities | Total |
|----------|--------------------:|--------------:|------:|
| 0        | 23                  | 24            | 47    |
| 1        | 8                   | 7             | 15    |
| Total    | 31                  | 31            | 62    |

**Chi-square tests**

|     | Value | df | p     |
|-----|------:|---:|------:|
| χ²  | 0.088 | 1  | 0.767 |
| N   | 62    |    |       |

Student 2, instead of performing a chi-square test, prefers to use a well-known analysis among Bayesian statisticians called the Bayes factor (BF; see [17, 65]). He uses easy-to-use software called JASP [61], which incorporates Bayesian contingency tables and outputs BF results in ready-to-use APA-formatted tables. He finds that when arithmetic steps are equal there is a BF01 of 4.656; that is, there is 4.6 times more evidence in favor of the null hypothesis than the alternative hypothesis. Together with his advisor, he sends the manuscript to a journal, arguing for the relative importance of arithmetic complexity over representation format. In practical terms, it is more likely that the editor will be willing to publish this interesting result, although the amount of evidence in favor of the null would be considered moderate by some standards (see [53]).
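For illustration, a Bayes factor for the same 2×2 data can be computed in closed form under independent binomial likelihoods with uniform Beta(1, 1) priors. This is a simplified sketch of our own: JASP's contingency-table Bayes factors use different default priors, so the value differs somewhat from the BF01 = 4.656 reported above.

```python
from math import lgamma, exp

# A sketch of BF01 for the 2x2 data under uniform Beta(1, 1) priors
# (our own simplified prior setup, not JASP's defaults).

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

k1, n1 = 8, 31   # correct answers, natural frequencies group
k2, n2 = 7, 31   # correct answers, probabilities group

# H0: a single common success probability theta ~ Beta(1, 1).
# (The binomial coefficients cancel in the ratio, so they are omitted.)
log_m0 = log_beta(1 + k1 + k2, 1 + (n1 - k1) + (n2 - k2))

# Ha: separate probabilities, each ~ Beta(1, 1).
log_m1 = log_beta(1 + k1, 1 + n1 - k1) + log_beta(1 + k2, 1 + n2 - k2)

bf01 = exp(log_m0 - log_m1)   # evidence for H0 over Ha
print(round(bf01, 2))
```

Even with this cruder prior choice, the conclusion is the same in kind: the data favor the null over the alternative by a factor of a few.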


Note: For all tests, the alternative hypothesis specifies that group Natural-Frequencies is greater than group Probabilities.

As the evidence for the null effect is not very strong, they would need to run a few more studies with variations to replicate the finding and show, using BF, how much more evidence there is for the null hypothesis compared to the alternative hypothesis. Alternatively, they could increase the sample size in their experiment until the stopping rule threshold (e.g., BF10 < 0.1) is reached.
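The sequential sampling rule mentioned in footnote 2 can be sketched as a simple loop. All design choices below (a point-null binomial test with a uniform alternative, the thresholds, the simulated effect) are our illustrative assumptions, not the chapter's design:

```python
import random
from math import lgamma, log, exp

# Sketch of a sequential stopping rule: collect observations one at a time
# and stop once BF10 crosses a preset threshold (or a maximum n is reached).

def bf10_binomial(k, n):
    """BF10 for H0: theta = 0.5 vs Ha: theta ~ Beta(1, 1), k successes in n trials."""
    log_m1 = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)  # log B(k+1, n-k+1)
    log_m0 = n * log(0.5)
    return exp(log_m1 - log_m0)

random.seed(1)
true_theta = 0.8   # simulated true success probability
k = n = 0
bf = 1.0
while 0.1 < bf < 10 and n < 500:
    k += random.random() < true_theta   # draw one Bernoulli observation
    n += 1
    bf = bf10_binomial(k, n)

print(n, round(bf, 2))
```

With a real effect present, the Bayes factor typically crosses the upper threshold after a modest number of observations; with no effect, it drifts toward the lower one.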

This example aimed to describe (in a very simplified manner) one of the practical advantages of the Bayesian framework, namely being able to present the amount of evidence both for and against the null and alternative hypotheses. This, combined with the incremental nature of Bayesian inference, allows us to move away from the hit-or-miss approach generally reinforced by the NHST framework, in which significant results are seen as more valuable than non-significant ones.

### 5. Conclusion

During the past 70 years, NHST has dominated the way in which knowledge is produced and interpreted, and it still governs the way in which researchers analyze their data, reach conclusions, and report results [10, 45]. This approach has been widely criticized [9, 16, 21, 22, 39, 43–46], and "a major concern expressed by critics is that such testing is misunderstood by many of those who use it" [35, p. 241]. Some authors [9, 13] emphasized that one of the most pervasive influences of the NHST approach has been its overreliance on p-values, and in particular, the way in which p-values have been interpreted (see, for instance, [35, 36, 77]). One of the most common misinterpretations of p-values has been to consider a p-value a valid indicator of the magnitude of evidence of a result (i.e., the effect size fallacy). Regarding this point, Cohen emphasized that the only way to estimate the magnitude of an effect is to calculate the value of the effect size with the appropriate statistic and its confidence interval [38]. The correct way to interpret p-values is twofold. On one hand, rejecting H<sup>0</sup> only allows us to conclude that H<sup>0</sup> is unlikely. On the other hand, failing to reject H<sup>0</sup> simply allows us to state that—given the evidence at hand—one cannot make an assertion about the existence of some effect or phenomenon [42]. An immediate consequence of the wrong way in which a large number of researchers interpret p-values is that null results have usually been considered as the absence of evidence of the existence of an effect. This perspective regarding the decisions made when a given p-value threshold is not reached (i.e., p < 0.05) does not promote scientific advance and quickly leads to a systematic bias toward ignoring promising but "non-significant" findings that might be further explored, fed into meta-analyses, or simply considered by other researchers in the field.
This runs counter to the pursuit of any empirical science and may be harmful to the construction of a cumulative base of knowledge [5].

As a way to provide a complementary (or alternative) method to deal with current NHST practice, we described here a Bayesian approach to hypothesis testing. A Bayesian approach allows us to think about phenomena in terms of the magnitude of evidence that supports the existence of an effect, instead of a dichotomous and artificial way of thinking in which an effect either exists or does not exist [21]. As described in previous sections, a Bayesian approach provides us a measure of evidence for and against both the null and the alternative hypotheses (i.e., the Bayes factor, BF; see [17]). The use of Bayes factors helps to overcome one of the most common issues behind the well-known file-drawer effect, reducing the existing bias through which results suddenly become meaningless when the p-value is over a certain threshold (e.g., p > 0.05). A straightforward feature of this approach is that "Bayesian statistics assigns no special status to the null hypothesis, which means that Bayes factors can be used to quantify evidence for the null hypothesis just as for any other hypothesis" [65, p. 108]. Therefore, a Bayesian approach gives us more freedom when establishing hypotheses, for example in topics where hypothesizing the absence of differences may be necessary for theoretical advance.

Historically, a major problem with Bayesian statistics has been that it requires complex and intricate mathematical calculations that were analytically intractable, at least without the required techniques and specialized software. This scenario changed dramatically during the 1990s with the development of sampling techniques like Markov chain Monte Carlo (MCMC; see [55]), along with the availability and improvement of purpose-built software (e.g., WinBUGS, see [57, 58]; JAGS, see [59, 60]) that makes exact Bayesian inference possible even in very complex models. Nowadays, the relatively recent implementation and availability of Bayesian analysis in easy-to-use, open software such as JASP [61], R packages such as BayesFactor [78], or more specialized tools like WinBUGS, JAGS, or Stan (http://mc-stan.org/) makes Bayesian statistics accessible to all researchers, academics, and students. This widespread availability, paired with the advantages of the Bayesian approach described in this chapter and several times elsewhere [79–82], should help establish the Bayesian paradigm as a viable and popular alternative to NHST.

Despite all the important advantages of the Bayesian paradigm, as always, there is potential for misuse. As pointed out by Morey, the interpretation of the Bayes factor is very natural (i.e., as the amount of evidence in favor of one hypothesis in comparison to another) and does not need specific decision thresholds, as is the case with p-values [83]. However, some standards that could help to communicate BF results have been proposed (see [53]) and may be helpful to people who are not familiar with them. Nonetheless, the introduction of these labels also creates an opportunity for misuse, as they could be misinterpreted as decision boundaries. It is very important to be aware of this fact, and to be careful when using them, to avoid making "BF > 3" the new "p < 0.05."

To sum up, the main goal of this chapter has been to increase awareness of the limitations of the NHST approach and to highlight the advantages of the Bayesian approach. We hope that the inclusion of an easy-to-understand example of a specific case in which the Bayesian paradigm shows its practical utility offers readers new to this matter a glimpse of the usefulness of this alternative way of analyzing and interpreting their data. As a final remark, we would like to point out an often-heard recommendation for people interested in starting to use BFs: report them alongside p-values and effect size measures, to ease the transition to the new paradigm and to make them comprehensible to people not yet familiar with them.

### Author details


248 Bayesian Inference

Alonso Ortega<sup>1</sup> \* and Gorka Navarrete<sup>2</sup>


2 Center for Social and Cognitive Neuroscience (CSCN), School of Psychology, Universidad Adolfo Ibáñez, Chile

### References


[3] Harper RS. The first psychological laboratory. Isis. 1950;41(2):158-161

[4] Popper KR. Degree of confirmation. The British Journal for the Philosophy of Science. 1954;5(18):143-149

[5] Curran PJ. The seemingly quixotic pursuit of a cumulative psychological science: Introduction to the special issue. Psychological Methods. 2009;14(2):77-80

[6] Cumming G. The new statistics: Why and how. Psychological Science. 2013;25(1):7-29

[7] Loftus GR. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science. 1996;5(6):161-171

[8] Rossi JS. A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning. In: Harlow L, Mulaik S, Steiger J, editors. What If There Were No Significance Tests. Mahwah, NJ: Erlbaum Associates Publishers; 1997. pp. 175-197

[9] Wagenmakers E-J. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review. 2007;14(5):779-804

[10] Gigerenzer G. We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences. 1998;21(2):199-200

[11] Ioannidis JP. Why most published research findings are false. PLOS Medicine. 2005;2(8):e124

[12] Wagenmakers EJ, Wetzels R, Borsboom D, Van Der Maas HL. Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology. 2011;100(3):426-432

[13] Llobell JP, Dolores M, Navarro F, et al. Usos y abusos de la significación estadística: propuestas de futuro ("Necesidad de nuevas normativas editoriales"). Metodologia de las Ciencias del Comportamiento. 2004;Volumen Especial:465-469

[14] Kirk RE. The importance of effect magnitude. In: Davis SF, editor. Handbook of Research Methods in Experimental Psychology. Malden, MA: Blackwell Publishing; 2003. pp. 83-105

[15] Cohen J. A power primer. Psychological Bulletin. 1992;112(1):155-159

[16] Schmidt FL. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods. 1996;1(2):115-129

[17] Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;90(430):773-795

[18] Dienes Z. Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science. 2011;6(3):274-290

[19] Wilkinson L, Task Force on Statistical Inference, APA Board of Scientific Affairs. Statistical methods in psychology journals: Guidelines and explanations. American Psychologist. 1999;54:594-604

[20] Levine TR, Weber R, Hullett C, Park HS, Lindsey LLM. A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research. 2008;34(2):171-187

[36] Badenes-Ribera L, Frias-Navarro D, Iotti B, Bonilla-Campos A, Longobardi C. Misconceptions of the p-value among Chilean and Italian academic psychologists. Frontiers in Psychology. 2016;7:1247

[37] Kline RB. Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association; 2013

[38] Cohen J. The earth is round (p < .05). American Psychologist. 1994;49:997-1003

[39] Carver R. The case against statistical significance testing. Harvard Educational Review. 1978;48(3):378-399

[40] Rozeboom WW. The fallacy of the null-hypothesis significance test. Psychological Bulletin. 1960;57(5):416

[41] Wetzels R, Raaijmakers JG, Jakab E, Wagenmakers E-J. How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review. 2009;16(4):752-760

[42] Cohen J. The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology. 1962;65(3):145

[43] Shaver JP. What statistical significance testing is, and what it is not. The Journal of Experimental Education. 1993;61(4):293-316

[44] Carver RP. The case against statistical significance testing, revisited. The Journal of Experimental Education. 1993;61(4):287-292

[45] Krueger J. Null hypothesis significance testing: On the survival of a flawed method. American Psychologist. 2001;56(1):16

[46] Meehl PE. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology. 1978;46(4):806-834

[47] Wetzels R, Matzke D, Lee MD, Rouder JN, Iverson GJ, Wagenmakers E-J. Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science. 2011;6(3):291-298

[48] Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas H. Yes, psychologists must change the way they analyse their data: Clarifications for Bem, Utts, and Johnson (2011). 2011. Available from: http://web.stanford.edu/class/psych201s/psych201s/papers/ClarificationsForBemUttsJohnson.pdf [Accessed: July 26, 2017]

[49] Bem DJ. Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology. 2011;100(3):407-425

[50] Bem DJ, Utts J, Johnson WO. Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology. 2011;101(4):716-719

[51] Bernardo JM. A Bayesian analysis of classical hypothesis testing. Trabajos de Estadística y de Investigación Operativa. 1980;31(1):605-647

[69] Bigler ED. Symptom validity testing, effort, and neuropsychological assessment. Journal of the International Neuropsychological Society. 2012;18(4):632-640

[70] Lakens D. Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. 2017;March 4:1-21

[71] Barbey AK, Sloman SA. Base-rate respect: From ecological rationality to dual processes. Behavioral and Brain Sciences. 2007;30(3):241-254

[72] Gigerenzer G, Hoffrage U, Mellers BA, et al. How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review. 1995;102:684-704

[73] Fiedler K, Brinkmann B, Betsch T, Wild B. A sampling approach to biases in conditional probability judgments: Beyond base rate neglect and statistical format. Journal of Experimental Psychology: General. 2000;129(3):399-418

[74] Lesage E, Navarrete G, De Neys W. Evolutionary modules and Bayesian facilitation: The role of general cognitive resources. Thinking & Reasoning. 2013;19(1):27-53

[75] Ayal S, Beyth-Marom R. The effects of mental steps and compatibility on Bayesian reasoning. Judgment and Decision Making. 2014;9(3):226-242

[76] Lindley DV. Bayesian Statistics: A Review. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1972

[77] Gliner JA, Leech NL, Morgan GA. Problems with null hypothesis significance testing (NHST): What do the textbooks say? The Journal of Experimental Education. 2002;71(1):83-92

[78] Morey RD, Rouder JN. BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-2. 2015. Available from: https://cran.r-project.org/package=BayesFactor [Accessed: June 21, 2017]

[79] Berry DA. Bayesian clinical trials. Nature Reviews Drug Discovery. 2006;5(1):27-36

[80] Briggs AH. A Bayesian approach to stochastic cost-effectiveness analysis. Health Economics. 1999;8(3):257-261

[81] Ortega A, Wagenmakers E-J, Lee MD, Markowitsch HJ, Piefke M. A Bayesian latent group analysis for detecting poor effort in the assessment of malingering. Archives of Clinical Neuropsychology. 2012;27(4):453-465

[82] Stegmueller D. How many countries for multilevel modeling? A comparison of frequentist and Bayesian approaches. American Journal of Political Science. 2013;57(3):748-761

[83] Morey RD. On verbal categories for the interpretation of Bayes factors. 2015. Available from: http://bayesfactor.blogspot.cl/2015/01/on-verbal-categories-for-interpretation.html [Accessed: June 21, 2017]

**Applications of Bayesian Inference in Engineering**


### **Bayesian Inference and Compressed Sensing**

Solomon A. Tesfamicael and Faraz Barzideh

DOI: 10.5772/intechopen.70308

Additional information is available at the end of the chapter

Abstract

This chapter presents the use of Bayesian inference in compressive sensing (CS), a method in signal processing. Among the recovery methods in the CS literature, the convex relaxation methods are reformulated using the Bayesian framework, and this approach is applied to different CS applications such as magnetic resonance imaging (MRI), remote sensing, and wireless communication systems, specifically multiple-input multiple-output (MIMO) systems. The robustness of the Bayesian method in incorporating prior information, such as sparsity and structure among the sparse entries, is shown in this chapter.

Keywords: Bayesian inference, compressive sensing, sparse priors, clustered priors, convex relaxation

### 1. Introduction

In order to estimate parameters in a signal, one can draw on the two schools of thought in statistics: the classical (also called frequentist) and the Bayesian. These approaches are at times in competition, and the basic difference arises from the definition of probability. The frequentist defines P(A) as the long-run relative frequency with which A occurs in identical repeats of an experiment, whereas the Bayesian defines P(A|B) as a real-valued measure of the plausibility of a proposition A, given the truth of the information represented by proposition B. Under Bayesian theory, probability is thus considered an extension of logic [1, 2]. Probabilities represent the investigator's degree of belief—hence they are subjective. This is not acceptable under classical theory, which makes it less flexible. To add to the differences, under classical inference, parameters are not random but fixed, and prior information is absent; under the Bayesian view, parameters are random variables, and prior information is an integral part of the analysis. Since one is free

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


to invent new estimators, confidence intervals, or hypothesis tests, a degree of ad hockery exists, and hence the frequentist approach lacks consistency, whereas Bayesian theory is flexible and consistent [1–9]. Therefore, Bayesian inference is our main focus in this chapter, applied to a particular paradigm in signal processing.

After presenting the theoretical frameworks (Bayesian theory, CS, and convex relaxation methods) in Section 2, the use of Bayesian inference in the CS problem, considering two priors that model sparsity and clusteredness, is shown in Section 3. In Section 4, we present three examples of applications that show the connection between the two theories, Bayesian inference and compressive sensing. Section 5 briefly concludes.

### 2. Theoretical framework

#### 2.1. Bayesian framework

For two random variables A and B, the product rule gives

$$P(A,B) = P(A|B)P(B) = P(B|A)P(A) \tag{1}$$

and the famous Bayes' theorem provides

$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.\tag{2}$$

Using the same framework, consider model Mj and a vector of parameter θ. We infer what the model's parameter θ might be, given the data, D, and a prior information I. Using Bayes' theorem, the probability of the parameters θ given model Mj, data D, and information I is given by

$$P(\boldsymbol{\theta}|D, M_j, I) = \frac{P(D|\boldsymbol{\theta}, M_j, I)\,P(\boldsymbol{\theta}|M_j, I)}{P(D|M_j, I)},\tag{3}$$

where P(θ|D, Mj, I) is the posterior probability, P(θ|Mj, I) is the non-data information about θ, called the prior probability distribution function, while P(D|θ, Mj, I) is the density of the data conditional on the parameters of the model, called the likelihood. P(D|Mj, I) is called the evidence for model Mj, or the normalizing constant, given by:

$$P(D|M_j, I) = \int_{\boldsymbol{\theta}} P(\boldsymbol{\theta}|M_j, I)\,P(D|\boldsymbol{\theta}, M_j, I)\,d\boldsymbol{\theta}.\tag{4}$$

P(θ|D, Mj, I) is the fundamental quantity for the first level of inference, called model fitting: the task of inferring what the model parameters might be given the model and the data. Further, we can do inference at a higher level, comparing models Mj, asking which of a given set of models {M1, M2, ⋯, Mn} is most likely to be the correct one in the light of prior information I and data D. Focusing on the first level of inference, we can ignore the normalizing constant in (3), since it has no relevance at this level of inference about the parameters θ. Hence we get:

$$P(\boldsymbol{\theta}|D, M_j, I) \propto P(D|\boldsymbol{\theta}, M_j, I)\,P(\boldsymbol{\theta}|M_j, I).\tag{5}$$

The posterior probability is proportional to the prior probability times the likelihood. Eq. (5) is called the updating rule [1, 3]: the data allow us to update our prior views about θ, and as a result we get the posterior, which combines both the data and the non-data information about θ. As an example, for a binomial trial with a beta prior, the resulting posterior is again a beta distribution. Figure 1 shows that the posterior density is taller and narrower than the prior density. It therefore strongly favors a smaller range of θ values, reflecting the fact that we now have more information. That is why inference based on the posterior distribution is superior to inference based on the likelihood alone.
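The beta-binomial update just described can be sketched in a few lines: with a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n − k), and for the numbers below its variance is smaller than the prior's, which is exactly the "taller and narrower" behavior shown in Figure 1. All numbers are illustrative.

```python
def beta_update(a, b, k, n):
    """Conjugate update: a Beta(a, b) prior combined with k successes in n
    binomial trials yields a Beta(a + k, b + n - k) posterior."""
    return a + k, b + n - k

def beta_mean_var(a, b):
    # mean and variance of a Beta(a, b) distribution
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

a0, b0 = 2.0, 2.0                            # weak prior centered on θ = 0.5
a1, b1 = beta_update(a0, b0, k=7, n=10)      # observe 7 successes in 10 trials

prior_mean, prior_var = beta_mean_var(a0, b0)
post_mean, post_var = beta_mean_var(a1, b1)

# The posterior mean is pulled toward the data (7/10), and the posterior is
# narrower than the prior: more information, tighter density.
print(prior_mean, post_mean)
print(prior_var > post_var)   # True
```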

We first find the maximum of the posterior distribution, called the maximum a posteriori (MAP) estimate. It defines the most probable value for the parameters, denoted θ̂<sub>MP</sub>. MAP is related to Fisher's method of maximum likelihood estimation (MLE), θ̂<sub>ML</sub>. If f is the sampling distribution of D, then the likelihood function is θ ↦ f(D|θ), and the maximum likelihood estimate of θ is:

$$\widehat{\theta}_{ML}(D) = \arg\max_{\theta} f(D|\theta). \tag{6}$$

Under Bayesian inference, if g is a prior distribution of θ, then the posterior distribution of θ becomes

$$\theta \mapsto \frac{f(D|\theta)\,g(\theta)}{f(D)},\tag{7}$$

and the maximum a posteriori estimation of θ:


$$\begin{split} \widehat{\theta}_{MP} &= \arg\max_{\theta} \frac{f(D|\theta)\,g(\theta)}{\int_{\theta} f(D|\theta)\,g(\theta)\,d\theta} \\ &= \arg\max_{\theta} f(D|\theta)\,g(\theta). \end{split} \tag{8}$$

Inference based on the posterior is not an easy task, since it involves multiple integrals that can be cumbersome to solve. However, the posterior can be computed in several ways: numerical optimization (e.g., the conjugate gradient method or Newton's method), modifications of the expectation-maximization algorithm, and others. As we can see from (6) and (8), the difference between MLE and MAP is the prior distribution. The latter can be considered as a

Figure 1. The updating rule: the posterior synthesizes and compromises by favoring values between the maxima of the prior density and the likelihood. The prior is shifted by the arrival of even a small amount of data.

regularization of the former. We can summarize the posterior distribution by the value of the best-fit parameters θ̂<sub>MP</sub> and error bars (confidence intervals) on the best-fit parameters; error bars can be found from the curvature of the posterior. To proceed further, we replace the random variables D and θ by the vectors y and x, and we assume prior distributions on x in the next section.
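The "prior as regularizer" point can be seen in the simplest conjugate case: a Gaussian likelihood with a zero-mean Gaussian prior on θ, where both (6) and (8) have closed forms and the MAP estimate is the MLE shrunk toward the prior mean. The model and numbers below are illustrative assumptions, not from this chapter.

```python
def mle_mean(data):
    # MLE of θ for y_i ~ N(θ, σ²): the sample mean
    return sum(data) / len(data)

def map_mean(data, sigma2, tau2):
    """MAP of θ for y_i ~ N(θ, σ²) with prior θ ~ N(0, τ²): a precision-weighted
    average of the prior mean (0) and the sample mean, i.e., the MLE shrunk
    toward the prior. Equivalent to L2-regularized estimation of θ."""
    n = len(data)
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
    return w * mle_mean(data)

data = [2.1, 1.7, 2.4, 1.9, 2.2]
theta_ml = mle_mean(data)
theta_map = map_mean(data, sigma2=1.0, tau2=1.0)

# The MAP estimate lies strictly between the prior mean (0) and the MLE;
# as τ² → ∞ (a flat prior), MAP converges to the MLE.
print(theta_ml, theta_map)
```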

#### 2.2. Compressive sensing

Compressive sensing (CS) is a paradigm for capturing information at a lower rate than the Nyquist-Shannon sampling rate when signals are sparse in some domain [10–13]. CS has recently gained a lot of attention due to its exploitation of signal sparsity. Sparsity, an inherent characteristic of many natural signals, enables a signal to be stored in a few samples and subsequently recovered accurately.

As a signal processing scheme, CS follows the familiar pattern of encoding, transmission/storage, and decoding. Focusing on the encoding and decoding of such a system with noisy measurements, the block diagram is given in Figure 2. At the encoding side, CS combines the sampling and compression stages of traditional signal processing into one step by measuring a few samples that contain maximum information about the signal. This measurement/sampling is done by linear projections using random sensing transformations, as shown in the landmark papers by the authors mentioned above. With this in mind, let us define the CS problem formally as follows:

Figure 2. Block diagram for CS-based reconstruction.

#### Definition 1. (The standard CS problem)

Find the k-sparse signal vector x ∈ R<sup>N</sup>, given the measurement vector y ∈ R<sup>M</sup>, the measurement matrix A ∈ R<sup>M×N</sup>, and the under-determined set of linear equations

$$\mathbf{y} = \mathbf{A}\mathbf{x},\tag{9}$$

where k ≪ M ≪ N.


In relation to the standard CS problem, one can ask two questions. First, how should we design the matrix A to ensure that it preserves the information in the signal x? Second, how can we recover the original signal x from the measurements y [14]? To address the first question, note that the solution of the CS problem presented here depends on the design of A. This matrix can be considered as a transformation of the signal from the signal space to the measurement space (Figure 3) [15]. Different criteria have been proposed that A should satisfy for a meaningful reconstruction. One of the main criteria is given in [11], where the authors define a sufficient condition that A should satisfy for the reconstruction of the signal x: the Restricted Isometry Property (RIP), defined below.

Figure 3. Transformation from the signal space to the measurement space.

#### Definition 2. (Restricted Isometry Property)

For all x ∈ R<sup>N</sup> such that ∥x∥<sub>0</sub> ≤ k, if there exists 0 ≤ δ<sub>k</sub> < 1 such that

$$(1 - \delta_k)\,\|\mathbf{x}\|_2^2 \le \|\mathbf{A}\mathbf{x}\|_2^2 \le (1 + \delta_k)\,\|\mathbf{x}\|_2^2 \tag{10}$$

is satisfied, then A fulfills the RIP of order k with radius δ<sub>k</sub>.

An equivalent description of the RIP is that all subsets of k columns taken from A are nearly orthogonal (the columns of A cannot be exactly orthogonal, since we have more columns than rows) [16]. For example, if a matrix A satisfies the RIP of order 2k, then we can interpret (10) as saying that A approximately preserves the distance between any pair of k-sparse vectors. For a random matrix A, the following theorem is one of the results relating the RIP to the noiseless CS problem, provided that the entries of A are drawn from certain distributions given later.
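The near-isometry reading of (10) can be probed numerically. The sketch below is an illustration, not a certification of the RIP (verifying the RIP exactly is combinatorial over all supports): it draws a Gaussian sensing matrix and checks how tightly ∥Ax∥²/∥x∥² concentrates around 1 for random k-sparse vectors. All sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 200, 1000, 10

# Gaussian sensing matrix with entry variance 1/M, so that E||Ax||² = ||x||²
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

ratios = []
for _ in range(200):
    x = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)
    x[support] = rng.normal(size=k)          # random k-sparse test vector
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)

# Empirically, ||Ax||²/||x||² stays close to 1 for sparse x, consistent
# with (10) holding for some moderate radius δ_k.
print(min(ratios), max(ratios))
```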

#### Theorem 1. (Perfect Recovery Condition, Candes and Tao [13])

If A satisfies the RIP of order 2k with radius δ<sub>2k</sub>, then for any k-sparse signal x sensed by y = Ax, x is with high probability perfectly recovered by the ideal program

$$\begin{aligned} \widehat{\mathbf{x}} &= \underset{\mathbf{x}}{\text{arg min}} \quad \|\mathbf{x}\|\_{0} \\ \text{subject to} \quad \mathbf{y} &= \mathbf{A}\mathbf{x} \end{aligned} \tag{11}$$

and it is unique, where ∥x∥<sub>0</sub> ≔ #{i ∈ {1, 2, ⋯, N} | x<sub>i</sub> ≠ 0} = k.

This means that if A satisfies the RIP of order k with radius δ<sub>k</sub>, then for any k′ < k, A satisfies the RIP of order k′ with radius δ<sub>k′</sub> ≤ δ<sub>k</sub>. Note that this theorem is stated for the noiseless CS problem; it is possible to extend it to the noisy CS system. The proofs of these theorems are deferred to the literature [13] for the sake of space.

Under the conventional sensing paradigm, the measurement must have at least the same dimension as the original signal. In CS, however, the measurement vector can be far shorter than the original. At the decoding side, reconstruction is done using nonlinear schemes, so the reconstruction is more cumbersome than the encoding, which was only a projection from a large space onto a smaller one. Moreover, finding a unique solution satisfying the constraint that the signal itself is sparse, or sparse in some domain, is inherently complex. Fortunately, there are many algorithms to solve the CS problem, such as greedy iterative algorithms [17] and iterative thresholding algorithms [18]. This chapter focuses on the convex relaxation methods [12, 13]. The regularizing terms in these methods can be reinterpreted as prior information under Bayesian inference. We consider noisy measurements and apply convex relaxation algorithms for robust reconstruction.

#### 2.3. Convex relaxation methods for CS

Various methods for estimating x may be used. The least squares (LS) estimator, in which no prior information is applied, is

$$
\widehat{\mathbf{x}} = \left(\mathbf{A}^T \mathbf{A}\right)^{-1} \mathbf{A}^T \mathbf{y},
\tag{12}
$$

which performs very badly for the CS estimation problem we are considering. To introduce the methods called convex relaxation, let us first define an important concept.
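A short sketch of why (12) fails here: with M < N the matrix AᵀA is singular, so the inverse in (12) does not exist, and the usual fallback, the minimum-l2-norm solution A⁺y, fits the measurements but spreads energy over nearly all coordinates instead of being sparse (dimensions and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, k = 40, 100, 5

A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                  # noiseless measurements

# With M < N, A^T A is rank-deficient, so (12) cannot be inverted directly;
# the least-squares family instead yields the minimum-l2-norm solution A^+ y.
x_ls = np.linalg.pinv(A) @ y

print("rank(A^T A) =", np.linalg.matrix_rank(A.T @ A), "of", N)
print("non-negligible entries in x_ls:", np.sum(np.abs(x_ls) > 1e-6))
```

The pseudoinverse solution reproduces y exactly yet is dense, which is why a sparsity-promoting prior is needed.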

#### Definition 3. (Unit Ball)

A unit ball in lp-space of dimension N can be defined as

$$\mathcal{B}\_p \equiv \left\{ \mathbf{x} \in \mathbb{R}^N \;:\; \quad \|\mathbf{x}\|\_p \le 1 \right\}. \tag{13}$$

Unit balls corresponding to p = 0, p = 1/2, p = 1, p = 2, and p = ∞, for N = 2, are shown in Figure 4.

The exact solution for the noiseless CS problem is given by

$$\min\_{\mathbf{x}} \|\mathbf{x}\|\_{0}, \quad \text{such that } \mathbf{y} = \mathbf{A}\mathbf{x}.\tag{14}$$

However, minimizing the l0-norm is a non-convex optimization problem which is NP-hard [19]. By relaxing the objective function to a convex one, it is possible to get a good approximation; that is, replacing the l0-norm by the l1-norm yields a tractable problem. Note that other lp-norms could also be used to relax the l0 condition. Keeping our focus on the l1-norm, consider the following minimization problem instead of (14):

$$\min\_{\mathbf{x}} \quad \|\mathbf{x}\|\_{1}, \text{ such that } \mathbf{y} = \mathbf{A}\mathbf{x} \tag{15}$$

The relaxed problem (15) yields the same solution as (14); this equivalence was proved by Donoho and Huo in [20].
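Problem (15) is in fact a linear program: introducing auxiliary variables t with −t ≤ x ≤ t and minimizing Σᵢ tᵢ subject to Ax = y recovers the l1 solution. A sketch using scipy.optimize.linprog; the problem sizes and seed are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
M, N, k = 40, 100, 5

A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

# min sum(t)  s.t.  -t <= x <= t,  A x = y,  with stacked variables z = [x, t]
c = np.concatenate([np.zeros(N), np.ones(N)])
A_ub = np.block([[np.eye(N), -np.eye(N)],      #  x - t <= 0
                 [-np.eye(N), -np.eye(N)]])    # -x - t <= 0
b_ub = np.zeros(2 * N)
A_eq = np.hstack([A, np.zeros((M, N))])
bounds = [(None, None)] * N + [(0, None)] * N  # x free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
x_hat = res.x[:N]
print("recovery error:", np.linalg.norm(x_hat - x_true))
```

With M = 40 measurements of a 5-sparse vector in dimension 100, the LP recovers x exactly up to solver tolerance, illustrating the l0–l1 equivalence regime.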

Figure 4. Different lp-balls in different lp-spaces for N=2, only balls with p≥1 are convex.

#### Theorem 2. (l<sub>0</sub>–l<sub>1</sub> Equivalence [13])

If A satisfies the RIP of order 2k with radius δ<sub>2k</sub> < √2 − 1, then

$$\begin{aligned} \widehat{\mathbf{x}} &= \arg\min\_{\mathbf{x}} \|\mathbf{x}\|\_{1} \\ \text{subject to } \mathbf{y} &= \mathbf{A}\mathbf{x} \end{aligned} \tag{16}$$

is equivalent to (11) and will find the same unique $\widehat{\mathbf{x}}$.

Justified by this theorem, (15) is an optimization problem that can be solved in polynomial time, and the fact that it gives the exact solution of (14) under some circumstances has been one of the main reasons for the recent developments in CS. There is a simple geometric intuition for why such an approach gives good approximations. Among the lp-norms that can be used in constructing CS-related optimization problems, only the convex ones give rise to a convex optimization problem, which is more tractable than its non-convex counterparts; only lp-norms with p ≥ 1 satisfy this condition. On the other hand, lp-norms with p > 1 do not favor sparsity; for example, l2-norm minimization tends to spread the reconstruction across all coordinates even if the true solution is sparse. The l1-norm, in contrast, is able to enforce sparsity: the l1-minimization solution is most likely to occur at corners or edges, not faces [21, 45]. That is why the l1-norm became central to CS. Furthermore, in the CS literature, convex relaxation is presented as either l2-penalized l1-minimization, called Basis Pursuit Denoising (BPDN) [22], or l1-penalized l2-minimization, called the least absolute shrinkage and selection operator (LASSO) [45], which are equivalent and effective in estimating high-dimensional data.

Usually real world systems are contaminated with noise, w, and in this chapter, the focus is on such problems. The noisy recovery problem becomes a simple extension of (15),

$$\min\_{\mathbf{x}} \|\mathbf{x}\|\_{1}, \text{ such that } \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_2 \le \epsilon \tag{17}$$

where ε is a bound on ∥w∥<sub>2</sub>. The real issue for (17) is stability: introducing small changes in the observations should result in small changes in the recovery. We can visualize this using the balls shown in Figure 5.

Both the l0- and l1-norms give exact solutions for the noise-free CS problem and close solutions for the noisy one. The l2-norm, however, gives the worst approximation in both cases compared to the other lp-norms with p < 2 (see Figure 5). Moreover, (17) is equivalent to an unconstrained quadratic programming problem,

$$\min\_{\mathbf{x}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_{2}^{2} + \gamma \|\mathbf{x}\|\_{1},\tag{18}$$

which will be presented later as LASSO, where γ is a tuning parameter. The equivalence of (17) and (18) is shown in [23, 24]. In this chapter, the generalized form of the minimization problem (18) with different lp-norm regularization is considered, that is,

Figure 5. lp-norm approximations: the constraints for the noise-free CS problem is given by the bold line while the shaded region is for the noisy one.

$$\min\_{\mathbf{x}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_{2}^{2} + \gamma \|\mathbf{x}\|\_{p}.\tag{19}$$

Furthermore, this chapter presents the use of the Bayesian framework in compressive sensing by incorporating two different priors, modeling the sparsity and the possible structure among the sparse entries of a signal. It essentially summarizes the recent works [2, 25–27].
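One of the iterative thresholding algorithms mentioned above [18], ISTA (iterative soft-thresholding), solves the unconstrained form (18) by alternating a gradient step on the quadratic term with a soft-threshold step for the l1 term. A minimal sketch; the sizes, noise level, and γ are illustrative choices:

```python
import numpy as np

def ista(A, y, gamma, n_iter=500):
    """Iterative soft-thresholding for  min_x 0.5*||y - Ax||_2^2 + gamma*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - (A.T @ (A @ x - y)) / L    # gradient step on the quadratic term
        x = np.sign(g) * np.maximum(np.abs(g) - gamma / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(3)
M, N, k = 60, 150, 6
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(M)   # noisy measurements

x_hat = ista(A, y, gamma=0.01)
print("estimated support:", sorted(np.flatnonzero(np.abs(x_hat) > 0.05)))
```

The soft-threshold step is exactly where the l1 prior acts: it shrinks small coefficients to zero, producing the sparse estimates that the l2 penalty in (12) cannot.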

### 3. Bayesian inference used in CS problem


Under Bayesian inference, consider two random variables x and y with probability density functions (pdfs) p(x) and p(y), respectively. The product rule gives p(x, y) = p(x|y)p(y) = p(y|x)p(x), and Bayes' theorem provides

$$p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}.\tag{20}$$

Further, the maximum a posteriori (MAP) estimate, $\widehat{\mathbf{x}}\_{\text{MP}}$, is defined as

$$\begin{split} \widehat{\mathbf{x}}\_{\text{MP}} &= \arg\max\_{\mathbf{x}} \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{\int\_{\tilde{\mathbf{x}}} p(\mathbf{y}|\tilde{\mathbf{x}})p(\tilde{\mathbf{x}})d\tilde{\mathbf{x}}} \\ &= \arg\max\_{\mathbf{x}} p(\mathbf{y}|\mathbf{x})p(\mathbf{x}) \end{split} \tag{21}$$

MAP is related to Fisher's method of maximum likelihood estimation (MLE), $\widehat{\mathbf{x}}\_{\text{ML}}$:

$$
\widehat{\mathbf{x}}\_{\text{ML}} = \arg\max\_{\mathbf{x}} p(\mathbf{y}|\mathbf{x}).\tag{22}
$$

As we can see from (21) and (22), the difference between MAP and MLE is the prior distribution; the former can be considered a regularized form of the latter. Since we apply Bayesian inference, we further assume different prior distributions on x.

#### 3.1. Sparse prior

The estimators of x resulting from (19) for the sparse problem considered in this chapter can be presented as maximum a posteriori (MAP) estimators under the Bayesian framework as in [28]. We show this by defining a prior probability distribution for x of the form

$$p(\mathbf{x}) = \frac{e^{-u f(\mathbf{x})}}{\int\_{\mathbf{x} \in \mathbb{R}^N} e^{-u f(\mathbf{x})} d\mathbf{x}} \tag{23}$$

where the regularizing function f : χ → ℝ is some scalar-valued, non-negative function with χ ⊆ ℝ, which can be extended to a vector argument by

$$f(\mathbf{x}) = \sum\_{i=1}^{N} f(\mathbf{x}\_i),\tag{24}$$

such that, for sufficiently large u,

$$\int\_{\mathbf{x} \in \mathbb{R}^N} \exp(-u f(\mathbf{x}))d\mathbf{x}$$

is finite. Furthermore, let the assumed variance of the noise be given by σ<sup>2</sup> = λ/u, where λ is the system parameter, which can be taken as λ = σ<sup>2</sup>u.

Since the pdf of the noise w is Gaussian, the likelihood function of y given x is

$$p\_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = \frac{1}{(2\pi\sigma)^{N/2}} e^{-\frac{1}{2\sigma^2} \|\mathbf{y} - A\mathbf{x}\|\_2^2}. \tag{25}$$

Together with (20) and (23), this now gives

$$p\_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y};\mathbf{A}) = \frac{e^{-u\left(\frac{1}{2\lambda}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|\_{2}^{2} + f(\mathbf{x})\right)}}{(2\pi\sigma)^{N/2} \int\_{\mathbf{x}\in\mathbb{R}^{N}} e^{-u\left(\frac{1}{2\lambda}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|\_{2}^{2} + f(\mathbf{x})\right)} d\mathbf{x}}.$$

The MAP estimator, (21), becomes

$$\widehat{\mathbf{x}}\_{\text{MP}} = \underset{\mathbf{x} \in \mathbb{R}^N}{\text{arg min}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_2^2 + \lambda f(\mathbf{x}). \tag{26}$$

Now, as we choose different regularizing functions, we get different estimators, as listed below [28]:

1. Linear estimator: when f(x) = ∥x∥<sub>2</sub><sup>2</sup>, (26) reduces to

$$
\widehat{\mathbf{x}}\_{\text{Linear}} = \mathbf{A}^T \left(\mathbf{A}\mathbf{A}^T + \lambda\mathbf{I}\right)^{-1} \mathbf{y},
\tag{27}
$$

which is the LMMSE estimator. We ignore this estimator in our analysis, however, since its results are not sparse. The following two estimators are more interesting for CS problems, since they enforce sparsity in the vector x.

2. LASSO estimator: when f(x) = ∥x∥<sub>1</sub>, we get the LASSO estimator and (26) becomes

$$\widehat{\mathbf{x}}\_{\text{LASSO}} = \underset{\mathbf{x} \in \mathbb{R}^N}{\text{arg min}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_2^2 + \lambda \|\mathbf{x}\|\_1. \tag{28}$$

3. Zero-norm regularization estimator: when f(x) = ∥x∥<sub>0</sub>, we get the zero-norm regularization estimator and (26) becomes

$$\widehat{\mathbf{x}}\_{\text{Zero-Norm}} = \underset{\mathbf{x}\in\mathbb{R}^{N}}{\text{arg min}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_{2}^{2} + \lambda\|\mathbf{x}\|\_{0}.\tag{29}$$

As mentioned earlier, (29) is the best formulation for estimating the sparse vector x, but it is NP-complete. The worst approximation for the sparse problem considered is the l2-regularized solution given by (27), while the best approximation is given by Eq. (28) and its equivalent forms. In our simulations, we have used some algorithms from the literature that are considered equivalent to this approximation, such as Bayesian compressive sensing (BCS) [29] and l1-norm regularized least squares (L1-LS) [11–13].
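A small illustration of why the linear estimator (27) is set aside: its closed-form solution fits the data but is essentially never sparse. The dimensions, seed, and λ below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, k, lam = 50, 120, 5, 0.1

A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(M)

# Linear (LMMSE-type) estimator (27): closed form, but the l2 penalty only
# shrinks coefficients toward zero; it never sets them exactly to zero.
x_lin = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(M), y)

print("true non-zeros:    ", k)
print("non-zeros in x_lin:", np.sum(np.abs(x_lin) > 1e-6))
```

Virtually every coordinate of x_lin is non-zero, whereas the LASSO estimator (28) drives most of them exactly to zero.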

#### 3.2. Clustering prior

The entries of the sparse vector x may have some special structure (clusteredness) among themselves. This can be modeled by modifying the previous prior distribution.<sup>1</sup> To represent clusteredness in the data, we define the clustering via the distance between consecutive entries of the sparse vector x,

$$D \equiv \sum\_{i=1}^{N} |\mathbf{x}\_i - \mathbf{x}\_{i-1}|,$$

and we penalize it with a regularizing parameter γ. Hence, we define the clustering prior as

$$q(\mathbf{x}) = \frac{e^{-\gamma D(\mathbf{x})}}{\int\_{\mathbf{x} \in \mathbb{R}^N} e^{-\gamma D(\mathbf{x})} d\mathbf{x}}.\tag{30}$$

The new posterior involving this prior under the Bayesian framework is proportional to the product of the three pdfs:

<sup>1</sup> In [30], a hierarchical Bayesian generative model for sparse signals is found, in which full Bayesian analysis is applied by assuming prior distributions on each parameter appearing in the analysis. We follow a different approach.

$$p(\mathbf{x}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{x})p(\mathbf{x})q(\mathbf{x}).\tag{31}$$

By similar arguments as used in Section 3.1, we arrive at the clustered LASSO estimator

$$\widehat{\mathbf{x}}\_{\text{Clu-LASSO}} = \underset{\mathbf{x} \in \mathbb{R}^N}{\text{arg min}} \frac{1}{2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_2^2 + \lambda \|\mathbf{x}\|\_1 + \gamma \sum\_{i=1}^N |\mathbf{x}\_i - \mathbf{x}\_{i-1}|.\tag{32}$$

Here, λ and γ are our tuning parameters for the sparsity of x and for the way its entries are clustered, respectively.
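A minimal sketch of solving (32) with plain subgradient descent; this is transparent but slow, not a production solver, and the cluster location, dimensions, and tuning values below are illustrative choices:

```python
import numpy as np

def clustered_lasso(A, y, lam, gam, n_iter=4000, step=1.0):
    """Subgradient descent for (32): 0.5*||y-Ax||^2 + lam*||x||_1 + gam*TV(x)."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for t in range(1, n_iter + 1):
        d = np.diff(x)                       # successive differences x_i - x_{i-1}
        tv_sub = np.zeros_like(x)            # subgradient of the clustering term
        tv_sub[1:] += np.sign(d)
        tv_sub[:-1] -= np.sign(d)
        g = A.T @ (A @ x - y) + lam * np.sign(x) + gam * tv_sub
        x = x - (step / (L * np.sqrt(t))) * g   # diminishing step size
    return x

rng = np.random.default_rng(5)
M, N = 60, 150
x_true = np.zeros(N)
x_true[40:46] = 1.0                          # one sparse *cluster* of non-zeros
A = rng.standard_normal((M, N)) / np.sqrt(M)
y = A @ x_true + 0.01 * rng.standard_normal(M)

x_hat = clustered_lasso(A, y, lam=0.02, gam=0.02)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

The γ-term pulls neighboring entries toward each other, so contiguous blocks of non-zeros, like the one at indices 40–45, are favored over isolated spikes.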

### 4. Bayesian inference in CS applications

The compressed sensing paradigm has been applied to many signal processing areas [31–41]. At present, however, hardware that can translate CS theory into practical use is very limited. Nonetheless, the demand for cheaper, faster, and more efficient devices will motivate the use of the CS paradigm in real-time systems in the near future.

So far, in image processing, one can mention single-pixel imaging via compressive sampling [31], magnetic resonance imaging (MRI) with reduced scan time and improved image quality [32], seismic imaging [33], and radar systems with simplified hardware design and high resolution [34, 35]. In communications and networking, CS theory has been studied for sparse channel estimation [36], for underwater acoustic channels, which are inherently sparse [37], for spectrum sensing in cognitive radio networks [38], for large wireless sensor networks (WSNs) [39], as a channel coding scheme [40], for localization [41], and so on. A good review of CS applications is provided in [21], which essentially summarizes the bulk of the literature collected at http://dsp.rice.edu/cs.

This chapter gives examples of CS applications using Bayesian inference in imaging, namely magnetic resonance imaging (MRI), in communications, namely multiple-input multiple-output (MIMO) systems, and in remote sensing. First, let us see the impact of the estimators derived above, LASSO and clustered LASSO, on MRI.

#### 4.1. Magnetic resonance imaging (MRI)

MRI signals are usually very weak due to the presence of noise and the weak nature of the signal itself. The compressed sensing paradigm can be applied to boost the recovery of such signals. We applied CS via the Bayesian framework, that is, incorporating different prior information, such as sparsity and the special structure that can be found in such signals; the resulting sparse signal recovery method is applied to different MRI images.

#### 4.1.1. Angiogram image

Angiogram images are already sparse in the pixel representation. An angiogram image taken from the University Hospital Rechts der Isar, Munich, Germany [42] is used for our analysis.

Figure 6. Comparison of reconstruction schemes together with performance comparison using mean square error (MSE) in dB: (a) original image x; (b) LMMSE (35.1988 dB); (c) LASSO (53.6195 dB); and (d) clustered Lasso (63.6889 dB).

The image we took is sparse and clustered even in the pixel domain. The original signal after vectorization is x of length N = 960. Taking 746 measurements and a maximum number of non-zero elements k = 373, we applied the different reconstruction schemes; the results are shown in Figure 6.

#### 4.1.2. Phantom image

Another MRI image considered is the Shepp-Logan phantom, which is not sparse in the spatial domain. We therefore sparsified it in K-space by zeroing out small coefficients, and then measured the sparsified image and added noise. The original signal after vectorization is x of length N = 200. Taking M = 94 measurements, that is, y of length 94, and a maximum number of non-zero elements k = 47, we applied the different reconstruction algorithms used above. The results show that the clustered LASSO performs well compared to the others, as can be seen in Figure 7.
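The sparsification step can be sketched as follows: transform to K-space, keep only the largest coefficients, and transform back. The synthetic stand-in image and the 10% retention level below are illustrative assumptions, not the actual phantom data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for a phantom slice: weak background plus a bright structured region
img = rng.standard_normal((64, 64)) * 0.05
img[20:40, 20:40] += 1.0

X = np.fft.fft2(img)                          # K-space (Fourier) representation
keep = 0.1                                    # keep the largest 10% of coefficients
thresh = np.quantile(np.abs(X), 1 - keep)
X_sparse = np.where(np.abs(X) >= thresh, X, 0)

img_sparse = np.real(np.fft.ifft2(X_sparse))  # sparsified image
err = np.linalg.norm(img_sparse - img) / np.linalg.norm(img)
print(f"kept {np.mean(X_sparse != 0):.0%} of K-space, relative error {err:.3f}")
```

Because the structured content concentrates its energy in a few K-space coefficients, discarding 90% of them loses mostly background noise, which is what makes the sparsified image a valid CS target.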

#### 4.1.3. fMRI image

Another example of applying the clustered LASSO image reconstruction under the Bayesian framework to medical images is functional magnetic resonance imaging (fMRI), a noninvasive brain-mapping technique that is crucial in the study of brain activity. Taking many slices of fMRI data, we observed that these data sets are sparse in the Fourier domain, as shown in Figure 8.

Figure 7. Comparison of reconstruction schemes together with performance comparison using mean square error (MSE) in dB: (a) original image x; (b) sparsified image; (c) least squares (LS) (21.3304 dB); (d) LMMSE (27.387 dB); (e) LASSO (37.9978 dB); and (f) clustered LASSO (40.0068 dB).

Figure 8. The five columns show the real and imaginary parts of the Fourier transform representation of the data set we have chosen to present, showing that the fMRI image has a sparse and clustered representation.

We observed the whole data in this domain for the whole brain image. All slices share the characteristics on which our analysis is based, i.e., sparsity and clusteredness. We then took some consecutive slices and chose different N, k, and M = 2k on these slices. The numbers at the top of Figure 9 give k and N, respectively.

In fMRI, results are compared using image intensity, which gives a health practitioner good grounds for observing and deciding according to the available information. The more prior knowledge one has about how brain regions work in humans or pets, the better the priors one can incorporate to analyze the data, making this an interesting tool for future research.

#### 4.2. MIMO systems


Multiple-input multiple-output (MIMO) systems are integrated into modern wireless communications due to their advantages with respect to many performance metrics. One such advantage is the ability to transmit multiple streams using spatial multiplexing, but channel state information (CSI) at the transmitter is needed to obtain optimal system performance.

Consider a frequency division duplex (FDD) MIMO system consisting of Nt transmit and Nr receive antennas. Assume that the channel is a flat-fading, temporally correlated channel denoted by a matrix H[n] ∈ ℂ<sup>Nr×Nt</sup>, where n indicates a channel feedback time index, with block fading assumed during the feedback interval. The singular value decomposition (SVD) of H[n] gives

$$\mathbf{H}[n] = \mathbf{U}[n]\boldsymbol{\Sigma}\left[n\right]\mathbf{V}^H[n],$$

where U ∈ ℂ<sup>Nr×r</sup> and V ∈ ℂ<sup>Nt×r</sup> are unitary matrices and Σ ∈ ℂ<sup>r×r</sup> is a diagonal matrix consisting of the r = min(Nt, Nr) singular values. In the presence of perfect channel state information (CSI), a MIMO system model can be given by the equation

$$\tilde{\mathbf{y}} = \mathbf{U}^{H}[n]\mathbf{H}[n]\mathbf{V}[n]\tilde{\mathbf{x}} + \mathbf{U}^{H}[n]\mathbf{n} \tag{33}$$

where x̃ ∈ ℂ<sup>r×1</sup> is the transmitted vector, V[n] is used as the precoder at the transmitter, U<sup>H</sup>[n] is used as the decoder at the receiver, n ∈ ℂ<sup>Nr×1</sup> denotes a noise vector whose entries are i.i.d. according to CN(0, 1), and ỹ ∈ ℂ<sup>Nr×1</sup> is the received vector.
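The diagonalization behind (33) is easy to verify numerically: with precoder V[n] and decoder U<sup>H</sup>[n], the effective channel U<sup>H</sup>HV reduces to the diagonal matrix Σ, decoupling the channel into r parallel streams. The antenna counts and seed below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
Nt, Nr = 4, 2                                  # transmit / receive antennas
r = min(Nt, Nr)

# Rayleigh-fading channel: i.i.d. complex Gaussian entries
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
U, s, Vh = np.linalg.svd(H, full_matrices=False)   # H = U diag(s) V^H
V = Vh.conj().T

# With precoder V and decoder U^H, the effective channel U^H H V is diagonal,
# i.e., the MIMO channel decouples into r parallel scalar streams.
H_eff = U.conj().T @ H @ V
print(np.round(np.abs(H_eff), 4))
```

Each stream then sees a scalar gain equal to one singular value, which is why feeding back the singular vectors (the topic of the CS scheme below) suffices for precoding.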

Channel-adaptive transmission requires knowledge of the channel state information at the transmitter. In temporally correlated MIMO channels, the correlation can be utilized to reduce feedback overhead and improve performance. CS methods and rotative quantization are used to compress and feed back the CSI in MIMO systems [43], as an extension of the work in [44]. Simulations show that the CS-based method reduces the feedback overhead while delivering the same performance as the direct quantization scheme.

Three methods are compared in the simulations: perfect CSI, without CS, and with CS, using matched filter (MF) and minimum mean square error (MMSE) receivers for total feedback bits B = 10 and B = 5. In Figure 10, sum rates are compared against the signal-to-noise ratio (SNR); using CS, half the number of bits can be saved. In Figure 11, where the bit error rate is plotted against the SNR, the CS method has better bit error rate performance using the same number of bits for the CS and without-CS cases. These two figures demonstrate the clear advantage of using CS in the feedback of singular vectors in the rotative-based method; the details are deferred to [43].

#### 4.3. Remote sensing

Remote sensing satellites provide a repetitive and consistent view of the Earth, and they offer a wide range of spatial, spectral, radiometric, and temporal resolutions. Image fusion is applied to extract all the important features from various input images; these are integrated to form a fused image that is more informative and suitable for human visual perception or computer processing. Sparse representation has been applied to image fusion to improve the quality of the fused image [45].

Figure 9. Application of the sparse and cluster priors, LASSO and clustered LASSO (CL. LASSO), to an fMRI data analysis for N = 80, k > 50, σ<sup>2</sup> = 0.1 and λ = 0.1, where LMMSE denotes the L2-regularized estimator.


Figure 10. Sum rate vs. SNR for a 2×2 MIMO system with and without CS with two streams. We can observe that the performance of the CS method is almost equal to that of the method without CS while saving half the number of bits.

Figure 11. Bit error rate vs. SNR using a matched filter receiver for a 2×2 MIMO system with one stream.

Figure 12. Comparison of image fusion methods for remote sensing applications using Brovey, DWT, PCA, FDCT, and the sparse representation methods [46].

To improve the quality of the fused image, a remote sensing image fusion method based on sparse representation is proposed in [46]. In this method, the source images are first represented with sparse coefficients. Then, the larger sparse coefficients of the panchromatic (Pan) image are set to 0. Thereafter, the coefficients of the panchromatic (Pan) and multispectral (MS) images are combined with a linear weighted averaging fusion rule. Finally, the fused image is reconstructed from the combined sparse coefficients and the dictionary. The proposed method is compared with the intensity-hue-saturation (IHS), Brovey transform (Brovey), discrete wavelet transform (DWT), principal component analysis (PCA) and fast discrete curvelet transform (FDCT) methods on several pairs of multifocus images. As Figure 12 shows, the sparse representation method outperforms the conventional methods listed here. We believe that our method of clustered compressed sensing can further improve this result.
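
The weighted-averaging fusion rule can be sketched as follows, using an orthonormal DCT basis as a stand-in for the learned dictionary of [46] and omitting the coefficient-suppression step; the weight w and patch size are illustrative assumptions.

```python
import numpy as np

def dct_dictionary(n):
    """Orthonormal DCT-II basis, a simple stand-in for a learned dictionary."""
    D = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
    return D / np.linalg.norm(D, axis=0)

n = 64
D = dct_dictionary(n)
rng = np.random.default_rng(0)
pan = rng.standard_normal(n)     # panchromatic patch (illustrative stand-in)
ms = rng.standard_normal(n)      # multispectral patch (illustrative stand-in)

# 1. Represent each source with coefficients over the dictionary.
c_pan, c_ms = D.T @ pan, D.T @ ms
# 2. Combine coefficients with a linear weighted averaging fusion rule.
w = 0.6                          # hypothetical weight
c_fused = w * c_pan + (1 - w) * c_ms
# 3. Reconstruct the fused patch from the combined coefficients.
fused = D @ c_fused
```

Because the stand-in dictionary is orthonormal, fusing identical patches returns the patch unchanged; with a learned, overcomplete dictionary, the sparse coding step replaces the simple analysis transform in step 1.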

### 5. Conclusions

In this chapter, a Bayesian way of analyzing data in the CS paradigm is presented. The method assumes prior information, such as the sparsity and clusteredness of signals, in the analysis of the data. Among the different reconstruction methods, the convex relaxation methods are redefined using Bayesian inference. Further, three CS applications are presented: MRI imaging, MIMO systems, and remote sensing. For MRI imaging, the two different priors are incorporated, while for MIMO systems and remote sensing, only the sparse prior is applied. We suggest that the special structure among the sparse elements of the data be included in the analysis to further improve the results.

### Author details

Solomon A. Tesfamicael<sup>1</sup>\* and Faraz Barzideh<sup>2</sup>

\*Address all correspondence to: solomon.a.tesfamicael@ntnu.no

1 Department of Education, Norwegian University of Science and Technology (NTNU), Trondheim, Norway

2 Department of Electrical Engineering and Computer Science, University of Stavanger (UiS), Stavanger, Norway

### References



[12] Candès E, Romberg J, Tao T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory. February 2006;52(2):489-509

[13] Candès EJ, Tao T. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory. December 2006;52:5406-5425

[14] Eldar YC, Kutyniok G. Compressed Sensing: Theory and Applications. Cambridge University Press; 2012

[15] Eldar YC, Kutyniok G. Compressed Sensing: Algorithms and Applications. KTH, Communication Theory, ACCESS Linnaeus Centre; 2012

[16] Candès EJ. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique. 2008

[17] Guan X, Gao Y, Chang J, Zhang Z. Advances in theory of compressive sensing and applications in communication. 2011 First International Conference on Instrumentation, Measurement, Computer, Communication and Control; 2011

[18] Blumensath T, Davies ME. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis. 2009

[19] Natarajan BK. Sparse approximate solutions to linear systems. SIAM Journal on Computing. 1995

[20] Donoho DL, Huo X. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory. January 2001;47:2845-2862

[21] Qaisar S, Bilal RM, Iqbal W, Naureen M, Lee S. Compressive sensing: From theory to applications, a survey. Journal of Communications and Networks. October 2013;15(5):443-456

[22] Figueiredo MAT, Nowak RD, Wright SJ. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing. 2007;1(4):586-597

[23] Schniter P, Potter LC, Ziniel J. Subspace pursuit for compressive sensing signal reconstruction. Information Theory and Applications Workshop. February 2008. pp. 326-333

[24] Teixeira FCA, Bergen SWA, Antoniou A. Robust signal recovery approach for compressive sensing using unconstrained optimization. Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS). May 2010. pp. 3521-3524

[25] Tesfamicael SA, Barzideh F. Clustered compressed sensing via Bayesian framework. IEEE UKSim-AMSS 17th International Conference on Computer Modelling and Simulation (UKSim 2015), Image, Speech and Signal Processing; Cambridge, United Kingdom. 2015. pp. 25-27

[26] Tesfamicael SA, Barzideh F. Clustered compressive sensing: Application on medical imaging. International Journal of Information and Electronics Engineering. 2015;5(1):48-50

[27] Tesfamicael SA. Compressive sensing in signal processing: Performance analysis and applications [doctoral thesis]. NTNU; 2016. p. 182


[42] Image. Depiction of Vessel Diseases with a Wide Range of Contrast and Non-Contrast Enhanced Techniques. Munich, Germany: University Hospital Rechts der Isar; 2014

[43] Tesfamicael SA, Lundheim L. Compressed sensing based rotative quantization in temporally correlated MIMO channels. Recent Developments in Signal Processing; 2013

[44] Godana SBE, Ekman T. Rotative quantization using adaptive range for temporally correlated MIMO channels. 2013 IEEE 24th International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC). 2013. pp. 1233-1238

[45] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58(1):267-288

[46] Yu X, Gao G, Xu J, Wang G. Remote sensing image fusion based on sparse representation. 2014 IEEE Geoscience and Remote Sensing Symposium

### **Sparsity in Bayesian Signal Estimation**

Ishan Wickramasingha, Michael Sobhy and Sherif S. Sherif

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70529

#### Abstract

In this chapter, we describe different methods to estimate an unknown signal from its linear measurements. We focus on the underdetermined case where the number of measurements is less than the dimension of the unknown signal. We introduce the concept of signal sparsity and describe how it could be used as prior information for either regularized least squares or Bayesian signal estimation. We discuss compressed sensing and sparse signal representation as examples where these sparse signal estimation methods could be applied.

Keywords: inverse problems, signal estimation, regularization, Bayesian methods, signal sparsity

### 1. Introduction

In engineering and science, a system typically refers to a physical process whose outputs are generated due to some inputs [1, 2]. Examples of systems include measuring instruments, imaging devices, mechanical and biomedical devices, chemical reactors and others. A system could be abstracted as a block diagram,

where x and y represent the inputs and outputs of the system, respectively. The block, A, formalizes the relation between these inputs and the outputs using mathematical equations [2, 3]. Depending on the nature of the system, the relation between its inputs and outputs

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


could be either linear or nonlinear. For a linear relation, the system is called a linear system and it would be represented by a set of linear equations [3, 4]

$$y = A\mathbf{x}.\tag{1}$$

In this chapter, we will restrict our attention to linear systems, as they could adequately represent many actual systems in a mathematically tractable way.

When dealing with systems, two typical types of problems arise, forward and inverse problems.

### 1.1. Forward problems

In a forward problem, one would be interested in obtaining the output of a system due to a particular input [5, 6]. For linear systems, this output is the result of a simple matrix-vector product, Ax. Forward problems usually become more difficult as the number of equations increases or as uncertainties about the inputs, or the behavior of the system, are present [6].

### 1.2. Inverse problems

In an inverse problem, one would be interested in inferring the inputs to a system x that resulted in observed outputs, i.e., measured y [5, 6]. Another formulation of an inverse problem is to identify the behavior of the system, i.e., construct A, from knowledge of different input and output values. This problem formulation is known as system identification [1, 7, 8]. In this chapter, we will only consider the input inference problem. The nature of the input x to be inferred further leads to two broad categories of this problem: estimation, and classification. In input estimation, the input could assume an infinite number of possible values [4, 9], while in input classification the input could assume only a finite number (usually small) of possible values [4, 9]. Accordingly, in input classification, one would like to only assign an input to a predetermined signal class. In this chapter, we will only focus on estimation problems, particularly on restoring an input signal x from noisy data y that is obtained using a linear measuring system represented by a matrix A.

### 2. Signal restoration as example of an inverse problem

To solve the above signal restoration problem, we need to estimate the input signal x through the inversion of matrix A. This could be a hard problem because in many cases the inverse of A might not exist, or the measurement data, y, might be corrupted by noise. The existence of the inverse of A depends on the number of acquired independent measurements relative to the dimension of the unknown signal [5, 10]. The conditions for the existence of a stable solution of any inverse problem, i.e., for an inverse problem to be well-posed, have been stated by Hadamard as:

• Existence: for a measured output y there exists at least one corresponding input x.

• Uniqueness: for a measured output y there exists only one corresponding input x.

• Continuity: as the input x changes slightly, the output y changes slightly, i.e., the relation between x and y is continuous.

These conditions could be applied to linear systems as conditions on the matrix A. Let A ∈ R<sup>n×m</sup>, where R<sup>n×m</sup> denotes the set of n × m matrices with real-valued elements. The matrix equation y<sub>n×1</sub> = A<sub>n×m</sub> x<sub>m×1</sub> is equivalent to n linear equations with m unknowns. The matrix A is a linear transformation that maps input signals from its domain D(A) = R<sup>m</sup> to its range R(A) ⊆ R<sup>n</sup> [4, 5, 10]. For any measured output signal y ∈ R<sup>n</sup>, we could identify three cases based on the values of n and m.

#### 2.1. Underdetermined linear systems

In this case, n < m, i.e., the number of equations is less than the number of unknowns,

$$A = \begin{bmatrix} a\_{11} & a\_{12} & \cdots & a\_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ a\_{n1} & a\_{n2} & \cdots & a\_{nm} \end{bmatrix} . \tag{2}$$

If these equations are consistent, Hadamard's Existence condition will be satisfied. However, Hadamard's Uniqueness condition is not satisfied because Null Space(A) ≠ {0}, i.e., there exists z ≠ 0 ∈ Null Space(A) such that,

$$A(\mathbf{x} + \mathbf{z}) = \mathbf{y}.\tag{3}$$

This linear system is called under-determined because its equations, i.e., system constraints, are not enough to uniquely determine x [4, 5]. Thus, the inverse of A does not exist.

#### 2.2. Overdetermined linear systems

In this case, n > m, i.e., the number of equations is more than the number of unknowns,

$$A = \begin{bmatrix} a\_{11} & \cdots & a\_{1m} \\ a\_{21} & \cdots & a\_{2m} \\ \vdots & \ddots & \vdots \\ a\_{n1} & \cdots & a\_{nm} \end{bmatrix} \tag{4}$$

If these equations are inconsistent, Hadamard's Existence condition will not be satisfied. However, Hadamard's Uniqueness condition will be satisfied if A has full rank. In this case, Null Space(A) = {0}, i.e.,

$$A(\mathbf{x} + \mathbf{0}) = A\mathbf{x} = \mathbf{y}.\tag{5}$$

This linear system is called over-determined because its equations, i.e., system constraints, are typically too many for an exact solution x to exist [4, 5]. Here too, the inverse of A does not exist.

#### 2.3. Square linear systems

In this case, m = n, i.e., the number of equations is equal to the number of unknowns,

$$A = \begin{bmatrix} a\_{11} & \cdots & a\_{1n} \\ \vdots & \ddots & \vdots \\ a\_{n1} & \cdots & a\_{nn} \end{bmatrix}. \tag{6}$$

If A has full rank, its Null Space(A)={0} and both Hadamard's Existence and Uniqueness conditions will be satisfied. In addition, if A has a small condition number, the relation between x,y will be continuous, and Hadamard's Continuity condition will be satisfied [4, 5, 10]. In this case, the inverse problem formulated by this system of linear equations is well-posed.
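
These three cases can be checked numerically; the dimensions below are arbitrary, and a random Gaussian matrix is (almost surely) full rank:

```python
import numpy as np

rng = np.random.default_rng(1)

A_under = rng.standard_normal((3, 5))    # n < m: underdetermined
A_over = rng.standard_normal((5, 3))     # n > m: overdetermined
A_square = rng.standard_normal((4, 4))   # n = m: square

# Underdetermined: full row rank, so the null space has dimension m - n = 2
# and Hadamard's Uniqueness condition fails.
s = np.linalg.svd(A_under, compute_uv=False)
null_dim = A_under.shape[1] - np.sum(s > 1e-10)
print(null_dim)                          # 2

# Overdetermined: full column rank, but a generic y lies outside the range,
# so Hadamard's Existence condition fails.
print(np.linalg.matrix_rank(A_over))     # 3

# Square, full rank: well-posed; stability is governed by the condition number.
print(np.linalg.matrix_rank(A_square), np.linalg.cond(A_square))
```

A large condition number for the square case would signal a violation of the Continuity condition in practice, even though the inverse formally exists.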

### 3. Methods for signal estimation

In this section, we will focus on the estimation of an input signal x from a noisy measurement y of the output of a linear system A.

The linear system shown in Figure 1, could be modeled as,

$$
\mathfrak{y} = A\mathfrak{x} + \mathfrak{v}.\tag{7}
$$

where v is additive Gaussian noise. As a consequence of the Central Limit Theorem, this assumption of Gaussian distributed noise is valid for many output measurement setups.

Statistical Estimation Theory allows one to obtain an estimate x̂ of a signal x that is input to a known system A from a measurement y (see Figure 2) [11, 12]. However, this estimate x̂ is not unique, as it depends on the choice of estimator among the different ones available. In addition to the measurement y, if other information about the input signal is available, it could be

Figure 1. Linear system with noisy output measurement.

Figure 2. Signal estimation using prior information.

used as prior information to constrain the estimator to produce a better estimate of x. Signal estimation for overdetermined systems could be achieved without any prior information about the input signal. However, for underdetermined systems, prior information is necessary to ensure a unique estimate.

#### 3.1. Least squares estimation


If there is no information available about the statistics of the measured data,

$$
\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{v},
\tag{8}
$$

least squares estimation could be used. The least squares estimate is obtained by minimizing the square of the L<sup>2</sup> norm of the error between the measurement and the linear model, v = y − Ax. It is given by

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \,\|\mathfrak{y} - \mathbf{A}\mathfrak{x}\|\_2^2. \tag{9}
$$

The L<sup>2</sup> norm is a special case of the p-norm of a vector, ∥x∥<sub>p</sub> = (∑<sub>i=1</sub><sup>m</sup> |x<sub>i</sub>|<sup>p</sup>)<sup>1/p</sup>, with p = 2. In Eq. (9), the unknown x is considered deterministic, so its statistics are not required. The noise v in this formulation is implicitly assumed to be white noise with variance σ<sup>2</sup> [13, 14]. Least squares estimation is typically used to estimate input signals x in overdetermined problems. Since x̂ is unique in this case, no prior information, i.e., additional constraints on x, is necessary.
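
As a concrete sketch of Eq. (9) (the dimensions and noise level are illustrative), the least squares estimate for an overdetermined system can be computed with a standard solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3                                  # overdetermined: n > m
A = rng.standard_normal((n, m))
x_true = np.array([1.0, -2.0, 0.5])
v = 0.01 * rng.standard_normal(n)              # white noise, sigma = 0.01
y = A @ x_true + v

# Least squares estimate: arg min_x ||y - A x||_2^2.
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(x_hat)                                   # close to x_true
```

With many more measurements than unknowns and small white noise, the estimate lands very close to the true input without any prior information.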

#### 3.2. Weighted least squares estimation

If the noise v in Eq. (8) is not necessarily white and its second order statistics, i.e., mean and covariance matrix, are known, then weighted least squares estimation could be used to further improve the least squares estimate. In this estimation method, measurement errors are not weighted equally; a weighting matrix C explicitly specifies the weights. The weighted least squares estimate is given by

$$\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min }} \|\mathbf{C}^{-1/2}(\mathfrak{y} - \mathbf{A}\mathfrak{x})\|\_{2}^{2}. \tag{10}$$

We note that the least squares problem, Eq. (9), is a special case of the weighted least squares problem, Eq. (10), when C = σ<sup>2</sup> I.
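
A minimal sketch of Eq. (10) with a known diagonal noise covariance (the sizes and noise levels are illustrative): multiplying both y and A by C<sup>−1/2</sup> whitens the noise, after which ordinary least squares applies.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 2
A = rng.standard_normal((n, m))
x_true = np.array([0.7, -1.3])

sig = rng.uniform(0.05, 1.0, n)                # per-measurement noise std
y = A @ x_true + sig * rng.standard_normal(n)  # noise covariance C = diag(sig**2)

# Whitening: C^{-1/2} is diag(1/sig) for diagonal C; then solve ordinary LS.
W = 1.0 / sig
x_wls, *_ = np.linalg.lstsq(W[:, None] * A, W * y, rcond=None)
```

Measurements with larger sig are down-weighted; for C = σ<sup>2</sup>I the weights are constant and the estimate reduces to the ordinary least squares solution, as noted above.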

#### 3.3. Regularized least squares estimation

In underdetermined problems, the introduction of additional constraints on x, also known as regularization, could ensure the uniqueness of the obtained solution. Standard least squares estimation could be extended, through regularization, to solve underdetermined estimation problems. The regularized least squares estimate is given by

$$\widehat{\mathfrak{X}} = \underset{\mathbf{x}}{\text{arg min}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|\_{2}^{2} + \lambda \|L\mathbf{x}\|\_{2},\tag{11}$$

where L is a matrix specifying the additional constraints and λ is a regularization parameter whose value determines the relative weights of the two terms in the objective function. If the stacked matrix [A; L] has full rank, the regularized least squares estimate x̂ is unique [4]. In this optimization problem, the unknown x is once again considered deterministic, so its statistics are not required. It is worthwhile noting that while regularization is necessary to solve underdetermined inverse problems, it could also be used to improve the numerical properties, e.g., condition number, of either overdetermined or square linear inverse problems.
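
A sketch of this idea with L = I and the commonly used squared penalty λ∥Lx∥₂² (the dimensions and λ are illustrative assumptions): the problem is solved as ordinary least squares on the stacked matrix [A; √λ L], whose full column rank makes the estimate unique even though A alone is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 30                        # underdetermined: n < m
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

lam, L = 0.1, np.eye(m)              # Tikhonov / ridge regularization, L = I

# arg min_x ||y - A x||^2 + lam * ||L x||^2 as a stacked least squares problem.
A_stack = np.vstack([A, np.sqrt(lam) * L])
y_stack = np.concatenate([y, np.zeros(m)])
x_hat, *_ = np.linalg.lstsq(A_stack, y_stack, rcond=None)

# x_hat satisfies the regularized normal equations (A^T A + lam I) x = A^T y.
```

The stacking trick is convenient in practice because any well-tested least squares routine then handles the regularized problem directly.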

#### 3.4. Maximum likelihood estimation

If the probability distribution function (pdf) of the measurement y, parameterized by an unknown deterministic input signal x, is available, then the maximum likelihood estimate of x is given by,

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg}\, \text{max}} \, f(\mathfrak{y}|\mathfrak{x}).\tag{12}
$$

This maximum likelihood estimate x̂ is obtained by choosing the value of x for which the observed measurement y is most likely, i.e., for which f(y|x) is maximized. In maximum likelihood estimation, the negative log of the likelihood function f(y|x) is typically used to transform Eq. (12) into a simpler minimization problem. When f(y|x) is a Gaussian distribution, N(Ax, C), minimizing the negative log-likelihood is equivalent to solving the weighted least squares estimation problem.

#### 3.5. Bayesian estimation

If the conditional pdf of the measurement y given an unknown random input signal x is known, and the marginal pdf of x, representing prior information about x, is also given, then Bayesian estimation is possible. The first step to obtain one of the many possible Bayesian estimates of x is to use Bayes' rule to obtain the a posteriori pdf,

$$f(\mathbf{x}|\mathbf{y}) = \frac{f(\mathbf{y}|\mathbf{x})f(\mathbf{x})}{\int f(\mathbf{y}|\mathbf{x})f(\mathbf{x})\,d\mathbf{x}} \,. \tag{13}$$

Once this a posteriori pdf is known, different Bayesian estimates x̂ could be obtained. For example, the minimum mean square error (MMSE) estimate is the posterior mean,

$$
\widehat{\mathfrak{X}}\_{\text{MMSE}} = E[\mathfrak{x}|\mathfrak{y}] = \int \mathfrak{x}\, f(\mathfrak{x}|\mathfrak{y})\, d\mathfrak{x} = \int \mathfrak{x}\, \frac{f(\mathfrak{y}|\mathfrak{x})f(\mathfrak{x})}{\int f(\mathfrak{y}|\mathfrak{x})f(\mathfrak{x})\, d\mathfrak{x}}\, d\mathfrak{x}, \tag{14}
$$

while the maximum a posteriori (MAP) estimate is given by,

$$
\widehat{\mathfrak{X}}\_{\text{MAP}} = \arg\max\_{\mathfrak{x}} f(\mathfrak{x}|\mathfrak{y}) = \arg\max\_{\mathfrak{x}} f(\mathfrak{y}|\mathfrak{x}) f(\mathfrak{x}).\tag{15}
$$

We note that the maximum likelihood estimate, Eq. (12), is a special case of the MAP estimate, when f(x) is a uniform pdf over the entire domain of x. The use of prior information is essential to solve underdetermined inverse problems, but it also improves the numerical properties, e.g., condition number, of either linear overdetermined or linear square inverse problems.
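
For the Gaussian case, Eq. (15) has a closed form that the following sketch verifies numerically (all dimensions, variances, and the zero prior mean are illustrative assumptions): with likelihood N(Ax, s²I) and prior N(μ, τ²I), maximizing f(y|x)f(x) is the same as minimizing ∥y − Ax∥²/s² + ∥x − μ∥²/τ².

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 12                              # underdetermined measurement
A = rng.standard_normal((n, m))
mu = np.zeros(m)                          # prior mean
tau2, s2 = 1.0, 0.01                      # prior and noise variances

x = mu + np.sqrt(tau2) * rng.standard_normal(m)
y = A @ x + np.sqrt(s2) * rng.standard_normal(n)

# MAP estimate: solve (A^T A / s2 + I / tau2)(x - mu) = A^T (y - A mu) / s2.
x_map = mu + np.linalg.solve(A.T @ A / s2 + np.eye(m) / tau2,
                             A.T @ (y - A @ mu) / s2)

# The gradient of the negative log posterior vanishes at x_map.
grad = A.T @ (A @ x_map - y) / s2 + (x_map - mu) / tau2
print(np.max(np.abs(grad)))
```

Note that even though only 5 measurements constrain 12 unknowns, the prior makes the estimate unique; and because the posterior is Gaussian here, x_map coincides with the MMSE estimate of Eq. (14).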

#### 3.5.1. Bayesian least squares estimation


In least squares estimation, the vector x is assumed to be an unknown deterministic variable. However, in Bayesian least squares estimation, it is considered a vector of scalar random variables that satisfies statistical properties given by an a priori probability distribution function [5]. In addition, in least squares estimation, the L<sup>2</sup> norm of the measurement error is minimized, while in Bayesian least squares estimation, it is the estimation error, e = x̂ − x, not the measurement error, that is used [5]. Since x is assumed to be a random vector, the estimation error e will also be a random vector. Therefore, the Bayesian least squares estimate could be obtained by minimizing the conditional mean of the square of the estimation error, given the measurement y,

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \, E\left[ (\widehat{\mathfrak{x}} - \mathfrak{x})^T (\widehat{\mathfrak{x}} - \mathfrak{x}) | y \right]. \tag{16}
$$

When x has a Gaussian distribution and A represents a linear system, then measurement y will also have a Gaussian distribution. In this case, the Bayesian least squares estimate given by Eq. (16) could be reinterpreted as a regularized least squares estimate given by,

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \|\mathfrak{y} - \mathbf{A}\mathfrak{x}\|\_2^2 + \|\mathfrak{x} - \boldsymbol{\mu}\|\_2^2,\tag{17}
$$

where μ is the mean of the a priori distribution of x [5]. Therefore, a least squares Bayesian estimate is analogous to a regularized least squares estimate, where a priori information about x is expressed as additional constraints on x in the form of a regularization term.
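This equivalence can be checked numerically. The sketch below (illustrative problem sizes, noise level, and prior chosen by us, using NumPy) computes the Gaussian posterior mean under a prior N(μ, τ²I) and compares it with the regularized least squares solution of the Eq. (17) form with λ = σ²/τ²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20                       # signal dimension, number of measurements
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
sigma, tau = 0.1, 2.0              # noise std and prior std (illustrative)
mu = np.zeros(n)                   # prior mean
y = A @ x_true + sigma * rng.standard_normal(m)

# Bayesian least squares estimate: posterior mean under the Gaussian prior
# N(mu, tau^2 I) and Gaussian measurement noise N(0, sigma^2 I).
P = np.linalg.inv(A.T @ A / sigma**2 + np.eye(n) / tau**2)
x_bayes = P @ (A.T @ y / sigma**2 + mu / tau**2)

# Regularized least squares with lambda = sigma^2 / tau^2 and the prior
# expressed as the penalty ||x - mu||^2 (closed-form normal equations).
lam = sigma**2 / tau**2
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y + lam * mu)

assert np.allclose(x_bayes, x_reg)  # the two estimates coincide
```

The regularization weight λ is the noise-to-prior variance ratio: a tighter prior (smaller τ) pulls the estimate harder toward μ.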

#### 3.5.2. Advantages of Bayesian estimation over other estimation methods

Bayesian estimation techniques could be used, given that a reliable a priori distribution is known, to obtain an accurate estimate of a signal x, even if the number of available measurements is smaller than the dimension of the signal to be estimated. In this underdetermined case, Bayesian estimation could accurately estimate a signal while unregularized least squares estimation or maximum likelihood estimation could not. The use of prior information in Bayesian estimation could also improve the numerical properties, e.g., the condition number, of either linear overdetermined or linear square inverse problems. This could be understood by keeping in mind the mathematical equivalence between obtaining one scalar measurement related to x and specifying one constraint that x has to satisfy. Therefore, as the number of available measurements increases significantly, both the Bayesian and maximum likelihood estimates would converge to the same estimate.

Bayesian estimation could also be easily adapted to estimate dynamic signals that change over time. This is achieved by sequentially using past estimates of a signal, e.g., x<sub>t−1</sub>, as prior information to estimate its current value x<sub>t</sub>. More generally, Bayesian estimation could be easily adapted for data fusion, i.e., the combination of multiple partial measurements to estimate a complete signal in remote sensing, stereo vision and tomographic imaging, e.g., positron emission tomography (PET), magnetic resonance imaging (MRI), computed tomography (CT) and optical coherence tomography (OCT). Bayesian methods could also easily fuse all available prior information to provide an estimate based on both the measurements and all known information about a signal.
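The sequential use of past estimates as priors can be sketched for the simplest case, a constant scalar with Gaussian prior and Gaussian noise (all parameters here are illustrative): the posterior after each measurement becomes the prior for the next, and the recursion reproduces the batch Bayesian estimate computed from all data at once.

```python
import numpy as np

rng = np.random.default_rng(1)
x_true, sigma = 3.0, 0.5                 # unknown constant and noise std
y = x_true + sigma * rng.standard_normal(50)   # sequential noisy measurements

# Sequential update: the posterior N(mean, var) after measurement t-1
# serves as the prior when processing measurement t.
mean, var = 0.0, 10.0**2                 # broad initial prior (illustrative)
for yt in y:
    k = var / (var + sigma**2)           # weight given to the new measurement
    mean = mean + k * (yt - mean)        # updated posterior mean
    var = (1 - k) * var                  # posterior variance shrinks

# The recursion matches the batch Bayesian estimate from all measurements.
var_batch = 1.0 / (1.0 / 10.0**2 + len(y) / sigma**2)
mean_batch = var_batch * (y.sum() / sigma**2)   # prior mean is zero
assert np.isclose(mean, mean_batch) and np.isclose(var, var_batch)
```

The gain k automatically balances the prior against the new data: early on the broad prior is quickly overridden, while later measurements produce progressively smaller corrections.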

Bayesian estimation techniques could be extended in straightforward ways to estimate output signals of nonlinear systems or signals that have complicated probability distributions. In these cases, numerical Bayesian estimates are typically obtained using Monte Carlo methods.

### 3.5.3. Sparsity as prior information for underdetermined Bayesian signal estimation

Sparse signal representation means the representation of a signal in a domain where most of its coefficients are zero. Depending on the nature of the signal, one could find an appropriate domain where it would be sparse. This notion could be useful in signal estimation because assuming that the unknown signal x is sparse could be used as prior information to obtain an accurate estimate of it, even if only a small number of measurements are available. The rest of this chapter will focus on using signal sparsity as prior information for underdetermined Bayesian signal estimation.

### 4. Sparse signal representation

As shown in Figure 3, a sinusoid is a dense signal in the time domain. However, it could be represented by a single value, i.e., it has a sparse representation, in the frequency domain.
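The Figure 3 example can be reproduced numerically: a sinusoid sampled over a whole number of cycles is nonzero at almost every time sample, yet its discrete Fourier transform has exactly two nonzero coefficients (the tone and its mirror bin). A short NumPy check with illustrative parameters:

```python
import numpy as np

n, f0 = 256, 8                        # samples, whole cycles per window
t = np.arange(n)
x = np.sin(2 * np.pi * f0 * t / n)    # dense in the time domain

X = np.fft.fft(x)
nonzero = int(np.sum(np.abs(X) > 1e-8))

# Sparse in the frequency domain: only bins f0 and n - f0 survive.
assert nonzero == 2
# Dense in the time domain: far more than half the samples are nonzero.
assert np.sum(np.abs(x) > 1e-8) > n // 2
```

The same signal is thus maximally non-sparse in one domain and maximally sparse in another; choosing the representation domain is what makes sparsity usable as prior information.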

We note that any signal could have a sparse representation in a suitable domain [15]. A sparse signal representation means a representation of the signal in a domain where most of its coefficients are zero. Sparse signal representations have many advantages including:

1. A sparse signal representation requires less memory for its storage. Therefore, it is a fundamental concept for signal compression.


#### 4.1. Signal representation using a dictionary

Figure 3. A sinusoid in time and frequency domains.

A dictionary D is a collection of vectors {φ<sub>n</sub>}<sub>n∈Γ</sub>, indexed by a parameter n ∈ Γ, where the size of Γ is at least equal to the dimension of a signal f, such that we could represent f as a linear combination [16],

$$f = \sum\_{n\in\Gamma} c\_n \phi\_n. \tag{18}$$

If the vectors {φ<sub>n</sub>}<sub>n∈Γ</sub> are linearly independent, then such a dictionary is called a basis. Representing a signal as a linear combination of sinusoids, i.e., using a Fourier dictionary, is very common. Wavelet dictionaries and chirplet dictionaries are also common dictionaries for signal representation. Dictionaries could be combined to obtain a larger dictionary, where the size of Γ is larger than the dimension of the signal f, which is called an overcomplete dictionary or a frame.

#### 4.1.1. Signal representation using a basis

A set of vectors forms a basis for R<sup>n</sup> if the vectors span R<sup>n</sup> and are linearly independent. A basis of a vector space V is a set X of linearly independent vectors such that every vector in V is a linear combination of elements of X. A vector space V is finite dimensional if it has a finite number of basis vectors [17].

Depending on the properties of {φ<sub>n</sub>}<sub>n∈Γ</sub>, bases could be classified into different types, e.g., orthogonal basis, orthonormal basis, biorthogonal basis, global basis and local basis. For an orthogonal basis, its basis vectors in the vector space V are mutually orthogonal,

$$
\langle \phi\_m, \phi\_n \rangle = 0 \text{ for } m \neq n. \tag{19}
$$

For an orthonormal basis, its basis vectors in the vector space V are mutually orthogonal and have unit length,

$$
\langle \phi\_m, \phi\_n \rangle = \delta(m - n),
\tag{20}
$$

where δ(m − n) is the Kronecker delta function. For a biorthogonal basis, its basis vectors are not orthogonal to each other, but they are orthogonal to the vectors of another basis, $\{\tilde{\phi}_n\}_{n\in\Gamma}$, such that

$$
\left< \phi\_m, \tilde{\phi}\_n \right> = \delta(m - n). \tag{21}
$$

In addition, depending on the domain (support) on which these basis vectors are defined, we could also classify a basis as either global or local. Sinusoidal basis vectors used for the discrete Fourier transform are defined on the entire domain (support) of f, so they are considered a global basis. Many wavelet basis vectors used for the discrete wavelet transform are defined on only part of the domain (support) of f, so they are considered a local basis.

### 4.1.2. Signal representation using a frame

A frame is a set of vectors {φ<sub>n</sub>}<sub>n∈Γ</sub> that spans R<sup>n</sup> and could be used to represent a signal f from the inner products {〈f, φ<sub>n</sub>〉}<sub>n∈Γ</sub>. A frame allows the representation of a signal as a set of frame coefficients, and its reconstruction from these coefficients in a numerically stable way,

$$f = \sum\_{n\in\Gamma} \left< f, \phi\_n \right> \phi\_n. \tag{22}$$

Frame theory analyzes the completeness, stability, and redundancy of linear discrete signal representations [18]. A frame is not necessarily a basis, but it shares many properties with bases. The most important distinction between a frame and a basis is that the vectors that comprise a basis are linearly independent, while those comprising a frame could be linearly dependent. Frames are also called overcomplete dictionaries. The redundancy in the representation of a signal using frames could be used to obtain sparse signal representations.
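As a small, hypothetical illustration of Eq. (22), the sketch below builds a tight (Parseval) frame for R²: three vectors at 120° to each other, scaled by √(2/3). The three vectors are necessarily linearly dependent, yet the frame-coefficient expansion of Eq. (22) reconstructs any signal exactly (for general, non-Parseval frames, reconstruction instead uses a dual frame).

```python
import numpy as np

# Three equally spaced unit directions in R^2, scaled so that the frame
# is Parseval (sum of outer products equals the identity).
angles = 2 * np.pi * np.arange(3) / 3
phi = np.sqrt(2 / 3) * np.stack([np.cos(angles), np.sin(angles)], axis=1)

f = np.array([1.7, -0.4])            # an arbitrary test signal
coeffs = phi @ f                     # frame coefficients <f, phi_n>
f_rec = coeffs @ phi                 # Eq. (22): f = sum_n <f, phi_n> phi_n

assert np.allclose(f_rec, f)         # exact, numerically stable reconstruction
# Redundant: 3 vectors spanning a 2-dimensional space.
assert phi.shape[0] == 3 and np.linalg.matrix_rank(phi) == 2
```

The redundancy (three coefficients for a two-dimensional signal) is exactly the property later exploited to choose, among the many valid coefficient vectors, a sparse one.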

### 4.2. Sparse signal representation as a regularized least squares estimation problem

If designed to concentrate the energy of a signal in a small number of dimensions, an orthogonal basis would be the minimum-size dictionary that could yield a sparse representation of this signal [15]. However, finding an orthogonal basis that yields a highly sparse representation for a given signal is usually difficult or impractical. To allow more flexibility, the orthogonality constraint is usually dropped, and overcomplete dictionaries (frames) are usually used. This idea is well explained in the following quote by Stephane Mallat:

"In natural languages, a richer dictionary helps to build shorter and more precise sentences. Similarly, dictionaries of vectors that are larger than bases are needed to build sparse representations of complex signals. Sparse representations in redundant dictionaries can improve pattern recognition, compression, and noise reduction but also the resolution of new inverse problems. This includes super resolution, source separation, and compressed sensing" [15].

Thus, representing a signal using a particular overcomplete dictionary has the following goals [16]:

• Sparsity—this representation should be more sparse than other representations.

• Super resolution—the resolution of the signal when represented using this dictionary should be higher than when represented in any other dictionary.

• Speed—this representation should be computed in O(n) or O(n log(n)) time.


A simple way to obtain an overcomplete dictionary A is to use a union of bases A<sub>i</sub>, which would result in the following representation of a signal y,

Sparsity in Bayesian Signal Estimation http://dx.doi.org/10.5772/intechopen.70529 289

$$\mathbf{y} = \underbrace{([\mathbf{A}\_1][\mathbf{A}\_2][\mathbf{A}\_3][\mathbf{A}\_4][\mathbf{A}\_5])}\_{\mathbf{A}}\,\mathbf{x} \Rightarrow \mathbf{y} = \mathbf{A}\mathbf{x},\tag{23}$$

where A is an n × m matrix representing the dictionary and x is the vector of coefficients representing y in the domain defined by A. Since A represents an overcomplete dictionary, the number of its rows will be less than the number of its columns. Eq. (23) is a formulation of the signal representation problem as an underdetermined inverse problem.

To obtain a sparse solution for Eq. (23), one needs to find an m × 1 coefficient vector $\widehat{\mathbf{x}}$ such that,

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min }} \|\mathfrak{y} - \mathbf{A}\mathbf{x}\|\_2^2 + \lambda \|\mathfrak{x}\|\_0,\tag{24}
$$

where ‖x‖<sub>0</sub> is the cardinality of the vector x, i.e., its number of nonzero elements, and λ > 0 is a regularization parameter that quantifies the tradeoff between the signal representation error, ‖y − Ax‖<sub>2</sub><sup>2</sup>, and its sparsity level, ‖x‖<sub>0</sub> [19]. The cardinality of the vector x is sometimes referred to as the L<sup>0</sup> norm of x, even though ‖x‖<sub>0</sub> is actually a pseudo norm that does not satisfy the requirements of a norm in R<sup>m</sup>. This sparse signal representation problem, Eq. (24), has a form similar to the regularized least squares estimation problem, Eq. (11), which would be underdetermined in the case of an overcomplete dictionary. Because of the correspondence between regularized least squares estimation and Bayesian estimation, the problem of finding a sparse representation of a signal could be formulated as a Bayesian estimation problem.

### 5. Compressed sensing


Compressed sensing involves the estimation of a signal using a number of measurements that are significantly less than its dimension [20]. By assuming that the unknown signal is sparse in the domain where the measurements were acquired, one could use this sparsity constraint as prior information to obtain an accurate estimate of the signal from relatively few measurements.

Compressed sensing is closely related to signal compression, which is routinely used for efficient storage or transmission of signals. Compressed sensing was inspired by the following question: instead of the typical signal acquisition followed by signal compression, is there a way to acquire (sense) the compressed signal in the first place? If so, it would significantly reduce the number of measurements and the computational cost [20]. In addition, this possibility would allow the acquisition of signals that would otherwise require extremely high, hence impractical, sampling rates [21]. As an affirmative answer to this question, compressed sensing was developed to combine signal compression with signal acquisition [20]. This is achieved by designing the measurement setup to acquire signals in the domain where the unknown signal is assumed to be sparse.

In compressed sensing, we consider the estimation of an input signal x∈ R<sup>n</sup> from m linear measurements, where m ≪ n. As discussed above, this problem could be written as an underdetermined linear system,

$$y = A\mathfrak{x},\tag{25}$$

where y ∈ R<sup>m</sup> and A ∈ R<sup>m×n</sup> represent the measurements and the measurement (sensing) matrix, respectively.

Assuming that the unknown signal x is s-sparse, i.e., x ∈ Σ<sub>s</sub> has only s nonzero elements, in the domain specified by the measurement (sensing) matrix A, and assuming that A satisfies the restricted isometry property (RIP) of order 2s, i.e., there exists a constant δ<sub>2s</sub> ∈ (0, 1) such that,

$$(1 - \delta\_{2s})||z||\_2^2 \le ||Az||\_2^2 \le (1 + \delta\_{2s})||z||\_2^2,\tag{26}$$

for all z ∈ Σ<sub>2s</sub>, then x could be reconstructed from m ≥ s measurements by different optimization algorithms [20]. When the measurements y are noiseless, x could be exactly estimated from,

$$\min\_{\mathbf{x}} ||\mathbf{x}||\_0 \text{ subject to } A\mathbf{x} = \mathbf{y}. \tag{27}$$

However, when the measurements y are contaminated by noise, x could be obtained as the regularized least squares estimate,

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \, \|\mathbf{A}\mathfrak{x} - \mathfrak{y}\|\_{2}^{2} + \lambda \, \|\mathfrak{x}\|\_{0}.\tag{28}
$$

This minimization problem could also be mathematically reformulated and solved as a Bayesian estimation problem.
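Recovery from m ≪ n measurements can be illustrated with Orthogonal Matching Pursuit, a standard greedy approximation to Eq. (27). The sketch below uses illustrative sizes chosen by us (a 4-sparse signal of dimension 100, 60 random Gaussian measurements); with this much oversampling relative to the sparsity level, exact recovery is the typical outcome, though greedy methods carry no unconditional guarantee.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 100, 60, 4                    # dimension, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x                               # noiseless measurements, Eq. (25)

# Orthogonal Matching Pursuit: greedily grow the support one column at a
# time, refitting by least squares on the selected columns at every step.
support, r = [], y.copy()
for _ in range(s):
    support.append(int(np.argmax(np.abs(A.T @ r))))   # most correlated column
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    r = y - A[:, support] @ coef        # residual orthogonal to chosen columns

x_hat = np.zeros(n)
x_hat[support] = coef
assert np.allclose(x_hat, x, atol=1e-8)
```

Note that the 60×100 system is underdetermined, so unregularized least squares could not single out x; the sparsity prior is what makes the problem well posed.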

### 6. Obtaining sparse solutions for signal representation and signal estimation problems

From Sections 4 and 5, we note that the problem of obtaining a sparse signal representation, Eq. (24), and the problem of sparse signal estimation in compressed sensing, Eq. (28), both have the same mathematical form [11, 22],

$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \; \|\mathfrak{y} - \mathbf{A}\mathbf{x}\|\_2^2 + \lambda \|\mathfrak{x}\|\_0. \tag{29}
$$

In this section, we describe different approaches to solving this minimization problem. From Eq. (29), we note that the first term of its RHS, ‖y − Ax‖<sub>2</sub><sup>2</sup>, represents either the signal reconstruction error (sparse signal representation problem) or the measurement fitting error (sparse signal estimation in compressed sensing), while the second term of its RHS, ‖x‖<sub>0</sub>, represents the cardinality (number of nonzero coefficients) of the unknown signal. The regularization parameter λ specifies the tradeoff between these two terms in the objective function. The selection of an appropriate value of λ to balance the reconstruction, or fitting, error and signal sparsity is very important. Regularization theory and Bayesian approaches could provide ways to determine optimal values of λ [23–26].

Convex optimization problems are a class of optimization problems that are significantly easier to solve than nonconvex problems [34]. Another advantage of convex optimization problems is that any local solution, e.g., a local minimum, is guaranteed to be a global solution. We note that obtaining an exact solution for the minimization problem in Eq. (29) is difficult because it is nonconvex. Therefore, one could either seek an approximate solution to this nonconvex problem or approximate this problem by a convex optimization problem whose exact solution could be obtained easily.

Considering the general regularized least squares estimation problem,


$$
\widehat{\mathfrak{X}} = \underset{\mathfrak{x}}{\text{arg min}} \; \|\mathfrak{y} - \mathbf{A}\mathbf{x}\|\_{2}^{2} + \lambda \|\mathfrak{x}\|\_{p}, \tag{30}
$$

we note that it is a nonconvex optimization problem for 0 ≤ p < 1 and a convex optimization problem for p ≥ 1. One alternative is to approximate Eq. (29) by a convex optimization problem: one could relax the strict condition of minimizing the cardinality of the signal, ‖x‖<sub>0</sub>, by replacing it by the sparsity-promoting condition of minimizing the L<sup>1</sup> norm of the signal, ‖x‖<sub>1</sub>. Another alternative is to approximate Eq. (29) by a different nonconvex optimization problem that is easier to solve than the original problem using a Bayesian formulation, i.e., to replace ‖x‖<sub>0</sub> by ‖x‖<sub>p</sub>, 0 < p < 1. The minimization of Eq. (30) using ‖x‖<sub>p</sub>, 0 < p < 1, would result in a higher degree of signal sparsity than when ‖x‖<sub>1</sub> is used. This could be understood visually by examining Figure 4, which shows the shapes of two-dimensional unit balls using (pseudo)norms with different values of p.

We explain further details in the following subsections.

### 6.1. Obtaining a sparse signal solution using L<sup>0</sup> minimization

The sparsest solution of the regularized least squares estimation problem, Eq. (29), would be obtained when p = 0 in ‖x‖<sub>p</sub>. As shown in Figure 5, the solution of the regularized least squares problem, $\widehat{\mathbf{x}}$, is given by the intersection of the circles, possibly ellipses, representing the

Figure 4. Two-dimensional unit ball using different (pseudo)norms. (a) L<sup>0</sup>, (b) L<sub>0−1</sub>, and (c) L<sup>1</sup>.

Figure 5. Regularized least squares using L0.

solution of the unconstrained least squares estimation problem and the unit ball of the L<sup>0</sup> pseudo norm representing the constraint of minimizing ‖x‖<sub>0</sub>. In this case, the unconstrained least squares solution will always intersect the unit ball at an axis, thus yielding the sparsest possible solution. However, as mentioned earlier, this L<sup>0</sup> minimization problem is difficult to solve because it is nonconvex. Approximate solutions for this problem could be obtained using greedy optimization algorithms, e.g., Matching Pursuit [27] and Least Angle Regression (LARS) [28].
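One special case is worth noting: when A is orthonormal, Eq. (29) has an exact closed-form solution, because the objective decouples coordinate-wise and is minimized by hard thresholding of Aᵀy. The sketch below (illustrative data) verifies this against a brute-force search over all supports; for a general A no such shortcut exists, which is why the greedy algorithms above are needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 8, 0.5
A, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal matrix
y = rng.standard_normal(n)

# For orthonormal A, ||y - Ax||^2 = ||z - x||^2 with z = A^T y, so the
# exact L0-regularized minimizer keeps z_i exactly when z_i^2 > lam.
z = A.T @ y
x_hard = np.where(z**2 > lam, z, 0.0)

def objective(x):
    return np.sum((y - A @ x) ** 2) + lam * np.count_nonzero(x)

# Exhaustive check over all 2^n support patterns (optimal x on a given
# support is z restricted to that support).
best = min(
    objective(np.where(np.array(list(np.binary_repr(k, n)), dtype=int) == 1, z, 0.0))
    for k in range(2**n)
)
assert np.isclose(objective(x_hard), best)   # hard thresholding is optimal
```

The brute-force search costs 2ⁿ evaluations, which is exactly the combinatorial explosion that makes general L⁰ minimization intractable.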

### 6.2. Obtaining a sparse signal solution using L<sup>1</sup> minimization

On relaxing the nonconvex regularized least squares problem with L<sup>0</sup> minimization by setting p = 1, we obtain the convex L<sup>1</sup> minimization problem. As shown in Figure 4(c), the unit ball of the L<sup>1</sup> norm covers a larger area than the unit ball of the L<sup>0</sup> pseudo norm, shown in Figure 4(a). Therefore, as shown in Figure 6, the solution of the regularized least squares problem using L<sup>1</sup> minimization would be sparse, but it should not be expected to be as sparse as that of the L<sup>0</sup> minimization problem.

This L<sup>1</sup> minimization problem could be solved easily using various algorithms, e.g., Basis Pursuit [16], the Method of Frames (MOF) [29], the Lasso [30, 31], and Best Basis Selection [32, 33]. A Bayesian formulation of this L<sup>1</sup> minimization problem is also possible by assuming that the a priori probability distribution of x is Laplacian, x ∼ e<sup>−|x|</sup>.
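One of the simplest solvers for this convex L¹ problem is iterative soft thresholding (ISTA), a proximal-gradient method; the soft-thresholding step is the proximal operator of the L¹ norm, consistent with the Laplacian-prior interpretation above. A sketch under illustrative sizes, data, and λ chosen by us:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, lam = 30, 60, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]      # sparse ground truth
y = A @ x_true                               # noiseless measurements

# ISTA for  min_x ||y - Ax||_2^2 + lam * ||x||_1
L = np.linalg.norm(A, 2) ** 2                # spectral norm squared
x = np.zeros(n)
for _ in range(2000):
    g = x - (A.T @ (A @ x - y)) / L          # gradient step (step size 1/(2L))
    x = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # soft threshold

assert np.count_nonzero(np.abs(x) > 1e-3) <= 10   # the solution is sparse
assert np.linalg.norm(A @ x - y) < 0.5            # and fits the measurements
```

Unlike hard thresholding, the soft threshold shrinks the surviving coefficients slightly toward zero, which is the well-known bias of L¹ (Lasso) solutions.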

### 6.3. Obtaining a sparse signal solution using L<sub>0−1</sub> minimization

As discussed above, solving the regularized least squares problem with L<sup>0</sup> minimization should yield the sparsest signal solution. However, only approximate solutions are available for this difficult nonconvex problem. Alternatively, solving the regularized least squares problem with L<sup>1</sup> minimization should yield an exact sparse solution that would be less sparse than in the L<sup>0</sup> case, but it is considerably easier to obtain.

Figure 6. Regularized least squares using L1.


The regularized least squares problem could also be formulated as an L<sub>0−1</sub> minimization problem. As ‖x‖<sub>p</sub>, 0 < p < 1, which we abbreviate as L<sub>0−1</sub>, is not an actual norm, this optimization problem would be nonconvex [34]. The advantage of using L<sub>0−1</sub> minimization is that, as shown in Figure 4(b), compared to the unit ball of the L<sup>1</sup> norm, the unit ball of the L<sub>0−1</sub> pseudo norm has a narrower area that is concentrated around the axes. Therefore, as shown in Figure 7, the L<sub>0−1</sub> minimization problem should yield a sparser solution than the L<sup>1</sup> minimization problem.

Figure 7. Regularized least squares using L<sub>0−1</sub>.

Figure 8. Product of two student-t probability distributions.

Another advantage of using L<sub>0−1</sub> minimization is that this nonconvex optimization problem could be easily formulated as a Bayesian estimation problem that could be solved using Markov Chain Monte Carlo (MCMC) methods. As shown in Figure 8, the product of student-t probability distributions has a shape similar to the unit ball of the L<sub>0−1</sub> pseudo norm, so student-t distributions could be used as a priori distributions to approximate the L<sub>0−1</sub> pseudo norm.
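The usefulness of the student-t prior rests on its Gaussian scale-mixture representation (made precise in Eq. (34) in the next subsection): a student-t density is an infinite mixture of zero-mean Gaussians whose precision h follows a Gamma(ϑ/2, ϑ/2) distribution. A quick numerical quadrature check, with ϑ and the evaluation point chosen for illustration:

```python
import numpy as np
from math import gamma as gamma_fn, pi, sqrt

v, x = 3.0, 1.3                          # degrees of freedom, test point

# Closed-form student-t density for a single coordinate.
t_pdf = (gamma_fn((v + 1) / 2) / (sqrt(v * pi) * gamma_fn(v / 2))
         * (1 + x**2 / v) ** (-(v + 1) / 2))

# Mixture side: integrate N(x | 0, 1/h) against Gamma(h | v/2, v/2)
# over the hidden precision h, by trapezoidal quadrature.
h = np.linspace(1e-6, 50.0, 200_000)
a = b = v / 2                            # Gamma shape and rate
gauss = np.sqrt(h / (2 * pi)) * np.exp(-0.5 * h * x**2)
gam = b**a * h ** (a - 1) * np.exp(-b * h) / gamma_fn(a)
f = gauss * gam
mixture = (h[1] - h[0]) * (f.sum() - 0.5 * (f[0] + f[-1]))

assert abs(mixture - t_pdf) < 1e-4       # the two densities agree
```

This identity is what allows MCMC schemes to sample the conditionally Gaussian x alongside the hidden precisions, instead of working with the intractable L<sub>0−1</sub> objective directly.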

### 6.4. Bayesian method to obtain a sparse signal solution using L<sub>0−1</sub> minimization

As mentioned in Section 3.5, the first step to obtaining one of the many possible Bayesian estimates of x is to use Bayes rule to obtain the a posteriori pdf,

$$f(\mathbf{x}|\mathbf{y}) = \frac{f(\mathbf{y}|\mathbf{x})f(\mathbf{x})}{\int f(\mathbf{y}|\mathbf{x})f(\mathbf{x})\,d\mathbf{x}} \,. \tag{31}$$

Using this a posteriori distribution, one could obtain a sparse signal solution using L<sub>0−1</sub> minimization as the maximum a posteriori (MAP) estimate given by Eq. (15). Compared to other Bayesian estimates, the MAP estimate could be easier to obtain because the calculation of the normalizing constant, ∫ f(y|x)f(x) dx, would not be needed. Maximizing the product of the conditional probability distribution of y given x and the a priori distribution of x is equivalent to minimizing the sum of their negative logarithms,

$$
\widehat{\mathbf{x}}_{\text{MAP}} = \underset{\mathbf{x}}{\text{arg min}} \left[ -\log p(\mathbf{y}|\mathbf{x}) - \log p(\mathbf{x}) \right]. \tag{32}
$$

In the case of white Gaussian measurement noise, p(**y**|**x**) ~ N<sub>**x**</sub>(**Ax**, σ<sup>2</sup>**I**), so −log p(**y**|**x**) ∝ ‖**y** − **Ax**‖<sub>2</sub><sup>2</sup>, which is the first term of the RHS of Eq. (30). As discussed in the previous section, the a priori probability p(**x**) corresponding to L<sup>0<p<1</sup> minimization could be represented as a product of univariate student-t probability distribution functions [14],

$$p(\mathbf{x}) = \prod_{i=1}^{M} \operatorname{stud}_{x_i}[0, 1, \vartheta] = \prod_{i=1}^{M} \frac{\Gamma\left(\frac{\vartheta + 1}{2}\right)}{\sqrt{\vartheta\pi}\, \Gamma\left(\frac{\vartheta}{2}\right)} \left(1 + \frac{x_{i}^{2}}{\vartheta}\right)^{-\frac{\vartheta + 1}{2}},\tag{33}$$

where Γ is the Gamma function, and ϑ is the number of degrees of freedom of the student-t distribution. Since this a priori distribution function is not an exponential function, we would use Eq. (15) instead of Eq. (32) to obtain the MAP estimate.

Because the prior is not a Gaussian distribution, there is no simple closed-form expression for the posterior p(**x**|**y**) with a student-t a priori probability distribution. However, we could express each student-t distribution as an infinite weighted sum of Gaussian distributions, where hidden variables *h<sub>i</sub>* determine their variances [14],

$$p(\mathbf{x}) = \prod_{i=1}^{M} \int \mathrm{N}_{x_i}\left(0, 1/h_i\right) \mathrm{Gam}_{h_i}\left[\vartheta/2, \vartheta/2\right] dh_i = \int \mathrm{N}_{\mathbf{x}}\left(\mathbf{0}, \mathbf{H}^{-1}\right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\left[\vartheta/2, \vartheta/2\right] d\mathbf{H}, \tag{34}$$

where the matrix **H** contains the hidden variables *h*<sub>1</sub>, …, *h<sub>M</sub>* on its diagonal and zeros elsewhere, and Gam<sub>*h<sub>i</sub>*</sub>[ϑ/2, ϑ/2] is the gamma probability distribution function with parameters (ϑ/2, ϑ/2). Using this representation, the a posteriori pdf could be written as

$$\begin{split} p(\mathbf{x}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{x})p(\mathbf{x}) &= \mathrm{N}_{\mathbf{x}}\left(\mathbf{A}\mathbf{x}, \sigma^{2}\mathbf{I}\right) \int \mathrm{N}_{\mathbf{x}}\left(\mathbf{0}, \mathbf{H}^{-1}\right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\left[\frac{\vartheta}{2}, \frac{\vartheta}{2}\right] d\mathbf{H} \\ &= \int \mathrm{N}_{\mathbf{x}}\left(\mathbf{A}\mathbf{x}, \sigma^{2}\mathbf{I}\right) \mathrm{N}_{\mathbf{x}}\left(\mathbf{0}, \mathbf{H}^{-1}\right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\left[\frac{\vartheta}{2}, \frac{\vartheta}{2}\right] d\mathbf{H}. \end{split} \tag{35}$$
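The scale-mixture representation in Eq. (34) can be verified numerically in one dimension: integrating N(*x*; 0, 1/*h*) Gam(*h*; ϑ/2, ϑ/2) over the hidden variable *h* should reproduce the student-t pdf of Eq. (33). A sketch using only the standard library, with an ad hoc integration grid:

```python
import math

def norm_pdf(x, var):
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gamma_pdf(h, shape, rate):
    return rate ** shape * h ** (shape - 1) * math.exp(-rate * h) / math.gamma(shape)

def student_t_pdf(x, dof):
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + x * x / dof) ** (-(dof + 1) / 2)

def mixture_pdf(x, dof, h_max=40.0, n=16000):
    """Midpoint-rule approximation of the integral over the hidden variable h."""
    dh = h_max / n
    return sum(norm_pdf(x, 1.0 / h) * gamma_pdf(h, dof / 2, dof / 2) * dh
               for h in ((k + 0.5) * dh for k in range(n)))
```

The two functions agree to within the quadrature error, which is the one-dimensional content of Eq. (34).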

The product of two Gaussian distributions is also a Gaussian distribution [35],

$$N\_{\mathbf{x}}\left(\boldsymbol{\mu}\_{1},\boldsymbol{\Sigma}\_{1}\right)N\_{\mathbf{x}}\left(\boldsymbol{\mu}\_{2},\boldsymbol{\Sigma}\_{2}\right) = kN\_{\mathbf{x}}(\boldsymbol{\mu},\boldsymbol{\Sigma}),\tag{36}$$

where the mean and covariance (**μ**, **Σ**) of the new Gaussian distribution in Eq. (36) are given by

$$\boldsymbol{\mu} = \left(\boldsymbol{\Sigma\_1}^{-1} + \boldsymbol{\Sigma\_2}^{-1}\right)^{-1} \left(\boldsymbol{\Sigma\_1}^{-1}\boldsymbol{\mu\_1} + \boldsymbol{\Sigma\_2}^{-1}\boldsymbol{\mu\_2}\right) \text{and} \\ \boldsymbol{\Sigma} = \left(\boldsymbol{\Sigma\_1}^{-1} + \boldsymbol{\Sigma\_2}^{-1}\right)^{-1}, \tag{37}$$

and *k* is a constant. Therefore, we could simplify the product of the two Gaussian distributions in Eq. (35) as,

$$\mathrm{N}_{\mathbf{x}}\left(\mathbf{A}\mathbf{x},\sigma^{2}\mathbf{I}\right)\, \mathrm{N}_{\mathbf{x}}\left(\mathbf{0},\mathbf{H}^{-1}\right) = k\, \mathrm{N}_{\mathbf{x}}\left(\left(\sigma^{-2}\mathbf{I}+\mathbf{H}\right)^{-1}\left(\sigma^{-2}\mathbf{A}\mathbf{x}\right),\left(\sigma^{-2}\mathbf{I}+\mathbf{H}\right)^{-1}\right).\tag{38}$$
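The identity in Eqs. (36)–(37) is easy to check numerically in one dimension (the means and variances below are arbitrary): the pointwise product of the two Gaussian pdfs, divided by the pdf of the combined Gaussian, gives the same constant k at every point.

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

mu1, v1 = 1.0, 2.0     # first Gaussian
mu2, v2 = -0.5, 0.5    # second Gaussian

# Combined variance and mean from Eq. (37), scalar case
v = 1.0 / (1.0 / v1 + 1.0 / v2)
mu = v * (mu1 / v1 + mu2 / v2)

# The ratio below should not depend on x; that x-independent value is k
k_values = [gauss_pdf(x, mu1, v1) * gauss_pdf(x, mu2, v2) / gauss_pdf(x, mu, v)
            for x in (-2.0, 0.3, 1.7)]
```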

From Eqs. (35) and (38) we could write p(x|y) as,


$$p(\mathbf{x}|\mathbf{y}) = k \int \mathrm{N}_{\mathbf{x}} \left( \left( \sigma^{-2} \mathbf{I} + \mathbf{H} \right)^{-1} \left( \sigma^{-2} \mathbf{A} \mathbf{x} \right), \left( \sigma^{-2} \mathbf{I} + \mathbf{H} \right)^{-1} \right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i} \left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] d\mathbf{H} . \tag{39}$$

We still could not compute the integral in Eq. (39) in closed form. However, we could maximize the RHS of Eq. (39) over the hidden variables H to obtain an approximation for the a posteriori probability distribution function

$$p(\mathbf{x}|\mathbf{y}) \approx \underset{\mathbf{H}}{\text{arg max}} \left[ \mathrm{N}_{\mathbf{x}} \left( \left( \sigma^{-2}\mathbf{I} + \mathbf{H} \right)^{-1} \left( \sigma^{-2}\mathbf{A}\mathbf{x} \right), \left( \sigma^{-2}\mathbf{I} + \mathbf{H} \right)^{-1} \right) \prod_{i=1}^{M} \mathrm{Gam}_{h_{i}} \left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] \right]. \tag{40}$$

Eq. (40) would be a good approximation of p(**x**|**y**) if the actual distribution over the hidden variables is concentrated tightly around its mode [14]. When *h<sub>i</sub>* has a large value, the corresponding *i*th component of the a priori probability distribution function p(**x**) would have a small variance, 1/*h<sub>i</sub>*, so this *i*th component of p(**x**) could be set to zero. Therefore, this *i*th dimension of the prior p(**x**) would not contribute to the solution of Eq. (30), thus increasing its sparsity.

Since both Gaussian and gamma pdfs in Eq. (40) are members of the exponential family of probability distributions, we could obtain x̂<sub>MAP</sub> by maximizing the sum of their logarithms. Section 3.5 in [11] and Section 8.6 in [14] describe an iterative optimization method to obtain x̂<sub>MAP</sub> from the approximate a posteriori probability distribution function given by Eq. (40).
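A minimal numerical sketch of such an iterative scheme (not the exact algorithm of [11] or [14]; the problem size, ϑ, noise level and iteration count are all illustrative) alternates the MAP estimate of **x** for fixed **H** with the EM-style posterior-mean update of the hidden variables, *h<sub>i</sub>* = (ϑ + 1)/(ϑ + *x<sub>i</sub>*²):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 40                      # underdetermined: 20 measurements, 40 unknowns
A = rng.standard_normal((N, M))
x_true = np.zeros(M)
x_true[[3, 17, 31]] = [1.5, -2.0, 1.0]   # sparse ground truth
sigma = 0.01
y = A @ x_true + sigma * rng.standard_normal(N)

theta = 1e-6                       # tiny dof makes the student-t prior strongly sparsifying
h = np.ones(M)                     # hidden precision variables (diagonal of H)
for _ in range(100):
    # MAP estimate of x for the current H (a ridge-type solve)
    x_map = np.linalg.solve(A.T @ A + sigma ** 2 * np.diag(h), A.T @ y)
    # posterior mean of each hidden variable given x
    h = (theta + 1.0) / (theta + x_map ** 2)
```

Each pass re-weights the quadratic penalty by roughly 1/*x<sub>i</sub>*², so coefficients that shrink once keep shrinking, while large coefficients are left almost untouched.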

### 7. Conclusion

In this chapter, we described different methods to estimate an unknown signal from its linear measurements. We focused on the underdetermined case where the number of measurements is less than the dimension of the unknown signal. We introduced the concept of signal sparsity and described how it could be used as prior information for either regularized least squares or Bayesian signal estimation. We discussed compressed sensing and sparse signal representation as examples where these sparse signal estimation methods could be applied.

### Author details

Ishan Wickramasingha<sup>1</sup>, Michael Sobhy<sup>2</sup> and Sherif S. Sherif<sup>1</sup>\*

\*Address all correspondence to: sherif.sherif@umanitoba.ca

1 Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Canada

2 Biomedical Engineering Graduate Program, University of Manitoba, Winnipeg, Canada

### References


[1] Keesman KJ. System Identification: An Introduction. London: Springer Science & Business Media; 2011

[2] Von Bertalanffy L. General System Theory. New York; 1968

[3] Chen C-T. Linear System Theory and Design. New York, NY: Oxford University Press, Inc.; 1999



[22] Huang K, Aviyente S. Sparse representation for signal classification. In: NIPS. Vol. 19; 2006. pp. 609-616

[23] Poggio T, Torre V, Koch C. Computational vision and regularization theory. Nature. 1985 Sep 26;317(6035):314-319

[24] Tikhonov AN, Arsenin VI. Solutions of Ill-posed Problems. Washington, DC: Winston; 1977

[25] Wahba G, Wendelberger J. Some new mathematical methods for variational objective analysis using splines and cross validation. Monthly Weather Review. 1980 Aug;108(8):1122-1143

[26] Lin Y, Lee DD. Bayesian L1-norm sparse learning. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings. Vol. 5. Toulouse, France; 2006. p. V-V

[27] Mallat SG, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing. 1993 Dec;41(12):3397-3415

[28] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004 Apr;32(2):407-499

[29] Daubechies I. Time-frequency localization operators: A geometric phase space approach. IEEE Transactions on Information Theory. 1988 Jul;34(4):605-612

[30] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996 Jan;58(1):267-288

[31] Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49-67

[32] Coifman RR, Wickerhauser MV. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory. 1992 Mar;38(2):713-718

[33] Rao BD, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE Transactions on Signal Processing. 1999 Jan;47(1):187-200

[34] Boyd S, Vandenberghe L. Convex Optimization. New York: Cambridge University Press; 2004

[35] Bromiley P. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo. 2003;3(4):1-13

### **Dynamic Bayesian Network for Time-Dependent Classification Problems in Robotics**

Cristiano Premebida, Francisco A. A. Souza and Diego R. Faria

Additional information is available at the end of the chapter

DOI: 10.5772/intechopen.70059

#### **Abstract**

This chapter discusses the use of dynamic Bayesian networks (DBNs) for time-dependent classification problems in mobile robotics, where Bayesian inference is used to infer the class, or category of interest, given the observed data and prior knowledge. By formulating the DBN as a time-dependent classification problem and making some assumptions, a general expression for the DBN is obtained in terms of classifier priors and likelihoods through the time steps. Since multi-class problems are addressed, and because of the number of time slices in the model, additive smoothing is used to prevent the values of the priors from approaching zero. To demonstrate the effectiveness of DBNs in time-dependent classification problems, experimental results are reported on semantic place recognition and daily-activity classification.

**Keywords:** dynamic Bayesian network, Bayesian inference, probabilistic classification, mobile robotics, social robotics

### **1. Introduction**

Bayesian inference finds applications in many areas of engineering, and mobile robotics is no exception. When time is a variable to be taken into account, the dynamic Bayesian network (DBN) [1–5] is a powerful approach. Due to its graphical representation and modelling versatility, the DBN facilitates the problem-solving process in probabilistic time-dependent applications. DBNs therefore provide an effective way to model time-based (dynamic) probabilistic problems, while also enabling a very suitable and intuitive graph-based representation.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Depending on the structure of the DBN, the joint probabilistic distribution that governs a given system can be decomposed into a tractable product of probabilities, where the conditional terms only depend on their directly linked nodes. This chapter concentrates on inference problems using DBNs where the variable to be inferred from a feature vector (data) represents a set of semantic classes *C* = {*c*<sub>1</sub>, *c*<sub>2</sub>, …, *c<sub>nc</sub>*}, or categories, in the context of intelligent perception systems for mobile robotics applications. Namely, we will address problems where *C* denotes semantic places in a given indoor environment [6, 7], e.g. *C* = {'*corridor*', '*office*', …, '*kitchen*'}, and also problems where the classes of interest are daily-life activities, *C* = {'*drinking*', '*talking*', …, '*walking*'} [8, 9].

The principle of Bayesian inference basically depends on two elements: the prior and the likelihood; in practical problems, the evidence probability acts 'only' as a normalization to guarantee that the posterior sums to one. In this chapter, we will deal with problems of the classical Bayesian form *posterior* ∝ *likelihood* ⋅ *prior*, but the incorporation of (past) time will be explicitly modelled on a discrete-time basis, and the *past* information is assumed to be contained in the prior probabilities. Inference will be considered beyond the first-order Markov assumption, which means that a DBN with a finite number of time slices (*T*) will be addressed. The current time step *t* and previous/past time steps will be considered in the formulation of the DBN; thus, the time interval is {*t*, *t* − 1, …, *t* − *T*}.

The observed data enter the DBN in the form of a vector of features *X* calculated from sensory data; examples of sensors are laser scanners (or 2D Lidar) and RGB-D cameras, as shown in **Figure 1**. Later, in the formulation of the DBN, we will consider that the feature vector at a given time step (*X<sup>t</sup>*) is conditionally independent of previous time steps; therefore, *P*(*X<sup>t</sup>* | *X<sup>t−1</sup>*) = *P*(*X<sup>t</sup>*).

The use of Bayesian inference in mobile robotics for purposes of localization, simultaneous localization and mapping (SLAM), object detection, path planning and navigation has been addressed in many scientific works; see Ref. [10] for a review. The majority of those applications involve stochastic filtering, such as the Kalman filter (KF), the particle filter (PF), Monte Carlo techniques and the hidden Markov model (HMM) [11, 12]. However, when the parameter of interest has to be inferred from multidimensional feature vectors (e.g. feature vectors with hundreds of elements), and also when the distribution from which the observed data were drawn is not known (in unseen/new or testing scenarios), then a DBN can be used to handle such complex problems. In robotics, semantic place classification [6, 7] and activity recognition [8, 9] are examples of such problems and belong to the research area of pattern recognition. For

**Figure 1.** Sensors commonly used in mobile robotics for perception systems.

these application cases, the class-conditional probabilities (or likelihoods) can be modelled using machine learning techniques, for example, naive Bayes classifier (NBC), support vector machines (SVMs) and artificial neural networks (ANNs) [13, 14].

The remainder of this chapter is organized as follows: a brief review of the DBN is given in Section 2. Section 3 addresses inference in DBN, formulated for purposes of pattern recognition in robotics, followed by the use of additive smoothing on the prior distributions. In Section 4, experimental results on semantic place classifications and activity recognition are presented. Finally, Section 5 presents our conclusions.

### **2. Preliminaries on DBN**


Basically, a DBN is used to express the joint probability of events that characterizes a time-based (dynamic) system, where the relationships between events are expressed by conditional probabilities. Given evidence (observations) about events of the DBN, and prior probabilities, statistical inference is accomplished using the Bayes theorem. Inference in pattern recognition applications is the process of estimating the probability of the classes/categories given the observations, the class-conditional probabilities, and the priors [15, 16]. When time is involved, the system is usually assumed to evolve according to the first-order Markov assumption and, as a consequence, a single time slice is considered.

In this chapter, we address DBN structures with more than one time slice. Moreover, the conditional probabilities of the DBN will be modelled by supervised machine learning techniques (also known as classifier or classification method). Two case studies will be particularly discussed: activity recognition for human-robot interaction and semantic place classification for mobile robotics navigation.

The observed data variable, denoted by *X* = {*X*<sub>1</sub>, …, *X<sub>nx</sub>*}, enters the DBN in the form of conditional probabilities *P*(*X*|*C*), where the values of *X* are feature vectors. To give an idea of the dimensionality of *X*, in semantic place classification [6] the number of features can be *nx* = 50, while in activity recognition we have 51 features [8]. Given such dimensionalities, which can be even higher, it becomes infeasible to estimate the probability distribution that characterizes *P*(*X*|*C*) without the use of advanced algorithms. Although a simple naïve Bayes classifier can be incorporated in a DBN to model *P*(*X*|*C*), more powerful solutions, such as the ensemble of classifiers in the DBMM approach introduced in Ref. [8], tend to achieve higher classification performance.

In summary, a DBN is a directed acyclic graph (DAG) that consists of a finite set of events (the nodes or vertices) connected through edges (or arcs) that model the dependencies among the events and also the time variable. Here, the nodes are given by the variables {*X*, *C*}, and the dynamic (time-based) behaviour of the DBN is considered to be governed by the current time *t* and by a finite set of previous time slices {*t* − 1, *t* − 2, …, *t* − *T*}. So, future time slices will not be considered. **Figure 2** shows the structure of the DBN, with *T* + 1 time slices, that will be considered in the problem formulation presented in the sequel.

**Figure 2.** An example of a DBN with *T* + 1 time slices and two nodes {*C, X*}.

### **3. Inference with DBN**

The problem is formulated by considering *P*(*X<sup>t</sup>*, *X<sup>t−1</sup>*, …, *X<sup>t−T</sup>*, *C<sup>t</sup>*, *C<sup>t−1</sup>*, …, *C<sup>t−T</sup>*), i.e., the joint distribution of the nodes over the time up to *T*. The goal is to infer the current-time value of the class *C<sup>t</sup>* given the data *X<sup>t:t−T</sup>* = {*X<sup>t</sup>*, *X<sup>t−1</sup>*, …, *X<sup>t−T</sup>*} and the prior knowledge of the class, which is attained by the a-posteriori probability *P*(*C<sup>t</sup>* | *C<sup>t−1:t−T</sup>*, *X<sup>t:t−T</sup>*). The superscript notation denotes the set of values over a time interval: {*t*:*t* − *T*} = {*t*, *t* − 1, *t* − 2, …, *t* − *T*}.

The simplest case is for a single time slice, where the posterior reduces to *P*(*C<sup>t</sup>* | *X<sup>t</sup>*) ∝ *P*(*X<sup>t</sup>* | *C<sup>t</sup>*)*P*(*C<sup>t</sup>*). For two time slices, we have

$$P(C^{t} \mid C^{t-1}, X^{t:t-1}) \propto P(X^{t} \mid X^{t-1}, C^{t:t-1})\, P(X^{t-1} \mid C^{t:t-1})\, P(C^{t} \mid C^{t-1})\, P(C^{t-1}).\tag{1}$$

As the number of time slices increases, the problem of inferring the class becomes more complex; therefore, some assumptions can be made in order to find a tractable solution. As a first assumption, let the nodes be independent of later (subsequent in time) nodes. As a consequence, and taking the example of *T* = 1, the probability *P*(*X<sup>t−1</sup>* | *C<sup>t:t−1</sup>*) = *P*(*X<sup>t−1</sup>* | *C<sup>t−1</sup>*); that is, the node *X<sup>t−1</sup>* does not depend on the node *C<sup>t</sup>*, which is one time slice later. The second, stronger assumption is that the feature-vector node *X* is independent across time slices; hence, following the previous example, *P*(*X<sup>t</sup>* | *X<sup>t−1</sup>*, *C<sup>t:t−1</sup>*) becomes *P*(*X<sup>t</sup>* | *C<sup>t:t−1</sup>*). Given these two assumptions, we can state the general problem of calculating the posterior probability of a DBN with *T* + 1 time slices by the expression

$$P(C^{t} \mid C^{t-1:t-T}, X^{t:t-T}) = \frac{1}{\beta} \prod_{k=t}^{t-T} \left[ P(X^{k} \mid C^{k})\, P(C^{k}) \right],\tag{2}$$

where *β* is the scale (normalization) factor that guarantees that the values of the a-posteriori sum to one. The class-conditional probabilities *P*(*X<sup>k</sup>* | *C<sup>k</sup>*) come from a supervised classifier or from an ensemble of classifiers as in Ref. [8], while *P*(*C<sup>k</sup>*) assumes the value of the previous posterior probability; thus, *P*(*C<sup>t</sup>*) ← *posterior<sup>t−1</sup>*.

This strategy for 'updating' the values of the prior by taking the values of previous posteriors is a very common and effective technique used in Bayesian sequential systems. The steps involved in the calculation of the posterior probability, as expressed in Eq. (2), are illustrated in **Figure 3**.
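The recursive scheme of Eq. (2) and Figure 3, posterior ∝ likelihood · prior with the prior replaced by the previous posterior, can be sketched as a short loop; the likelihood rows below are made-up stand-ins for the outputs of a trained classifier:

```python
import numpy as np

def dbn_update(likelihoods, prior):
    """Sequential Bayesian class update over time slices.
    likelihoods: (T, nc) array; row t holds P(X^t | C = c_i) for each class.
    prior: (nc,) initial class prior. Returns the posterior at the last slice."""
    posterior = np.asarray(prior, dtype=float)
    for lik in np.asarray(likelihoods, dtype=float):
        posterior = lik * posterior
        posterior /= posterior.sum()   # the 1/beta normalization
    return posterior

# three time slices, two classes: evidence consistently favours class 0
post = dbn_update([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4]], [0.5, 0.5])
```

Because the normalization at each step is a scalar, the loop yields the same result as normalizing the full product of Eq. (2) once at the end.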

Selection of the class-conditional model to express *P*(*X*|*C*) is an important part of the approach and can be achieved by well-known probabilistic machine learning methods. Although generative methods (e.g. naïve Bayes, GMM and HMM) provide a direct probabilistic interpretation and, therefore, constitute appropriate choices, discriminative methods (e.g. SVM, random forest and ANN) tend to have better classification performance. However, to be a suitable model, a given discriminative method has to be of a probabilistic form; this implies, at least, that the outcomes from the classifier sum to one. A more advanced method can be used to model *P*(*X*|*C*) in a DBN, such as the dynamic Bayesian mixture model (DBMM) [8], where a mixture of *n* classifiers is used to model the conditional probability, which assumes the form *P*(*X*|*C*) = ∑<sub>*j*=1</sub><sup>*n*</sup> *ω<sub>j</sub>* *P<sub>j</sub>*(*X*|*C*), where the *ω<sub>j</sub>* are the weighting parameters and the *P<sub>j</sub>*(*X*|*C*) are the probabilities from the classifiers. Further details are provided in Ref. [6].
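The DBMM class-conditional *P*(*X*|*C*) = ∑<sub>*j*</sub> *ω<sub>j</sub>* *P<sub>j</sub>*(*X*|*C*) is simply a weighted average of the base classifiers' probability outputs; the two classifier output rows and the weights below are made-up placeholders:

```python
import numpy as np

def dbmm_likelihood(classifier_probs, weights):
    """Mixture of n base-classifier outputs over nc classes.
    classifier_probs: (n, nc) array; row j is P_j(X | C).
    weights: (n,) nonnegative weights omega_j that sum to one."""
    return np.asarray(weights, dtype=float) @ np.asarray(classifier_probs, dtype=float)

# e.g. an SVM-like and an ANN-like posterior over three classes
mixed = dbmm_likelihood([[0.7, 0.2, 0.1],
                         [0.5, 0.3, 0.2]], [0.6, 0.4])
```

Since each row sums to one and the weights sum to one, the mixed vector is again a valid probability distribution over the classes.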

The product of likelihoods and priors in the expression of the a-posteriori, Eq. (2), has the consequence of penalizing the classes that are less likely to occur. In other words, the classes with low probability, i.e. close to zero, will have even lower posterior values; this effect is intensified as the number of time slices increases. Because the priors are recursively assigned the values of the previous posteriors, we suggest using additive smoothing to prevent the prior values from getting very close to zero.

**3. Inference with DBN**

**Figure 2.** An example of a DBN with *T* + 1 time slices and two nodes {*C*, *X*}.

The problem is formulated by considering *P*(*X*<sup>*t*</sup>, *X*<sup>*t*−1</sup>, …, *X*<sup>*t*−*T*</sup>, *C*<sup>*t*</sup>, *C*<sup>*t*−1</sup>, …, *C*<sup>*t*−*T*</sup>), i.e., the joint distribution of the nodes over the time up to *T*. The goal is to infer the current-time value of the class *C*<sup>*t*</sup> given the data *X*<sup>*t*:*t*−*T*</sup> = {*X*<sup>*t*</sup>, *X*<sup>*t*−1</sup>, …, *X*<sup>*t*−*T*</sup>} and the prior knowledge of the class, which is attained by the a-posteriori probability *P*(*C*<sup>*t*</sup> | *C*<sup>*t*−1:*t*−*T*</sup>, *X*<sup>*t*:*t*−*T*</sup>). The superscript notation denotes the set of values over a time interval: {*t*:*t* − *T*} = {*t*, *t* − 1, *t* − 2, …, *t* − *T*}.

The simplest case is a single time slice, where the posterior reduces to *P*(*C*<sup>*t*</sup> | *X*<sup>*t*</sup>) ∝ *P*(*X*<sup>*t*</sup> | *C*<sup>*t*</sup>)*P*(*C*<sup>*t*</sup>). For two time slices, we have

$$P(C^{t} \mid C^{t-1}, X^{t:t-1}) \propto P(X^{t} \mid X^{t-1}, C^{t:t-1})\, P(X^{t-1} \mid C^{t:t-1})\, P(C^{t} \mid C^{t-1})\, P(C^{t-1}). \tag{1}$$

As the number of time slices increases, the problem of inferring the class becomes more complex; therefore, some assumptions can be made in order to find a tractable solution. As a first assumption, let the nodes be independent of later (subsequent in time) nodes. As a consequence, and taking the example for *T* = 1, the probability *P*(*X*<sup>*t*−1</sup> | *C*<sup>*t*:*t*−1</sup>) = *P*(*X*<sup>*t*−1</sup> | *C*<sup>*t*−1</sup>), i.e., the node *X*<sup>*t*−1</sup> does not depend on the node *C*<sup>*t*</sup>, which comes a time-slice later. The second assumption, a stronger one, is that the feature-vector node *X* is independent across time slices; hence, following the previous example, *P*(*X*<sup>*t*</sup> | *X*<sup>*t*−1</sup>, *C*<sup>*t*:*t*−1</sup>) becomes *P*(*X*<sup>*t*</sup> | *C*<sup>*t*:*t*−1</sup>). Given these two assumptions, we can state the general problem of calculating the posterior probability of a DBN with *T* + 1 time slices by the expression

$$P(C^{t} \mid C^{t-1:t-T}, X^{t:t-T}) = \frac{1}{\beta} \prod_{k=t-T}^{t} P(X^{k} \mid C^{k})\, P(C^{k}) \tag{2}$$

where *β* is the scale (normalization) factor that guarantees the a-posteriori values sum to one. The class-conditional probabilities *P*(*X*<sup>*k*</sup> | *C*<sup>*k*</sup>) come from a supervised classifier or from an ensemble of classifiers as in Ref. [8], while *P*(*C*<sup>*k*</sup>) assumes the value of the previous posterior probability; thus, *P*(*C*<sup>*t*</sup>) ← *posterior*<sup>*t*−1</sup>.
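To make the recursion concrete, here is a minimal sketch of the posterior update of Eq. (2) in Python; the function name and array layout are our own illustration, and the per-slice likelihoods would come from whatever base classifier is used:

```python
import numpy as np

def dbn_posterior(likelihoods, prior):
    """Recursive posterior of Eq. (2): at each time slice, multiply the
    class-conditional likelihood P(X^k | C^k) by the prior P(C^k),
    normalize (the 1/beta factor), and feed the result forward so that
    the next slice's prior is the previous posterior."""
    post = np.asarray(prior, dtype=float)
    for lik in np.atleast_2d(np.asarray(likelihoods, dtype=float)):
        unnorm = lik * post            # likelihood x prior, per class
        post = unnorm / unnorm.sum()   # normalization by beta
    return post
```

With a single slice this reduces to the usual Bayes rule; with several slices, classes whose prior approaches zero are penalized multiplicatively, which is the effect that motivates the additive smoothing discussed in this chapter.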

**Figure 3.** This figure illustrates the DBN, with *T* + 1 time slices, as formulated according to the assumptions presented in Section 3. The product of likelihoods and priors, over the time interval [*t* − *T*, *t*], becomes the posterior probability as expressed in Eq. (2).

Additive smoothing, also called Lidstone smoothing, adds a term (*α*) to the prior distribution and can be expressed as

$$\hat{P}\left(C_{i}\right) = \frac{P\left(C_{i}\right) + \alpha}{1 + \alpha \cdot nc}, \quad i = 1, \ldots, nc \tag{3}$$

where *α* is the additive smoothing factor and *nc* is the number of classes. The influence of *α* on the smoothed prior *P̂*(*Ci*) has to be such that the values of *P̂*(*Ci*) are greater than zero (*P̂*(*Ci*) > 0, ∀ *i*) and, moreover, the smoothed distribution remains consistent (the values of *P̂*(*Ci*) must, of course, sum to one). A practical range is 0 < *α* < 0.1.
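Eq. (3) transcribes directly to code (the function name is illustrative):

```python
import numpy as np

def smooth_prior(prior, alpha=0.01):
    """Lidstone (additive) smoothing of Eq. (3): every class prior is
    lifted by alpha and the result is renormalized by 1 + alpha * nc,
    so the smoothed values stay strictly positive and still sum to one."""
    prior = np.asarray(prior, dtype=float)
    nc = prior.size                    # number of classes
    return (prior + alpha) / (1.0 + alpha * nc)
```

For a degenerate prior such as [1, 0, 0, 0, 0] and *α* = 0.05, the smoothed prior becomes [0.84, 0.04, 0.04, 0.04, 0.04]: every class stays reachable, yet the distribution keeps most of its original shape.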

**Figure 4** provides an example of the impact of *α* on a given prior, with values of *α* equal to {0, 0.01, 0.05 and 0.1}. As the value of *α* increases, the prior distribution tends to lose its initial definiteness due to the uniform 'bias' introduced by *α*. In the example shown in **Figure 4**, we have considered a five-class case (*nc* = 5).

**Figure 4.** An example of the influence of the additive factor (*α*) on a given *P*(*Ci* ), *i* = 1, …, 5.

### **4. Experiments on classification: mobile robotics case studies**

In order to demonstrate the use of the DBN as formulated above, we will consider two classification problems that find applications in mobile robotics: semantic place recognition [6] and activity classification [8].

### **4.1. Semantic place recognition**

**Figure 5** illustrates a probabilistic system for semantic place recognition where data comes from a laser scanner sensor. In a practical application, the sensor is mounted on-board a mobile robot [6, 7]. Based on **Figure 5**, we can make a direct correspondence with the DBN discussed above by verifying that the feature vector is *X*, the probabilistic classifier outputs the class-conditional probability *P*(*X*|*C* ) and the priors transmit the time-based information through the network.

**Figure 5.** Illustration of a time-dependent probabilistic system applied to semantic place recognition. In this system, data are obtained from a laser scanner.

As an example of the DBN application in semantic place classification, let us report some results from Ref. [6], where a DBN was applied to the Image Database for Robot Localization (IDOL) dataset, available at http://www.cas.kth.se/IDOL/. In this context, the problem of semantic place classification can be stated as follows: 'given a set of features, calculated on data from laser scanner sensors (installed on-board a mobile robot), determine the semantic robot location ('corridor', 'room', 'office', etc.) by using a classification method'. The experiments in Ref. [6] use a mixture of classifiers to model the class-conditional probability in the DBN; such an approach is called DBMM [8].

**Figure 6** shows recognition results in a sequence of nine frames from the IDOL dataset, where the first row depicts images of indoor places as captured by a camera mounted on-board a mobile robot. The second row provides classification results without time slices (i.e. time-based prior probabilities are not incorporated into the DBN), and the subsequent rows show classification probabilities for a DBN with time slices up to three. In the figure, the vertical line (in red) indicates the transition between classes: from the class 'kitchen' (KT) to the class 'corridor' (CR).

### **4.2. Activity classification**


In the case of the activity classification problem described here, the objective is to classify a person's daily activities based on spatiotemporal skeleton-based features. In such a case, mobile robots equipped with appropriate cameras can use these classification models to improve the quality of life of, for example, elderly people, by assisting them in their daily lives or by detecting anomalous situations. Like the semantic place recognition problem, activity classification can also be seen as a time-dependent probabilistic system, where the feature vector *X* comprises the skeleton-based features. From Ref. [8], we report some results on activity classification.

**Figure 6.** Classification results on a five-class semantic place recognition problem, extracted from Ref. [6], using a DBN with mixture models of three classifiers (DBMM [6, 8]).

**Figure 7** shows an activity classification framework based on Ref. [8], which uses a DBN with mixture models (the DBMM approach, as previously described for the semantic place classification problem). Data are acquired with an RGB-D sensor, followed by a skeleton detection step and a feature extraction process, the latter based on geometrical features. During the training stage, global weights are computed using an uncertainty measure (e.g. entropy) as a confidence level for each base classifier, based on its performance on the training set. During testing, given the input data (i.e. skeleton features for the current activity), the base classifiers are merged as mixture models with time slices (using the classification at the previous time instant) to reinforce the current classification.
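The weighting idea can be sketched as follows; the exact weighting rule of Ref. [8] is not reproduced here, so the normalized inverse-entropy scheme below is an illustrative assumption:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (uncertainty measure)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def dbmm_global_weights(train_outputs):
    """One global weight per base classifier: lower mean entropy on the
    training-set outputs means higher confidence, hence a larger weight."""
    mean_h = np.array([np.mean([entropy(p) for p in outs])
                       for outs in train_outputs])
    w = 1.0 / (mean_h + 1e-12)         # guard against zero entropy
    return w / w.sum()

def dbmm_mixture(test_outputs, weights):
    """Merge the base-classifier outputs as a weighted mixture."""
    return sum(w * np.asarray(p, dtype=float)
               for w, p in zip(weights, test_outputs))
```

A base classifier that is consistently confident on the training set receives a larger weight, so its vote dominates the mixture at test time.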

**Figure 7.** Illustration of a time-dependent probabilistic system applied to activity classification. In this system, data are obtained from an RGB-D camera, which provides the spatiotemporal skeleton-based features.

The well-known Cornell Activity Dataset (CAD-60) [9, 17] was used to evaluate the proposed framework in Refs. [8, 18]. The CAD-60 dataset comprises video sequences and skeleton data of human daily activities acquired with an RGB-D sensor. There are 12 daily activities performed by four different subjects (two male and two female, one of them left-handed), grouped into five different environments: office, kitchen, bedroom, bathroom and living room. Additionally, the CAD-60 dataset has two more activities (random movements and still), which are used for classification assessment on test sets in order to evaluate the precision and generalization capacity of the approaches, since these activities encompass movements similar to some of the other activities. We adopted the same strategy described in Ref. [17], so we present the classification results in terms of precision (Prec) and recall (Rec) for each scenario. The evaluation was carried out using leave-one-out cross-validation. The idea is to verify the generalization capacity of the classifier by using the 'new person' strategy, i.e. learning from different persons and testing on an unseen person. Classification is made frame-by-frame to account for the accuracy of the frames correctly classified.
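The 'new person' protocol amounts to leave-one-subject-out cross-validation; a sketch (the subject labels are placeholders):

```python
# CAD-60 has four subjects; each fold trains on three of them and
# tests on the held-out, unseen person.
subjects = ["S1", "S2", "S3", "S4"]   # placeholder labels
folds = [([s for s in subjects if s != held_out], held_out)
         for held_out in subjects]
# Each element of `folds` is (training subjects, test subject).
```

This yields four folds, and the reported precision/recall are computed frame-by-frame on the held-out subject of each fold.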


Results show that the DBMM approach obtained better classification performance than the other state-of-the-art methods presented in the ranked table of Ref. [17]. The overall results were precision: 94.83%; recall: 94.74%; and accuracy: 94.74%. **Figure 8** presents the classification performance (i.e. precision and recall) for the 'new person' tested in each scenario. For comparison purposes, **Table 1** summarizes the accuracy of state-of-the-art single classifiers and a simple averaged ensemble against the proposed DBMM for the bedroom (the scenario with the most misclassifications), showing that our approach outperforms the other classifiers. The classification performance in terms of overall accuracy, precision and recall has shown that our proposed framework outperforms state-of-the-art methods that use the same datasets [17].

In this section, we have shown the DBMM [8, 18] performance using an offline dataset. Additionally, further tests using a mobile platform with an RGB-D sensor on-board, running on-the-fly in an assisted-living context, were also successfully validated with accuracy above 90%, as reported in Ref. [18]. More details about the DBMM using a mobile robot for activity recognition, and a video showing the classification performance, can be found in Ref. [18].

**Figure 8.** Performance on the CAD-60 ('new person'). Results are reported in terms of precision (Prec) and recall (Rec) and an average (AV) per scenario. Overall AV: precision: 94.83%; recall: 94.74%. Activities in (a): Act1—rinsing water; Act2—brushing teeth; Act3—wearing lens; Act4—random + still. Activities in (b): Act1—talking on phone; Act2—drinking water; Act3—opening container; Act4—random + still. Activities in (c): Act1—talking on phone; Act2—drinking water; Act3—talking on couch; Act4—relaxing on couch; Act5—random + still. Activities in (d): Act1—drinking water; Act2—cooking chopping; Act3—cooking stirring; Act4—opening container; Act5—random + still. Activities in (e): Act1—talking on phone; Act2—writing on whiteboard; Act3—drinking water; Act4—working on computer; Act5—random + still.

| Location | Activity | Bayes | ANN | SVM | AV | DBMM |
|----------|----------|--------|--------|--------|--------|--------|
| Bedroom | 1 | 79.90% | 74.70% | 74.90% | 76.50% | 84.10% |
| | 2 | 72.70% | 76.60% | 81.40% | 76.90% | 86.40% |
| | 3 | 79.60% | 91.10% | 93.10% | 87.90% | 98.30% |
| | 4 | 65.70% | 93.50% | 92.60% | 83.90% | 97.40% |
| | **Average** | **74.48%** | **83.98%** | **85.50%** | **81.30%** | **91.55%** |

Activity: 1—talking on phone, 2—drinking water, 3—opening container, 4—random + still.

**Table 1.** Results in terms of accuracy on the bedroom scenario of the CAD-60 dataset ('new person') using single classifiers, a simple averaged ensemble (AV) and the DBMM.

### **5. Conclusion**

In this chapter, the authors have presented a DBN formulation for the classification of time-dependent problems, together with experimental results on two mobile robotics applications: the first regarding semantic place classification and the second activity classification. In both formulations, the DBN was used as the basis to compose the DBMM [6, 8, 18], a more complex structure used to handle more complex scenarios. In both applications, the DBMM has proven to be a powerful choice for modelling time-dependent scenarios.

When it comes to semantic place classification, the model could detect class transitions during robot navigation, thanks to the use of several time slices (i.e. more than 2) and the additive smoothing used in the model. In the case of activity recognition, since the activities in the dataset do not have class transitions, i.e. only one activity is performed during a task, a simple version of the DBMM using only one time slice is enough to correctly classify all activities. For real-time applications using a mobile robot, and in accordance with the experimental results reported in Ref. [6], it is suggested to use more than two time slices in the model.

### **Author details**

Cristiano Premebida<sup>1</sup> \*, Francisco A. A. Souza<sup>1</sup> and Diego R. Faria<sup>2</sup>


### **References**


[4] Murphy KP. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Dissertation. University of California, Berkeley; 2002

[5] Mihajlovic V, Petkovic M. Dynamic Bayesian Networks: A State of the Art. Technical Report, Computer Science Department, University of Twente, Netherlands; 2001

[6] Premebida C, Faria D, Nunes U. Dynamic Bayesian network for semantic place classification in mobile robotics. Autonomous Robots (AURO), Springer; 2016

[7] Rottmann A, Mozos OM, Stachniss C, Burgard W. Semantic place classification of indoor environments with mobile robots using boosting. In: Proceedings of the 20th National Conference on Artificial Intelligence (AAAI'05); 9-13 July 2005; Pittsburgh, Pennsylvania. AAAI Press; 2005

[8] Faria DR, Premebida C, Nunes U. A probabilistic approach for human everyday activities recognition using body motion from RGB-D images. In: Proceedings of the IEEE RO-MAN'14: International Symposium on Robot and Human Interactive Communication; 25-29 August 2014; Edinburgh, UK. IEEE; 2014

[9] Sung J, Ponce C, Selman B, Saxena A. Unstructured human activity detection from RGBD images. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); May 2012; Saint Paul, MN, USA. IEEE; pp. 842-849

[10] Thrun S, Burgard W, Fox D. Probabilistic Robotics. MIT Press; 2005

[11] Li T, Prieto J, Corchado JM, Bajo J. On the use and misuse of Bayesian filters. In: Proceedings of the IEEE 18th International Conference on Information Fusion (Fusion); 6-9 July 2015; Washington, DC, USA. IEEE; 2015

[12] Chen Z. Bayesian filtering: From Kalman filters to particle filters and beyond. Statistics. 2003;**182**(1):1-69

[13] Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006

[14] Duda RO, Hart PE, Stork DG. Pattern Classification. John Wiley & Sons; New Jersey, NJ, USA

[15] Neapolitan RE. Learning Bayesian Networks. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.; 2003

[16] Russell S, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Prentice Hall; New Jersey, NJ, USA; 2010

[17] Cornell Activity Datasets CAD-60 [Internet]. Available from: http://pr.cs.cornell.edu/humanactivities/data.php [Accessed: January 2017]

[18] Faria DR, Vieira M, Premebida C, Nunes U. Probabilistic human daily activity recognition towards robot-assisted living. In: Proceedings of the IEEE RO-MAN'15: IEEE International Symposium on Robot and Human Interactive Communication; September 2015; Kobe, Japan. IEEE; 2015


### **A Bayesian Model for Investment Decisions in Early Ventures**

Anamaria Berea and Daniel Maxwell

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70051

#### Abstract

In this research, we present a Bayesian model to aid investment decisions in early stage start-ups and ventures. The model addresses both the venture capital and the angel investing markets, and it is informed both by the academic literature on entrepreneurship and by venture capital investment practices. The model was validated through an anonymized experiment in which reviewers with previous experience in entrepreneurship, investment, or both scored a list of 20 anonymous real companies for which we knew the outcome a priori. The experiment revealed that the model and the online scoring platform that we built achieve an accuracy of 83% in identifying companies that would later fail and where the investments would be lost. The model also performs fairly well in identifying companies where investors would not lose their money but would either have to wait a very long time for their returns or would not receive a large return on investment (ROI). We also show that the model performs modestly in identifying "big exit" companies, i.e., companies where investors would receive a high ROI within a fairly short time.

Keywords: Bayesian networks, investment, start-up, entrepreneurship, decision models

### 1. Introduction

One of the biggest challenges facing early stage investors is a lack of actionable data and effective analytics. Most investment decisions are made based on the instinct (heuristics) of the investor, who may or may not have experience in the sector, and such decisions are often inherently biased. The investment environment is increasingly complex, and investors cannot process all of the factors that are critical to the success of a potential investment and make a well-informed decision. Research suggests that well-built analytic models make better decisions than human experts across virtually every field [1].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Some of the newest data on the returns on angel investment show that these are about 2.5 times the value of the initial investment and the average period of recovery of investment is 3.6 years [2].

In general, there is little literature on automated techniques or models for investment decision-making. A very recently published paper shows an interesting risk analysis model that would reduce the risk of investing in early entrepreneurs [3]. This research takes a similar approach—reducing "bad" investment decisions—but it uses a different technique, based on a Bayesian network model, which performs well in identifying the future failures of new ventures.

While there is understandably little academic literature on forecasting future start-up success and its relationship to investment decision-making, due to the confidentiality of the data in this business, decision-making practice in the venture capital and angel investment industries relies heavily on the experience of the investors and on the "collective" thinking of the investors who gather together to rate or assess the pitches or business proposals for various funding rounds of investment.

Therefore, this chapter presents a model for investment decision-making that is informed mainly by practitioners and is intended to be applied in investment practice. Its aim is to be a tool that makes the process of rating seed and start-up ventures more informative and transparent, both for investors and for entrepreneurs.

The model built for this research is mainly informed by interviews and discussions conducted with investors during the summer of 2014. The nodes of the model and the dependencies between them have been created based on these interviews, while the distributions of the prior probabilities have been informed by the academic literature where such information could be found; otherwise, they are normal.

This research describes the model in general terms, how it has been implemented in practice, and the results of two experiments that were run to validate its forecasting accuracy. The construction, implementation, and validation of the model, as well as a discussion of the findings, are presented in the sections below.

The rest of this chapter is structured as follows: Section 2 describes the model and the rationale behind building it; Section 3 describes the experiments that were conducted using this model, mainly with the purpose of validating its accuracy; Section 4 presents the results from the experiments and an analysis of the accuracy of the model; and Section 5 summarizes succinctly the conclusions of this research.

### 2. The Bayesian investment decision model

We used Bayesian networks modeling to build a probabilistic assessment model of early stage companies or ventures. We based our selection of nodes/factors on a series of interviews and working closely with practitioners in venture capital funding. We afterwards implemented this model on an online platform, available at www.exogenius.net (see Figure 1).

Figure 1. The Bayesian model for investment decision.


The Bayesian model scores on a scale of [0, 100] the potential performance of a company/start-up by identifying three key measures: business execution, value proposition, and exit potential (see Figure 2). These measures are aggregated (nonlinearly) into an overall score of performance. Each of these three important measures scores the future potential of a project or start-up in regard to their proposition (which may be a technological innovation, a social value, or any business value that the entrepreneur presents as the core proposition), their ability to sustain, carry out, and fulfill their proposition (business execution) and the potential of this new venture to exit (either through IPO, buy-out, or in any manner that would be satisfactory for the investor).

Each of these three measures is a child of five subnetworks in the model, each represented by more granular parent-child nodes. These five subnetworks are business/entrepreneurship factors or indicators that measure the new venture on the following aspects of the business proposal: technical difficulty, uniqueness of innovation, readiness for market, customer engagement, team performance, entrepreneurial and managerial experience, founders and incorporation of the company, and many more. Each of the granular nodes in the model has three to five states, informed either by evidence from the published literature (as described below) or otherwise by uniform priors [4].

The conditional tables of each node have been readjusted after sensitivity analysis was performed, based on data and facts previously published in the entrepreneurship and high-growth companies literature [5–7].

Figure 2. Example of one of five subnetworks of the model—the technology offering is represented by three granular nodes.

For example, the states of the technology (marginal versus breakthrough) node are defined according to the literature on entrepreneurship [7–9]; the number of founders is also determined based on these prior findings, i.e., the state of 2–4 founders has the highest positive impact on the final score, while the other states have low impact or negative impact (more than 5 founders lower the chances of success significantly) [5, 6].
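As a purely illustrative fragment (the state names and all probabilities below are our own assumptions, not the chapter's actual conditional tables), a 'founders' evidence node feeding a binary team-quality child could be enumerated as:

```python
# Hypothetical two-node fragment of a discrete Bayesian network.
# State names and probabilities are illustrative assumptions only.
founders_states = ["1", "2-4", "5+"]
prior = {"1": 0.3, "2-4": 0.5, "5+": 0.2}          # prior over founders

# Illustrative conditional table P(team_quality = high | founders),
# reflecting the finding that 2-4 founders has the highest impact.
p_high_given = {"1": 0.3, "2-4": 0.7, "5+": 0.2}

# Marginal P(team_quality = high) by enumeration over parent states.
p_high = sum(prior[s] * p_high_given[s] for s in founders_states)
```

Exact inference by enumeration like this is what tools such as UnBBayes or GeNIe perform automatically over the full network once the conditional tables are specified.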

The nodes representing team complementarity, coordination, and learning are based on the findings of the Startup Genome Project, which was run at Berkeley and Stanford Universities [5, 10, 11]. In other words, since the findings show that team complementarity and learning are critically important for the success of early ventures, the team node in the model reflects these findings through the distribution of the priors in its states.

Similarly, the nodes that are assessing the infrastructure of the start-up (broadly construed as not only physical requirements to develop the proposed technology, but also legislative, financial, or logistic infrastructure), are informed by the currently published probabilistic values in previous studies on organizational emergence [12].

The placement of the new venture in the current market is also assessed, based on the projected growth of the company relative to the projected growth of the market or industry [7, 12].

For the development of the model, we used both UnBBayes [13] and GeNIe/SMILE [14], open-source software packages dedicated to Bayesian modeling. After the model was built, tested, and developed, it was migrated to an online platform with an easy-to-use interface, where we ran our experiments.

The implementation of the model on an online platform facilitated experimentation on forecasting accuracy. The nodes of the model that receive new evidence, specific to each venture, are represented as a series of 23 questions in a user-friendly interface. For example, the evidence node in the model that represents the uniqueness of the offering became the question "How unique is the proposed offering (idea/innovation/technology/product/service)?" on the online platform. The nodes that were not evidence nodes in the model were not represented as questions in the online implementation. The reviewers/users can see the progression of the three key scores (value proposition, business execution, and exit potential), as well as the final score, as they answer the individual assessment questions.
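The progressive re-scoring behavior described above can be sketched with a toy Bayesian update: each answered question fixes an evidence variable, and the posterior over a score node is recomputed. The node names, answers, and probability tables below are illustrative placeholders, not the chapter's actual conditional tables.

```python
# Prior over a single hypothetical score node S.
prior = {"low": 0.4, "medium": 0.4, "high": 0.2}

# Illustrative P(answer | S) tables for two hypothetical questions;
# for each score state, the answer probabilities sum to 1.
likelihoods = {
    "uniqueness": {
        "very_unique": {"low": 0.1, "medium": 0.3, "high": 0.6},
        "not_unique": {"low": 0.9, "medium": 0.7, "high": 0.4},
    },
    "team_experience": {
        "experienced": {"low": 0.2, "medium": 0.4, "high": 0.6},
        "novice": {"low": 0.8, "medium": 0.6, "high": 0.4},
    },
}

def posterior(prior, answers):
    """P(S | answers) proportional to P(S) * prod_i P(answer_i | S)."""
    unnorm = {}
    for s, p in prior.items():
        for q, a in answers.items():
            p *= likelihoods[q][a][s]
        unnorm[s] = p
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# The reviewer sees the score distribution shift after each answer.
answers = {}
for q, a in [("uniqueness", "very_unique"), ("team_experience", "experienced")]:
    answers[q] = a
    print(q, "->", posterior(prior, answers))
```

In the actual platform, the inference runs over the full network rather than a single node, but the question-by-question updating follows the same principle.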

### 3. The experimental design for model validation


In order to validate the accuracy of the model scores, an anonymized experiment was designed in which 20 case studies were recreated from real, historical companies. These case studies captured the state of funding and potential of the companies while they were start-ups, before their first or second seed funding, and the aim of the experiment was to show whether the exit or overall scores of the model align statistically with what happened in real life.

For the experiment, we randomly picked 20 historical cases for which we know the ground truth about their financial history (how they started, how much their initial funding was, and how much their exit was), using publicly available information from the CrunchBase website, Wikipedia, and various failed start-up and postmortem case studies. The companies in the sample had either high exits (were bought for more than \$500 million), medium exits (were bought for \$100–1000K, or took a very long time to exit, i.e., 20 years), or no exits (they shut down or went bankrupt soon after their launch).

Each of these 20 case studies was recreated as an anonymous business proposal, given the information available at the time the company was seeking initial funding (e.g., 2010). Each anonymized case study therefore included the following information: the year to which the reviewer had to "travel back in time" (e.g., 2010), with a hyperlink to the most important published business and technological events of that year (e.g., from The Economist); the company location; the number of founders; the type of incorporation; anonymized information about the founders' experience; information about the market and industry at that time; information about the customers, the team, and the infrastructure; information about the financial past of the company, if any; and, most importantly, information about the product or technology without disclosing its brand name. The reviewers were also free to look for additional information on the web regarding the state of technology and business at that particular time in the past. The oldest case study was placed in 1999 and the newest one in 2014.

In other words, all the available information about a company prior to the time of its initial funding request was included, as long as it could be anonymized.

We conducted two experiments: one with experts in business or investing, and the other with MBA students at the University of Maryland.

The first experiment was carried out by 24 volunteer reviewers, who each reviewed five of these anonymous case studies by answering the questions on the online platform that fronts our model. The reviewers in this experiment are experienced entrepreneurs or investors; in effect, a panel of experts completed the experiment.

The second experiment was carried out by MBA students at the University of Maryland in a one-hour session. The students were also randomly assigned five case studies each and answered the same questions on the online platform as the experts did.

### 4. Results and accuracy analysis

The first experiment started on March 22, 2016, and by April 13, 2016, 54% of the reviewers had completed their reviews. We collected 68 (reviews) × 4 (scores) = 272 data points. The second experiment was carried out during a single day in October 2016.

Figure 3. The distribution of overall reviewing scores in the expert experiment. This figure shows the scores on a scale of 0–100 that were given by the professional reviewers (investors and entrepreneurs) in the overall rating for the companies in each of the three groups—high exits, medium exits, and no exits. The distributions of the reviewers' scores show that low exits were scored between 0 and 40 with most scores around a value of 20; medium exits were scored between 0 and 80 with most scores around 40; and high exits were scored either around 20 or around 60.

A reviewer provides the observations for the evidence nodes/questions in the model. The model then provides a distribution over all scores as output, conditional on these observations. Thus, the Bayesian model here is a three-layer model in which the metrics are at the top level of the network and the observations (market evaluation, team evaluation, etc.) are at the bottom layer of granular nodes.

Both the measures in the model and the observations are discrete.
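The three-layer flow of evidence can be sketched by marginalizing out an intermediate subnetwork node: an observation fixes a bottom-layer node, and the top-level metric receives a posterior. The state names and probability tables below are illustrative placeholders, not the model's actual values.

```python
# P(subnetwork state | an observed answer) -- the middle layer, already
# conditioned on a bottom-layer observation (illustrative numbers).
p_mid_given_obs = {"weak": 0.3, "strong": 0.7}

# P(metric state | subnetwork state) -- the top layer (illustrative numbers).
p_metric_given_mid = {
    "weak":   {"low": 0.7, "high": 0.3},
    "strong": {"low": 0.2, "high": 0.8},
}

# Marginalize the intermediate layer:
# P(metric | obs) = sum over mid of P(metric | mid) * P(mid | obs)
p_metric = {
    m: sum(p_metric_given_mid[mid][m] * p_mid_given_obs[mid]
           for mid in p_mid_given_obs)
    for m in ("low", "high")
}
print({m: round(p, 2) for m, p in p_metric.items()})  # {'low': 0.35, 'high': 0.65}
```

In the full model, this marginalization is carried out by the inference engine (UnBBayes or GeNIe/SMILE) over all five subnetworks at once.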


The data from the anonymized experiments were matched back to the ground-truth data from the real case studies, and the experimental results were compared with the evidence across the three groups of companies (high exits, medium exits, no exits). The distributions of the exit scores and the overall scores from the experiment for each of these groups are plotted in Figures 3–7.

Figure 4. The distribution of overall reviewing scores in the MBA students experiment. Similar to the plot above, this figure shows the scores on a scale of 0–100 that were given by the University of Maryland students in the overall rating for the companies in each of the three groups—high exits, medium exits, and no exits. The distributions of the reviewers' scores show that low exits were scored between 0 and 40 with most scores at values below 20; medium exits were scored between 0 and 80 with most scores around either 20 or 40; and high exits were scored between 20 and 60.

Figure 5. The distribution of exit reviewing scores in the expert experiment. This figure is similar to Figure 3, except that these are the scores of the professional reviewers for the exit node and not the overall score. The low exits were scored mainly with values close to 0, the medium exits with scores between 10 and 60, and the high-exit scores were very close to a uniform distribution.

Figure 6. The distribution of exit reviewing scores in the MBA students experiment. As above, this figure shows the distribution of the exit scores for the student reviewers. The scores of the low-exit companies were close to zero, those of the medium exits around 30, and those of the high exits exhibit a much larger range, from 0 to 100.

Figure 7. The overall accuracy of the Bayesian model in the expert panel experiment.


Figure 8. The overall accuracy of the Bayesian model in both experiments.

We can observe from these distributions that the "no exits" or "failures" scored low in both experiments, that the medium exits had medium scores in both experiments, and that the high exits received low, medium, and high scores in both experiments, whether we look at the final overall score or only at the key intermediate exit score (see Figure 8).


| | Failed companies | Medium-exit companies | High-exit companies |
|---|---|---|---|
| Experiment mean scores | 0.20 | 0.31 | 0.42 |
| Experiment median scores | 0.16 | 0.28 | 0.46 |
| Accuracy | 0.83 | 0.77 | 0.41 |

Table 1. A summary of the model accuracy based on the experimental results.

In other words, there is consistency between the two groups of reviewers with respect to each of the three groups of companies. Moreover, there is consistency between the reviewers' responses and the ground-truth data with respect to low-exit and medium-exit companies, but less so for high-exit companies. Consequently, the model can be used to identify failures or low exits, but less so to identify high exits; it is therefore best suited to pruning out "bad" proposals from a pool of varied investment opportunities.

Between the two experiments, we can also observe that the experts are still slightly better than MBA students at identifying low and medium exits.

The responses from the experiment for the "no exits" had a mean exit score of 20%, a median exit score of 16%, and a mean and median overall score of 27%, with a standard deviation of 16–17%. This means that the companies that failed in real life were reviewed with scores in the range of 16–27% in our model.

The medium-exit experimental data had a mean exit score of 31%, a median of 28%, and an overall mean and median of 34 and 36%, respectively, with standard deviations of 20 and 17%. This means that the companies that had medium exits (either low in capital value or very long to exit) scored around 28–36% in our model.

The high exits had a mean and median exit score of 42%, an overall mean and median of 46%, and standard deviations of 28 and 25%, respectively. This means that companies that were bought for more than \$500 million in real life scored around 42–46% in our model (see Table 1).

The accuracy of the model was analyzed using simple quantitative forecasting analysis. Specifically, the mean absolute deviation was used as the metric for the forecasting error. A resolution value of 1 was assigned to the companies with high exits, 0.5 to the medium exits, and 0 to the failed (no-exit) companies. The mean absolute deviation between these resolutions and the actual probabilities given by the reviewers was then calculated. Based on this calculation, the overall accuracy of the model is 75%; the accuracy for the no exits is 83%, and the accuracy for the medium and high exits is 77 and 41%, respectively (see Table 1).
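The accuracy calculation just described can be reproduced directly: accuracy is one minus the mean absolute deviation between a group's resolution value (1, 0.5, or 0) and the reviewers' probabilities. The score lists below are hypothetical stand-ins, not the experiment's raw data; only the formula matches the chapter.

```python
def mad_accuracy(resolution, probabilities):
    """Accuracy = 1 - mean absolute deviation between the group's
    resolution value and the reviewers' probabilities."""
    mad = sum(abs(resolution - p) for p in probabilities) / len(probabilities)
    return 1.0 - mad

# Hypothetical reviewer probabilities for each group of companies.
no_exit_scores = [0.20, 0.16, 0.10, 0.22]
medium_scores = [0.31, 0.28, 0.45, 0.60]
high_scores = [0.42, 0.46, 0.20, 0.70]

print(round(mad_accuracy(0.0, no_exit_scores), 2))  # 0.83 for these sample values
print(round(mad_accuracy(0.5, medium_scores), 2))
print(round(mad_accuracy(1.0, high_scores), 2))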

### 5. Conclusions

In this research, a probabilistic model that assesses the exit potential and overall performance of new ventures (start-ups) is presented, from its construction based on practice and published statistical data to its implementation on a readily available online platform that can be used by entrepreneurs and investors alike. The model is designed to quantitatively assess the potential of businesses while they are still at the very initial stages. It is well informed by facts established in the academic literature on entrepreneurship and high-growth companies, and further informed in detail by venture capital experience and practices, through close collaboration with venture capitalists during the development phase of the model.

The model was validated using two anonymized experiments, with experts in the field and with MBA students, and is currently being translated into a commercial product. The results of these experiments and the details of the model are presented in this chapter both as a validation method and as a viable metric or indicator that can detect future failures and "bad investments" ahead of time. The model can thus also be used by entrepreneurs to self-assess and identify points of weakness in their proposals and current seed ventures. This research therefore presents an investment decision tool that can easily be automated and scaled up for use by any potential investor, whether angel or venture, and by any entrepreneur.

At the same time, these research efforts are also a good pathway toward more transparency in the investment road map.

### Acknowledgements


The authors would like to thank Marco Rubin for his professional expertise, and Professor David Kirsch and his MBA students at the Smith School of Business for their help with conducting the experiments and their very useful comments.

### Author details

Anamaria Berea<sup>1</sup> \* and Daniel Maxwell<sup>2</sup>

\*Address all correspondence to: aberea@rhsmith.umd.edu


### References


[4] Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann; 1988

[5] Berea A. Essays in high-impact companies and high-impact entrepreneurship [thesis]. George Mason University; 2012

[6] Acs ZJ. Foundations of High Impact Entrepreneurship. Boston, MA; 2008

[7] Shane SA. The Illusions of Entrepreneurship: The Costly Myths that Entrepreneurs, Investors and Policy Makers Live By. Yale University Press; 2008

[8] Arthur BW. The Nature of Technology. Free Press; 2009

[9] Auerswald P, Kauffman S, Lobo J, Shell K. The production recipes approach to modeling technological innovation: An application to learning by doing. CAE Working Paper. 1998:98–10

[10] Marmer M, Herrmann BL, Dogrultan E, Berman R. Startup Genome Report. Technical report. Berkeley University and Stanford University; 2011

[11] Bottazzi G, Cefis E, Dosi G. Corporate growth and industrial structures: Some evidence from the Italian manufacturing industry. Industrial and Corporate Change. 2002;11(4):705–723

[12] Woolley JL. Studying the emergence of new organizations: Entrepreneurship research design. New Perspectives on Entrepreneurship Research. 2011;1(1)

[13] Carvalho RN, Onishi MS, Ladeira M. Development of the Java version of the UnBBayes framework for probabilistic reasoning. In: Congresso de Iniciação Científica da UnB. Brasília, DF, Brazil: University of Brasília; 2002

[14] GeNIe and SMILE, software developed at the Decision Systems Laboratory, School of Information Sciences, University of Pittsburgh

### **Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete Information**

Claudia Ceci and Katia Colaneri

DOI: 10.5772/intechopen.70060

Additional information is available at the end of the chapter

### Abstract

In this chapter, we present some recent results on nonlinear filtering for a jump-diffusion signal and an observation driven by correlated Brownian motions having common jump times. We provide the Kushner-Stratonovich and the Zakai equations for the normalized and the unnormalized filter, respectively. Moreover, we give conditions under which pathwise uniqueness holds for the solutions of both equations. Finally, we study an application of nonlinear filtering to the financial problem of derivatives hedging in an incomplete market with partial observation. Precisely, we consider the risk-minimizing hedging approach. In this framework, we compute the optimal hedging strategy for an informed investor and a partially informed one and compare the total expected squared costs of the strategies.

Keywords: nonlinear filtering, jump diffusions, risk minimization, Galtchouk-Kunita-Watanabe decomposition, partial information

### 1. Introduction

Bayesian inference and stochastic filtering are closely related, since in both approaches one wants to estimate quantities that are not directly observable. However, while in Bayesian inference all uncertainty sources are treated as random variables, stochastic filtering deals with stochastic processes. It also covers many situations, from the linear to the nonlinear case, with various types of noise.

The objective of this chapter is to present nonlinear filtering results for Markovian partially observable systems where the state and the observation processes are described by jump diffusions with correlated Brownian motions and common jump times. We also aim at applying this theory to the financial problem of derivatives hedging for a trader who has limited information on the market.

© The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A filtering model is characterized by a signal process, denoted by $X$, which cannot be observed directly, and an observation process, denoted by $Y$, whose dynamics depends on $X$. The natural filtration of $Y$, $\mathbb{F}^Y = \{\mathcal{F}^Y_t,\ t \in [0, T]\}$, represents the available information. The goal of solving a filtering problem is to determine the best estimate of the signal $X_t$ from the knowledge of $\mathcal{F}^Y_t$. As in optimal Bayesian filtering, we seek the best estimate of the signal according to the minimum mean-squared error criterion, which corresponds to computing the posterior distribution of $X_t$ given the available observations up to time $t$.

Historically, the first example of a continuous-time filtering problem is the well-known Kalman-Bucy filter, which concerns the case where $Y$ gives the observation of $X$ in additive Gaussian noise and both processes $X$ and $Y$ are modeled by linear stochastic differential equations. In this case, one ends up with a filter having a finite-dimensional realization. Since then, the problem has been extended in many directions. To start, a number of authors, including Refs. [1–3], studied the nonlinear case in the setting of additive Gaussian noise. Other references in a similar framework are given, for instance, by Refs. [4–8]. Subsequently, the case of counting process or marked point process observations was also considered (see Refs. [9–14] and references therein). More recent literature covers the case of mixed-type observations (marked point processes and diffusions or jump-diffusion processes); see, for example, Refs. [15–18].
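As a hedged illustration of the finite-dimensional linear-Gaussian case, the following sketch implements the discrete-time analogue of the Kalman-Bucy filter for a scalar model $x_k = a x_{k-1} + w_k$, $y_k = x_k + v_k$; all parameter values are illustrative, and the continuous-time filter in the chapter is the limit of this recursion.

```python
import random

def kalman_step(m, P, y, a=0.9, q=0.1, r=0.5):
    """One predict/update cycle of the scalar discrete-time Kalman filter.
    m, P are the posterior mean and variance; q, r the process and
    observation noise variances."""
    # Predict through the linear dynamics
    m_pred = a * m
    P_pred = a * a * P + q
    # Update with observation y; the gain K minimizes mean-squared error
    K = P_pred / (P_pred + r)
    m_new = m_pred + K * (y - m_pred)
    P_new = (1 - K) * P_pred
    return m_new, P_new

random.seed(0)
x, m, P = 1.0, 0.0, 1.0
for _ in range(50):
    x = 0.9 * x + random.gauss(0, 0.1 ** 0.5)   # simulate the hidden signal
    y = x + random.gauss(0, 0.5 ** 0.5)          # noisy observation
    m, P = kalman_step(m, P, y)
print(round(m, 3), round(P, 3))  # posterior mean and variance after 50 steps
```

Note that the variance recursion does not depend on the data, which is precisely why the filter stays finite-dimensional here.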

There are two major approaches to nonlinear filtering problems: the innovations method and the reference probability method. The latter is usually employed when it is possible to find an equivalent probability measure that makes the state $X$ and the observations $Y$ independent. This technique may become problematic when, for instance, signal and observation are correlated and present common jump times. Therefore, in this chapter, we use the innovations approach, which allows circumventing the technical issues arising in the reference probability method. By characterizing the innovation process and applying a martingale representation theorem, we can derive the dynamics of the filter as the solution of the Kushner-Stratonovich equation, which is a nonlinear stochastic partial integro-differential equation. By considering the unnormalized version of the filter, it is possible to simplify this equation and make it at least linear. The resulting equation is called the Zakai equation, and due to its linear nature, it is of particular interest in many applications. We also compute the dynamics of the unnormalized filter, and we investigate pathwise uniqueness for the solutions of both equations. Normalized and unnormalized filters are probability measure and finite measure-valued processes, respectively, and therefore in general infinite-dimensional. For this reason, various recursive algorithms have been developed to address this intractability, such as the extended Kalman filter, statistical linearization, and particle filters. These algorithms aim to estimate both state and parameters. For parameter estimation, we also mention the expectation maximization (EM) algorithm, which enables parameter estimation in models with incomplete data; see, for example, Ref. [19].
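Among the approximation schemes just mentioned, a bootstrap particle filter replaces the (generally infinite-dimensional) filter with a weighted sample. A minimal sketch for the same kind of scalar model, with illustrative parameters, might look as follows.

```python
import math
import random

def bootstrap_pf_step(particles, y, a=0.9, q=0.1, r=0.5):
    """Propagate particles through the signal dynamics, weight them by
    the observation likelihood, and resample (multinomial)."""
    # Propagate: x_k = a x_{k-1} + w_k
    particles = [a * x + random.gauss(0, math.sqrt(q)) for x in particles]
    # Weight by p(y | x) under Gaussian observation noise
    weights = [math.exp(-(y - x) ** 2 / (2 * r)) for x in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample to avoid weight degeneracy
    return random.choices(particles, weights=weights, k=len(particles))

random.seed(1)
x = 1.0
particles = [random.gauss(0, 1) for _ in range(500)]
for _ in range(30):
    x = 0.9 * x + random.gauss(0, math.sqrt(0.1))  # hidden signal
    y = x + math.sqrt(0.5) * random.gauss(0, 1)     # noisy observation
    particles = bootstrap_pf_step(particles, y)
est = sum(particles) / len(particles)  # particle approximation of the filter mean
```

For the linear-Gaussian model the particle mean approximates the Kalman solution; the point of the method is that the same recursion applies to nonlinear, non-Gaussian dynamics where no closed-form filter exists.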

The success of filtering theory over the years is due to its use in a great variety of problems arising in many disciplines, such as engineering, information sciences, and mathematical finance. Specifically, in this chapter, we have a financial application in view. In real financial markets, it is reasonable that investors cannot fully know all the stochastic factors that may influence the prices of negotiated assets, since these factors are usually associated with economic quantities that are hard to observe. Filtering theory provides a way to measure, in some sense, this uncertainty. A consistent part of the literature over the last years has considered stochastic factor models under partial information for analyzing various financial problems, such as pricing and hedging of derivatives, optimal investment, credit risk, and insurance modeling. A list, definitely nonexhaustive, is given by Refs. [15, 16, 20–26].

theory to the financial problem of derivatives hedging for a trader who has limitative informa-

A filtering model is characterized by a signal process, denoted by X, which cannot be observed directly, and an observation process denoted by Y whose dynamics depends on X. The natural

filtering problem is to determine the best estimation of the signal Xt from the knowledge of F<sup>Y</sup>

Similar to optimal Bayesian filtering, we seek for the best estimation of the signal according to the minimum mean-squared error criterion, which corresponds to compute the posterior

Historically, the first example of continuous-time filtering problem is the well-known Kalman-Bucy filter which concerns the case where Y gives the observation of X in additional Gaussian noise and both processes X and Yare modeled by linear stochastic differential equations. In this case, one ends up with a filter having finite-dimensional realization. Since then, the problem has been extended in many directions. To start, a number of authors including Refs. [1–3] studied the nonlinear case in the setting of additional Gaussian noise. Other references in a similar framework are given, for instance, by Refs. [4–8]. Subsequently also the case of counting process or marked point process observation has been considered (see Refs. [9–14] and reference therein). A more recent literature contains the case of mixed-type observations (marked point processes and diffusions or jump-diffusion processes), see, for, example, Refs. [15–18].

There are two major approaches to nonlinear filtering problems: the innovations method and the reference probability method. The latter is usually employed when it is possible to find an equivalent probability measure that makes the state X and the observations Y independent. This technique may appear problematic when, for instance, signal and observation are correlated and present common jump times. Therefore, in this chapter, we use the innovations approach, which allows us to circumvent the technical issues arising in the reference probability method. By characterizing the innovation process and applying a martingale representation theorem, we can derive the dynamics of the filter as the solution of the Kushner-Stratonovich equation, which is a nonlinear stochastic partial integro-differential equation. By considering the unnormalized version of the filter, it is possible to simplify this equation, making it linear. The resulting equation is called the Zakai equation, and due to its linear nature, it is of particular interest in many applications. We also compute the dynamics of the unnormalized filter, and we investigate pathwise uniqueness for the solutions of both equations. Normalized and unnormalized filters are probability measure and finite measure-valued processes, respectively, and therefore in general infinite-dimensional. Due to this, various recursive algorithms for statistical inference have been developed to address this intractability, such as the extended Kalman filter, statistical linearization, or particle filters. These algorithms aim to estimate both state and parameters. For parameter estimation, we also mention the expectation-maximization (EM) algorithm, which enables parameter estimation in models with incomplete data; see, for example, Ref. [19].

The success of filtering theory over the years is due to its use in a great variety of problems arising from many disciplines such as engineering, information sciences, and mathematical finance. Specifically, in this chapter, we have a financial application in view. In real financial markets, the natural filtration of $Y$, $\mathbb{F}^Y = \{\mathcal{F}^Y_t,\ t\in[0,T]\}$, represents the available information on the market. The goal of solving a filtering problem is to compute the conditional distribution of $X_t$ given the available observations up to time $t$.

326 Bayesian Inference
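To make the recursive algorithms mentioned above concrete, the following is a minimal bootstrap particle filter in Python for a toy linear-Gaussian state-space model. The dynamics, noise levels, and function name are illustrative assumptions, not taken from this chapter; the weighted particle cloud approximates the filter $\pi_t$.

```python
import math
import random

def bootstrap_particle_filter(observations, n_particles=500, seed=0):
    """One-dimensional bootstrap particle filter for the toy model
        X_k = 0.8 * X_{k-1} + V_k,   V_k ~ N(0, 0.5^2)
        Y_k = X_k + W_k,             W_k ~ N(0, 1.0^2)
    (purely illustrative dynamics, not the chapter's jump-diffusion model).
    Returns the sequence of posterior-mean estimates E[X_k | Y_1, ..., Y_k]."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate: sample from the signal transition kernel.
        particles = [0.8 * x + rng.gauss(0.0, 0.5) for x in particles]
        # Weight: likelihood of the new observation given each particle.
        weights = [math.exp(-0.5 * (y - x) ** 2) for x in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Filter estimate: the weighted mean approximates pi_t(identity).
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample (multinomial) to combat weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

The resampling step is what keeps the particle approximation stable over time; without it, a single particle eventually carries almost all the weight.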

In the following, we consider the problem of a trader who wants to determine the hedging strategy for a European-type contingent claim with maturity T in an incomplete financial market where the investment possibilities are given by a riskless asset, assumed to be the numéraire, and a risky asset whose price dynamics follow a geometric jump diffusion, modeled by the process $Y$. We assume that the drift, as well as the intensity and the jump size distribution of the price process, is influenced by an unobservable stochastic factor $X$, modeled as a correlated jump diffusion with common jump times. Common jump times allow us to take into account catastrophic events which affect both the asset price and the hidden state variable driving its dynamics. The agent knows the asset prices, since they are publicly available, and trades on the market by using the available information $\mathbb{F}^Y$.

Partial information easily leads to incomplete financial markets, as clearly the number of random sources is larger than the number of tradeable risky assets. Therefore, the existence of a self-financing strategy that replicates the payoff of the given contingent claim at maturity is not guaranteed. Here, we assume that the risky asset price is modeled under a martingale measure, and we choose the risk-minimization approach as the hedging criterion; see, for example, Refs. [27, 28].

According to this method, the optimal hedging strategy is the one that perfectly replicates the claim at maturity and has minimum cost in the mean-square sense. Equivalently, we say that it minimizes the associated risk defined as the conditional expected value of the squared future costs, given the available information (see Refs. [28, 29] and references therein).

The risk-minimizing hedging strategy under restricted information is strictly related to the Galtchouk-Kunita-Watanabe decomposition of the random variable representing the payoff of the contingent claim in a partial information setting. Here, we provide a characterization of the risk-minimizing strategy under partial information via this orthogonal decomposition and obtain a representation in terms of the corresponding risk-minimizing hedging strategy under full information (see, e.g., Refs. [29, 30]) via predictable projections on the available information flow by means of the filter. Finally, we investigate the difference between the expected total risks associated with the optimal hedging strategies under full and partial information.
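Schematically, and under the martingale-measure assumption above, the Galtchouk-Kunita-Watanabe decomposition invoked here can be written as follows (illustrative notation; the precise integrability conditions on the integrand $\theta$ and the orthogonal term $L$ are those of Refs. [29, 30]):

```latex
% GKW decomposition of the claim's payoff G (schematic):
% a stochastic integral against the price Y plus an orthogonal residual.
G = \mathbb{E}[G] + \int_0^T \theta_s \,\mathrm{d}Y_s + L_T,
\qquad \text{with } L \text{ a martingale orthogonal to } Y.
```

The risk-minimizing strategy invests $\theta$ in the risky asset, and $L_T$ is the residual risk that cannot be hedged away; under partial information, $\theta$ must in addition be predictable with respect to the available information flow $\mathbb{F}^Y$.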

The chapter has the following structure. In Section 2, we introduce the general framework. In Section 3, we study the filtering equations. In particular, we derive the dynamics for both normalized and unnormalized filters, and we investigate uniqueness of the solutions of the Kushner-Stratonovich and the Zakai equation. In Section 4, we analyze a financial application to risk minimization by computing the optimal hedging strategies for a European-type contingent claim under full and partial information and providing a comparison between the corresponding expected squared total costs.

### 2. The setting

We consider a pair of stochastic processes (X, Y), with values in $\mathbb{R}\times\mathbb{R}$ and càdlàg trajectories, on a complete filtered probability space $(\Omega,\mathcal{F},\mathbb{F},\mathbf{P})$, where $\mathbb{F} = \{\mathcal{F}_t,\ t\in[0,T]\}$ is a filtration satisfying the usual conditions of right continuity and completeness, and $T$ is a fixed time horizon. The pair (X, Y) represents a partially observable system, where X is a signal process that describes a phenomenon which is not directly observable, and Y gives the observation of X; Y is modeled by a process correlated with the signal, having possibly common jump times.

Remark 1. In view of the financial application discussed in Section 4, $Y$ represents the price of some risky asset, while $X$ is an unknown stochastic factor, which may describe the activity of other markets, macroeconomic factors, or microstructure rules that influence the dynamics of the stock price process.

We define the observed history as the natural filtration of the observation process $Y$, that is, $\mathbb{F}^Y = \{\mathcal{F}^Y_t\}_{t\in[0,T]}$, where $\mathcal{F}^Y_t := \sigma(Y_s;\ 0\le s\le t)$. The $\sigma$-algebra $\mathcal{F}^Y_t$ can be interpreted as the information available from observations up to time $t$. We aim to compute the best estimate of the signal $X$ from the available information, in the quadratic sense. In other terms, this corresponds to determining the filter, which furnishes the conditional distribution of $X_t$ given $\mathcal{F}^Y_t$, for every $t\in[0,T]$.

Let $\mathcal{M}(\mathbb{R})$ be the space of finite measures over $\mathbb{R}$ and $\mathcal{P}(\mathbb{R})$ the subspace of probability measures over $\mathbb{R}$. Given $\mu\in\mathcal{M}(\mathbb{R})$, for any bounded measurable function $f$, we write

$$\mu(f) = \int_{\mathbb{R}} f(x)\,\mu(\mathrm{d}x).\tag{1}$$

Definition 2. The filter is the $\mathbb{F}^Y$-càdlàg process $\pi$ taking values in $\mathcal{P}(\mathbb{R})$ defined by

$$\pi_t(f) := \mathbb{E}\left[f(t,X_t)\,|\,\mathcal{F}_t^Y\right] = \int_{\mathbb{R}} f(t,x)\,\pi_t(\mathrm{d}x),\tag{2}$$

for all bounded and measurable functions $f(t,x)$ on $[0,T]\times\mathbb{R}$.

In the sequel, we denote by $\pi_{t^-}$ the left version of the filter and, for all functions $F(t,x,y)$ such that $\mathbb{E}|F(t,X_t,Y_t)| < \infty$ (resp. $\mathbb{E}|F(t,X_{t^-},Y_{t^-})| < \infty$) for every $t\in[0,T]$, we use the notation $\pi_t(F) := \pi_t(F(t,\cdot\,,Y_t))$ (resp. $\pi_{t^-}(F) := \pi_{t^-}(F(t,\cdot\,,Y_{t^-}))$).

In this chapter, we wish to consider the filtering problem for a partially observable system (X, Y) described by the following pair of stochastic differential equations:

$$\begin{cases} \mathrm{d}X_t = b_0(t,X_t)\,\mathrm{d}t + \sigma_0(t,X_t)\,\mathrm{d}W_t^0 + \displaystyle\int_Z K_0(t,X_{t^-};\zeta)\,N(\mathrm{d}t,\mathrm{d}\zeta); & X_0 = x_0\in\mathbb{R} \\ \mathrm{d}Y_t = b_1(t,X_t,Y_t)\,\mathrm{d}t + \sigma_1(t,Y_t)\,\mathrm{d}W_t^1 + \displaystyle\int_Z K_1(t,X_{t^-},Y_{t^-};\zeta)\,N(\mathrm{d}t,\mathrm{d}\zeta); & Y_0 = y_0\in\mathbb{R} \end{cases}\tag{3}$$

where $W^0$ and $W^1$ are correlated $(\mathbb{F},\mathbf{P})$-Brownian motions with correlation coefficient $\rho\in[-1,1]$, and $N(\mathrm{d}t,\mathrm{d}\zeta)$ is a Poisson random measure on $\mathbb{R}^+\times Z$ whose intensity $\nu(\mathrm{d}\zeta)\,\mathrm{d}t$ is given by a $\sigma$-finite measure $\nu$ on a measurable space $(Z,\mathcal{Z})$. Here, $b_0$, $b_1$, $\sigma_0$, $\sigma_1$, $K_0$, and $K_1$ are $\mathbb{R}$-valued measurable functions of their arguments. In particular, $\sigma_0(t,x)$ and $\sigma_1(t,y)$ are strictly positive for every $(t,x,y)\in[0,T]\times\mathbb{R}^2$.

For the rest of the paper, we assume that strong existence and uniqueness for system Eq. (3) hold. Sufficient conditions are collected, for instance, in Ref. [18, Appendix]. These assumptions also imply Markovianity for the pair (X, Y).
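For intuition, a discretized simulation of a toy instance of the system in Eq. (3) might look as follows. All coefficients, the jump-size law, and the function name are illustrative choices, not the chapter's; the key structural features of Eq. (3) that are reproduced are the correlated Brownian motions and the common jump times shared by signal and observation.

```python
import math
import random

def simulate_partially_observed_system(T=1.0, n_steps=1000, rho=0.5,
                                        jump_rate=2.0, seed=1):
    """Euler-type simulation of a toy instance of the pair (X, Y) in Eq. (3):
        dX_t = -X_t dt + 0.3 dW^0_t + dJ_t
        dY_t =  X_t dt + 0.2 dW^1_t + dJ_t
    where W^0, W^1 have correlation rho and J is a compound Poisson process
    (rate jump_rate, N(0, 0.1^2) jump sizes) hitting BOTH components, so the
    signal and the observation share common jump times."""
    rng = random.Random(seed)
    dt = T / n_steps
    x, y = 0.0, 0.0
    xs, ys = [x], [y]
    for _ in range(n_steps):
        # Correlated Brownian increments via Cholesky:
        # dW^1 = rho * dW^0 + sqrt(1 - rho^2) * dB, with B independent of W^0.
        dw0 = rng.gauss(0.0, math.sqrt(dt))
        dw1 = rho * dw0 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, math.sqrt(dt))
        # One Poisson clock drives jumps in both X and Y (common jump times).
        jump = rng.gauss(0.0, 0.1) if rng.random() < jump_rate * dt else 0.0
        x_new = x + (-x) * dt + 0.3 * dw0 + jump
        y = y + x * dt + 0.2 * dw1 + jump
        x = x_new
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Plotting the two paths shows the simultaneous jumps, the feature the sets $D_t^0 \cap D_t$ below are designed to capture.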

Remark 3. Note that the quadratic variation process of Y defined by


$$[Y]_t = Y_t^2 - 2\int_0^t Y_{u^-}\,\mathrm{d}Y_u, \quad t\in[0,T],\tag{4}$$

is $\mathbb{F}^Y$-adapted and $[Y]_t = \int_0^t \sigma_1^2(u,Y_u)\,\mathrm{d}u + \sum_{u\le t}(\Delta Y_u)^2$, where $\Delta Y_t := Y_t - Y_{t^-}$. Therefore, it is natural to assume that the signal $X$ does not affect the diffusion coefficient in the dynamics of $Y$. If $Y$ describes the price of a risky asset, this implies that the volatility of the stock price does not depend on the stochastic factor $X$.

The jump component of Y can be described in terms of the following integer-valued random measure on [0, T] � R:

$$m(\mathrm{d}t,\mathrm{d}z) = \sum_{s:\,\Delta Y_s \neq 0} \delta_{(s,\Delta Y_s)}(\mathrm{d}t,\mathrm{d}z),\tag{5}$$

where $\delta_a$ denotes the Dirac measure at point $a$. Note that the following equality holds:

$$\int_0^t\int_{\mathbb{R}} z\,m(\mathrm{d}s,\mathrm{d}z) = \int_0^t\int_Z K_1(s,X_{s^-},Y_{s^-};\zeta)\,N(\mathrm{d}s,\mathrm{d}\zeta).\tag{6}$$

For all $t\in[0,T]$ and all $A\in\mathcal{B}(\mathbb{R})$, we define the following sets:

$$d^0(t,x) := \{\zeta\in Z : K_0(t,x;\zeta)\neq 0\}, \quad d^1(t,x,y) := \{\zeta\in Z : K_1(t,x,y;\zeta)\neq 0\},\tag{7}$$

$$d^A(t,x,y) := \{\zeta\in Z : K_1(t,x,y;\zeta)\in A\setminus\{0\}\} \subseteq d^1(t,x,y),\tag{8}$$

$$D_t^A := d^A(t,X_{t^-},Y_{t^-}) \subseteq D_t := d^1(t,X_{t^-},Y_{t^-}), \quad D_t^0 := d^0(t,X_{t^-}).\tag{9}$$

Typically, we have $D_t^0\cap D_t \neq \emptyset$ $\mathbf{P}$-a.s., which means that state and observation may have common jump times. This characteristic is particularly meaningful in financial applications to model catastrophic events that produce jumps in both the stock price and the underlying stochastic factor that influences its dynamics.

To ensure existence of the first moment for the pair (X, Y) and non-explosiveness for the jump process governing the dynamics of X and Y, we make the following assumption:

Assumption 4.

$$\mathbb{E}\left[\int_0^T \left(|b_0(t,X_t)| + \sigma_0^2(t,X_t) + \int_Z |K_0(t,X_{t^-};\zeta)|\,\nu(\mathrm{d}\zeta)\right)\mathrm{d}t\right] < \infty,\tag{10}$$

$$\mathbb{E}\left[\int_0^T \left(|b_1(t,X_t,Y_t)| + \sigma_1^2(t,Y_t) + \int_Z |K_1(t,X_{t^-},Y_{t^-};\zeta)|\,\nu(\mathrm{d}\zeta)\right)\mathrm{d}t\right] < \infty,\tag{11}$$

$$\mathbb{E}\left[\int\_{0}^{T} \nu(D\_t^0 \cup D\_t) \mathbf{d}t\right] < \infty. \tag{12}$$

Denote by $\eta^{\mathbf{P}}(\mathrm{d}t,\mathrm{d}z)$ the $(\mathbb{F},\mathbf{P})$-compensator of $m(\mathrm{d}t,\mathrm{d}z)$ (see, e.g., Refs. [9, 31] for the definition).

Then, in Ref. [14, Proposition 2.2], it is proved that

$$\eta^{\mathbf{P}}(\mathrm{d}t,\mathrm{d}z) = \lambda(t,X_{t^-},Y_{t^-})\,\phi(t,X_{t^-},Y_{t^-},\mathrm{d}z)\,\mathrm{d}t,\tag{13}$$

where

$$\lambda(t,x,y)\,\phi(t,x,y,\mathrm{d}z) = \int_{d^1(t,x,y)} \delta_{K_1(t,x,y;\zeta)}(\mathrm{d}z)\,\nu(\mathrm{d}\zeta)\tag{14}$$

and, in particular, $\lambda(t,x,y) = \nu(d^1(t,x,y))$.

Remark 5. Let us observe that both of the local jump characteristics $(\lambda(t,X_{t^-},Y_{t^-}),\ \phi(t,X_{t^-},Y_{t^-},\mathrm{d}z))$ depend on $X$ and that, for all $A\in\mathcal{B}(\mathbb{R})$, $\lambda(t,X_{t^-},Y_{t^-})\,\phi(t,X_{t^-},Y_{t^-},A) = \nu(D_t^A)$ provides the $(\mathbb{F},\mathbf{P})$-intensity of the point process $N_t(A) := m((0,t]\times A)$. Accordingly, the process $\lambda(t,X_{t^-},Y_{t^-}) = \nu(D_t)$ is the $(\mathbb{F},\mathbf{P})$-intensity of the point process $N_t(\mathbb{R})$, which counts the total number of jumps of $Y$ up to time $t$.

#### 2.1. The innovation process

To derive the filtering equation, we use the innovations approach. This method requires introducing a pair $(I, m^\pi)$, called the innovation process, consisting of the $(\mathbb{F}^Y,\mathbf{P})$-Brownian motion and the $(\mathbb{F}^Y,\mathbf{P})$-compensated jump measure that drive the dynamics of the filter. The innovation also represents the building block of $(\mathbb{F}^Y,\mathbf{P})$-martingales.

To introduce the first component of the innovation process, we assume that

$$\mathbb{E}\left[\exp\left\{\frac{1}{2}\int\_{0}^{T}\left(\frac{b\_{1}(t,X\_{t},Y\_{t})}{\sigma\_{1}(t,Y\_{t})}\right)^{2}\mathrm{d}t\right\}\right] < \infty,\tag{15}$$

and define

$$I\_t := W\_t^1 + \int\_0^t \left( \frac{b\_1(\mathbf{s}, X\_{\mathbf{s}}, Y\_s)}{\sigma\_1(\mathbf{s}, Y\_s)} - \frac{\pi\_s(b\_1)}{\sigma\_1(\mathbf{s}, Y\_s)} \right) \mathbf{ds}, \quad t \in [0, T]. \tag{16}$$

The process $I$ is an $(\mathbb{F}^Y,\mathbf{P})$-Brownian motion (see, e.g., Ref. [4]), and the $(\mathbb{F}^Y,\mathbf{P})$-compensated jump martingale measure is given by

Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete… http://dx.doi.org/10.5772/intechopen.70060 331

$$m^\pi(\mathrm{d}t,\mathrm{d}z) = m(\mathrm{d}t,\mathrm{d}z) - \pi_{t^-}(\lambda\phi(\mathrm{d}z))\,\mathrm{d}t,\tag{17}$$

see, e.g., Ref. [14]. The following theorem provides a characterization of $(\mathbb{F}^Y,\mathbf{P})$-martingales in terms of the innovation process.

Theorem 6 (A martingale representation theorem). Under Assumption 4 and the integrability condition Eq. (15), every $(\mathbb{F}^Y,\mathbf{P})$-local martingale $M$ admits the following decomposition:

$$M\_t = M\_0 + \int\_0^t \int\_{\mathbb{R}} w\_s(\mathbf{z}) m^\pi(\mathbf{ds}, \mathbf{dz}) + \int\_0^t h\_s \mathbf{d}I\_s, \quad t \in [0, T], \tag{18}$$

where $w(z) = \{w_t(z),\ t\in[0,T]\}$ is an $\mathbb{F}^Y$-predictable process indexed by $z$, and $h = \{h_t,\ t\in[0,T]\}$ is an $\mathbb{F}^Y$-adapted process such that

$$\int\_{0}^{T} \int\_{\mathbb{R}} |w\_{t}(z)| \pi\_{t-}(\lambda \phi(\mathrm{d}z)) \mathrm{d}t < \infty, \quad \int\_{0}^{T} h\_{t}^{2} \mathrm{d}t < \infty \quad \mathbf{P}-a.s. \tag{19}$$

Proof. The proof is given in Ref. [17, Proposition 2.4]. Note that here condition (15) implies that $\mathbb{E}\left[\int_0^T \left(\frac{b_1(t,X_t,Y_t)}{\sigma_1(t,Y_t)}\right)^2 \mathrm{d}t\right] < \infty$, and also that the process $L$ defined by

$$L_t = \exp\left(-\int_0^t \frac{b_1(s,X_s,Y_s)}{\sigma_1(s,Y_s)}\,\mathrm{d}W_s^1 - \frac{1}{2}\int_0^t \left(\frac{b_1(s,X_s,Y_s)}{\sigma_1(s,Y_s)}\right)^2 \mathrm{d}s\right),\tag{20}$$

for every $t\in[0,T]$, is an $(\mathbb{F},\mathbf{P})$-martingale.

### 3. The filtering equations

Theorem 7 (The Kushner-Stratonovich equation). Under Assumption 4 and condition (15), the filter $\pi$ solves the following Kushner-Stratonovich equation; that is, for every $f\in C_b^{1,2}([0,T]\times\mathbb{R})$:

$$\pi_t(f) = f(0,x_0) + \int_0^t \pi_s(\mathcal{L}^X f)\,\mathrm{d}s + \int_0^t\int_{\mathbb{R}} w_s^\pi(f,z)\,m^\pi(\mathrm{d}s,\mathrm{d}z) + \int_0^t h_s^\pi(f)\,\mathrm{d}I_s, \quad t\in[0,T],\tag{21}$$

where


$$w_t^\pi(f,z) = \frac{\mathrm{d}\pi_{t^-}(\lambda\phi f)}{\mathrm{d}\pi_{t^-}(\lambda\phi)}(z) - \pi_{t^-}(f) + \frac{\mathrm{d}\pi_{t^-}(\overline{\mathcal{L}}f)}{\mathrm{d}\pi_{t^-}(\lambda\phi)}(z),\tag{22}$$

$$h\_t^{\pi}(f) = \sigma\_1^{-1}(t)[\pi\_t(b\_1f) - \pi\_t(b\_1)\pi\_t(f)] + \rho\pi\_t\left(\sigma\_0\frac{\partial f}{\partial \mathbf{x}}\right). \tag{23}$$

Here, by $\frac{\mathrm{d}\pi_{t^-}(\lambda\phi f)}{\mathrm{d}\pi_{t^-}(\lambda\phi)}(z)$ and $\frac{\mathrm{d}\pi_{t^-}(\overline{\mathcal{L}}f)}{\mathrm{d}\pi_{t^-}(\lambda\phi)}(z)$, we mean the Radon-Nikodym derivatives of the measures $\pi_{t^-}(\lambda f\,\phi(\mathrm{d}z))$ and $\pi_{t^-}(\overline{\mathcal{L}}f)(\mathrm{d}z)$ with respect to $\pi_{t^-}(\lambda\phi(\mathrm{d}z))$. Moreover, the operator $\overline{\mathcal{L}}$, defined by $\overline{\mathcal{L}}_t f(\mathrm{d}z) := \overline{\mathcal{L}}f(\cdot\,; Y_{t^-}, \mathrm{d}z)$, is such that for every $A\in\mathcal{B}(\mathbb{R})$,

$$\overline{\mathcal{L}}f(t,x,y,A) = \int_{d^A(t,x,y)} \left[f(t,x+K_0(t,x;\zeta)) - f(t,x)\right]\nu(\mathrm{d}\zeta)\tag{24}$$

takes into account common jump times between the signal X and the observation Y.

Finally, the operator $\mathcal{L}^X$ given by

$$\mathcal{L}^X f(t,x) = \frac{\partial f}{\partial t} + b_0(t,x)\frac{\partial f}{\partial x} + \frac{1}{2}\sigma_0^2(t,x)\frac{\partial^2 f}{\partial x^2} + \int_Z \left\{f(t,x+K_0(t,x;\zeta)) - f(t,x)\right\}\nu(\mathrm{d}\zeta)\tag{25}$$

denotes the generator of the Markov process X.

Proof. The theorem is proved in Ref. [17, Theorem 3.1].

Example 8 (Observation dynamics driven by independent point processes with unobservable intensities). In the sequel, we provide an example where the Kushner-Stratonovich equation simplifies and the Radon-Nikodym derivatives appearing in the dynamics of π(f) reduce to ratios. Suppose that there exists a finite set of measurable functions $K_1^i(t,y)\neq 0$ for all $(t,y)\in[0,T]\times\mathbb{R}$, $i\in\{1,\dots,n\}$, such that the dynamics of $Y$ is given by

$$\mathrm{d}Y_t = b_1(t,X_t,Y_t)\,\mathrm{d}t + \sigma_1(t,Y_t)\,\mathrm{d}W_t^1 + \sum_{i=1}^n K_1^i(t,Y_{t^-})\,\mathrm{d}N_t^i, \quad Y_0 = y_0\in\mathbb{R},\tag{26}$$

where the $N^i$ are independent counting processes with $(\mathbb{F},\mathbf{P})$-intensities $\lambda^i(t,X_{t^-},Y_{t^-})$.

For simplicity, in this example, we assume that X and Y have no common jump times. Then, the filtering Eq. (21) reads as

$$\begin{split} \pi_t(f) &= f(0,x_0) + \int_0^t \pi_s(\mathcal{L}^X f)\,\mathrm{d}s + \int_0^t \left\{\sigma_1(s)^{-1}\left[\pi_s(b_1 f) - \pi_s(b_1)\pi_s(f)\right] + \rho\,\pi_s\!\left(\sigma_0\frac{\partial f}{\partial x}\right)\right\}\mathrm{d}I_s \\ &\quad + \sum_{i=1}^n \int_0^t \mathbf{1}_{\{\pi_{s^-}(\lambda^i)>0\}}\,\frac{\pi_{s^-}(\lambda^i f) - \pi_{s^-}(f)\,\pi_{s^-}(\lambda^i)}{\pi_{s^-}(\lambda^i)}\left(\mathrm{d}N_s^i - \pi_{s^-}(\lambda^i)\,\mathrm{d}s\right), \quad t\in[0,T]. \end{split}\tag{27}$$

Note that Eq. (21) has an equivalent expression in terms of the operator $\mathcal{L}_0^X$, given by

$$\begin{split} \mathcal{L}_0^X f(t,x,y) &= \mathcal{L}^X f(t,x) - \overline{\mathcal{L}}f(t,x,y,\mathbb{R}) \\ &= \frac{\partial f}{\partial t}(t,x) + b_0(t,x)\frac{\partial f}{\partial x} + \frac{1}{2}\sigma_0^2(t,x)\frac{\partial^2 f}{\partial x^2} + \int_{d^1(t,x,y)^c} \left\{f(t,x+K_0(t,x;\zeta)) - f(t,x)\right\}\nu(\mathrm{d}\zeta), \end{split}\tag{28}$$

where $d^1(t,x,y)^c = \{\zeta\in Z : K_1(t,x,y;\zeta) = 0\}$. Indeed, we get

$$\mathrm{d}\pi_t(f) = \left\{\pi_t(\mathcal{L}_0^X f) + \pi_t(f)\pi_t(\lambda) - \pi_t(\lambda f)\right\}\mathrm{d}t + h_t^\pi(f)\,\mathrm{d}I_t + \int_{\mathbb{R}} w_t^\pi(f,z)\,m(\mathrm{d}t,\mathrm{d}z).\tag{29}$$

Moreover, the filter has a natural recursive structure. To show this, define the sequence $\{T_n, Z_n\}_{n\in\mathbb{N}}$ of jump times and jump sizes of $Y$, that is, $Z_n = Y_{T_n} - Y_{T_n^-}$. These are observable data. Then, between two consecutive jump times, the filter is governed by a diffusion process; that is, for $t\in(T_n\wedge T,\ T_{n+1}\wedge T)$,

$$\pi_t(f) = \pi_{T_n}(f) + \int_{T_n}^t \left\{\pi_s(\mathcal{L}_0^X f) + \pi_s(f)\pi_s(\lambda) - \pi_s(\lambda f)\right\}\mathrm{d}s + \int_{T_n}^t h_s^\pi(f)\,\mathrm{d}I_s,\tag{30}$$

and at any jump time Tn occurring before time T, it is given by


$$\pi_{T_n}(f) = \frac{\mathrm{d}\pi_{T_n^-}(\lambda\phi f)}{\mathrm{d}\pi_{T_n^-}(\lambda\phi)}(Z_n) + \frac{\mathrm{d}\pi_{T_n^-}(\overline{\mathcal{L}}f)}{\mathrm{d}\pi_{T_n^-}(\lambda\phi)}(Z_n),\tag{31}$$

which implies that $\pi_{T_n}(f)$ is completely determined by the observed data $(T_n, Z_n)$ and the knowledge of $\pi_t(f)$ in the time interval $[T_{n-1}, T_n)$, since $\pi_{T_n^-}(f) = \lim_{t\to T_n^-}\pi_t(f)$.
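To make this recursion concrete, here is a minimal Python sketch for the special case of a finite-state signal observed through a single point process (in the spirit of Example 8, with no diffusion term and no common jumps). The generator, intensities, and Euler discretization are illustrative assumptions, not the chapter's construction.

```python
def point_process_filter(Q, lam, jump_times, T, dt=1e-3):
    """Filter for a finite-state Markov chain X with generator Q (Q[i][j] is
    the jump rate from state i to j, Q[i][i] = -sum of the other rates in
    row i), observed through a counting process with intensity lam[X_t].
    Between observed jump times, the filter vector pi solves the ODE
        dpi_i/dt = (Q^T pi)_i - lam_i * pi_i + pi_i * sum_j lam_j * pi_j,
    and at each observed jump time it is updated by Bayes' rule
        pi_i <- lam_i * pi_i / sum_j lam_j * pi_j,
    i.e. the Kushner-Stratonovich dynamics specialized to a pure
    point-process observation."""
    d = len(lam)
    pi = [1.0 / d] * d                      # uniform initial law
    jumps = sorted(jump_times)
    t, k = 0.0, 0
    for _ in range(int(round(T / dt))):
        if k < len(jumps) and jumps[k] <= t + dt:
            # Jump observed in this step: Bayes update with the intensities.
            s = sum(l * p for l, p in zip(lam, pi))
            pi = [l * p / s for l, p in zip(lam, pi)]
            k += 1
        else:
            # Euler step of the filter ODE between jumps.
            qtp = [sum(Q[j][i] * pi[j] for j in range(d)) for i in range(d)]
            avg = sum(l * p for l, p in zip(lam, pi))
            pi = [p + dt * (qtp[i] - lam[i] * p + p * avg)
                  for i, p in enumerate(pi)]
            s = sum(pi)                     # renormalize Euler drift error
            pi = [p / s for p in pi]
        t += dt
    return pi
```

Observing a burst of jumps drives the filter toward the state with the highest intensity, exactly the mechanism behind the update at the times $T_n$ above.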

Note that the Kushner-Stratonovich equation is an infinite-dimensional nonlinear stochastic differential equation. Often, it is possible to characterize the filter in terms of a simpler equation, known as the Zakai equation, which provides the dynamics of the unnormalized version of the filter. Although the Zakai equation is still infinite-dimensional, it has the advantage of being linear.

The idea for getting the dynamics of the unnormalized filter consists of performing an equivalent change of probability measure defined by

$$\left.\frac{\mathrm{d}\mathbf{P}^0}{\mathrm{d}\mathbf{P}}\right|_{\mathcal{F}_t} = Z_t, \quad t\in[0,T],\tag{32}$$

for a suitable strictly positive $(\mathbb{F},\mathbf{P})$-martingale $Z$, in such a way that the so-called unnormalized filter $p$ is the $\mathcal{M}(\mathbb{R})$-valued process defined by

$$p\_t(f) := \mathbb{E}^0 \left[ Z\_t^{-1} f(t, X\_t) | \mathcal{F}\_t^Y \right], \quad t \in [0, T]. \tag{33}$$

Remark 9. By the Kallianpur-Striebel formula, we get that

$$\pi_t(f) = \frac{\mathbb{E}^0\left[f(t,X_t)\,Z_t^{-1}\,|\,\mathcal{F}_t^Y\right]}{\mathbb{E}^0\left[Z_t^{-1}\,|\,\mathcal{F}_t^Y\right]} = \frac{p_t(f)}{p_t(1)}, \quad t\in[0,T],\tag{34}$$

where $p_t(1) := \mathbb{E}^0\left[Z_t^{-1}\,|\,\mathcal{F}_t^Y\right]$. This provides the relation between the filter and its unnormalized version. In order to compute the Zakai equation, we make the following assumption.
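Numerically, the Kallianpur-Striebel formula is the continuous-time analogue of normalizing importance-sampling weights: the unnormalized filter plays the role of unnormalized weights under the reference measure, and dividing by their sum recovers the conditional expectation. A static Monte Carlo sketch (a hypothetical one-step Gaussian model, not from the chapter):

```python
import math
import random

def kallianpur_striebel_demo(y, n=20000, seed=3):
    """Static analogue of Eq. (34): X ~ N(0,1) under the reference measure,
    and the observation contributes the likelihood weight
        w(x) ∝ exp(-(y - x)^2 / 2),
    playing the role of Z_t^{-1}.  Then
        pi(f) = p(f) / p(1) = E0[f(X) w(X)] / E0[w(X)],
    estimated here by Monte Carlo for f = identity, i.e. E[X | Y = y]."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ws = [math.exp(-0.5 * (y - x) ** 2) for x in xs]     # unnormalized filter p
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)  # p(f) / p(1)
```

For this Gaussian model, the exact posterior mean is $y/2$, so the estimate should land close to that value; the division by `sum(ws)` is precisely the normalization $p_t(f)/p_t(1)$ of Eq. (34).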

Assumption 10. Suppose that there exists a transition function $\eta^0(t,y,\mathrm{d}z)$ such that the $(\mathbb{F}^Y,\mathbf{P})$-predictable measure $\eta^0(t,Y_{t^-},\mathrm{d}z)$ is equivalent to $\lambda(t,X_{t^-},Y_{t^-})\,\phi(t,X_{t^-},Y_{t^-},\mathrm{d}z)$ and

$$\mathbb{E}\left[\int\_{0}^{T} \eta^{0}(t, Y\_{t^{-}}, \mathbb{R}) \mathbf{d}t \right] < \infty. \tag{35}$$

Remark 11. In Ref. [18], a weaker assumption is considered. That condition allows one to introduce an equivalent probability measure on $(\Omega,\mathcal{F}_T^Y)$ which is not necessarily the restriction to $\mathcal{F}_T^Y$ of an equivalent probability measure on $(\Omega,\mathcal{F}_T)$.

Remark 12. In the context of Example 8, Assumption 10 is satisfied if, for instance, $\lambda^i(t,X_{t^-},Y_{t^-}) > 0$ $\mathbf{P}$-a.s. for every $t\in[0,T]$.

Assumption 10 equivalently means that there exists an $(\mathbb{F},\mathbf{P})$-predictable process $\Psi(t,X_{t^-},Y_{t^-},z)$ such that

$$\lambda(t,X_{t^-},Y_{t^-})\,\phi(t,X_{t^-},Y_{t^-},\mathrm{d}z)\,\mathrm{d}t = \left(1 + \Psi(t,X_{t^-},Y_{t^-},z)\right)\eta^0(t,Y_{t^-},\mathrm{d}z)\,\mathrm{d}t\tag{36}$$

and $1 + \Psi(t,X_{t^-},Y_{t^-},z) > 0$ $\mathbf{P}$-a.s. for every $t\in[0,T]$, $z\in\mathbb{R}$. Setting

$$U(t,z) := \frac{1}{1+\Psi(t,X_{t^-},Y_{t^-},z)} - 1,\tag{37}$$

we also assume that the following integrability condition holds:

$$\mathbb{E}\left[\exp\left\{\frac{1}{2}\int_0^T\left(\frac{b_1(s,X_s,Y_s)}{\sigma_1(s,Y_s)}\right)^2\mathrm{d}s + \int_0^T\!\int_{\mathbb{R}} U^2(s,z)\,\lambda(s,X_{s^-},Y_{s^-})\,\phi(s,X_{s^-},Y_{s^-},\mathrm{d}z)\,\mathrm{d}s\right\}\right] < \infty.\tag{38}$$

The following proposition provides a version of the Girsanov theorem suited to our setting.

Proposition 13. Let Assumptions 4 and 10 and condition (38) hold, and define, for every t ∈ [0, T], the process

$$Z_t := \mathcal{E}\left(-\int_0^t \frac{b_1(s, X_s, Y_s)}{\sigma_1(s, Y_s)}\,\mathrm{d}W^1_s + \int_0^t\int_{\mathbb{R}} U(s, z)\left(m(\mathrm{d}s, \mathrm{d}z) - \lambda(s, X_{s^-}, Y_{s^-})\,\phi(s, X_{s^-}, Y_{s^-}, \mathrm{d}z)\,\mathrm{d}s\right)\right),$$

where E(M) denotes the Doléans-Dade exponential of a martingale M. Then, Z is a strictly positive (F, P)-martingale. Let P^0 be the probability measure equivalent to P given by

$$\left.\frac{\mathrm{d}\mathbf{P}^0}{\mathrm{d}\mathbf{P}}\right|_{\mathcal{F}_t} = Z_t, \quad t \in [0, T]. \tag{39}$$

Then, the process

$$\widehat{\boldsymbol{W}}\_t^1 := \boldsymbol{W}\_t^1 + \int\_0^t \frac{b\_1(\mathbf{s}, \mathbf{X}\_{\boldsymbol{s}}, \mathbf{Y}\_{\boldsymbol{s}})}{\sigma\_1(\mathbf{s}, \mathbf{Y}\_{\boldsymbol{s}})} \mathbf{ds}, \quad t \in [0, T] \tag{40}$$

is an (F, P^0)-Brownian motion, and the (F, P^0)-predictable projection of the integer-valued random measure m(dt, dz) is given by η^0(t, Y_{t-}, dz)dt.

Proof. Ref. [32, Theorem 9] ensures that Z is a martingale under Assumptions 4 and 10 and the integrability condition Eq. (38). The claim then follows by Ref. [31, Chapter III, Theorem 3.24].

Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete… http://dx.doi.org/10.5772/intechopen.70060

Note that, by Eq. (16), the process Ŵ^1 can also be written as

$$
\widehat{\boldsymbol{W}}\_t^1 = \boldsymbol{I}\_t + \int\_0^t \pi\_s \left(\frac{b\_1}{\sigma\_1}\right) \mathbf{ds}, \quad t \in [0, T] \tag{41}
$$

which implies that Ŵ^1 is also an (F^Y, P^0)-Brownian motion. Moreover, since η^0(t, Y_{t-}, dz) is F^Y-predictable, it provides the (F^Y, P^0)-predictable projection of the measure m(dt, dz), and the observation process Y satisfies dY_t = σ_1(t, Y_t) dŴ^1_t + ∫_R z m(dt, dz). In particular, η^0_t(R) := η^0(t, Y_{t-}, R) is the (F^Y, P^0)-intensity of the point process which counts the total number of jumps of Y until time t.
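As a sanity check on the change of measure above, the martingale property E[Z_t] = 1 of the Doléans-Dade exponential can be verified by Monte Carlo in a stripped-down case. The sketch below is a simplifying assumption, not the chapter's model: it takes a constant ratio b_1/σ_1 = θ and drops the jump part, so Novikov's condition holds trivially.

```python
import numpy as np

# Monte Carlo sketch of the Doleans-Dade exponential
#   Z_t = exp(-int theta dW - 0.5 * int theta^2 ds),
# where theta stands for b_1/sigma_1 (constant here, an illustrative
# assumption) and the jump part is omitted. Under Novikov's condition,
# Z is a strictly positive martingale with E[Z_T] = 1.
rng = np.random.default_rng(0)
n_paths, n_steps, T, theta = 50_000, 50, 1.0, 0.8
dt = T / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
log_Z = (-theta * dW - 0.5 * theta**2 * dt).sum(axis=1)  # log Z_T per path
Z_T = np.exp(log_Z)

print(f"E[Z_T] ~ {Z_T.mean():.4f} (exact value: 1)")
assert (Z_T > 0).all()                 # strict positivity of the density
assert abs(Z_T.mean() - 1.0) < 0.02    # martingale mean, up to MC error
```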

Theorem 14 (The Zakai equation). Under Assumptions 4 and 10 and condition (38), let P^0 be the probability measure defined in Proposition 13. For every f ∈ C^{1,2}_b([0, T] × R), the unnormalized filter defined in Eq. (33) satisfies the equation

$$\begin{split} \mathrm{d}p_t(f) &= \left\{ p_t(\mathcal{L}^X_0 f) - p_t(\lambda f) + \eta^0_t(\mathbb{R})\,p_t(f) \right\}\mathrm{d}t + \left\{ \frac{p_t(b_1 f)}{\sigma_1(t, Y_t)} + \rho\, p_t\!\left(\sigma_0 \frac{\partial f}{\partial x}\right) \right\}\mathrm{d}\widehat{W}^1_t \\ &\quad + \int_{\mathbb{R}} \left\{ p_{t^-}\left(f^M\right)(z) + \frac{\mathrm{d}p_{t^-}(\overline{\mathcal{L}} f)}{\mathrm{d}\eta^0_t}(z) \right\} m(\mathrm{d}t, \mathrm{d}z). \end{split} \tag{42}$$

See Ref. [18, Theorem 3.6] for the proof.
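A discrete-time particle approximation conveys the idea behind the unnormalized filter for the continuous-observation part (jumps omitted, plain sequential importance sampling without resampling; the signal dynamics, the observation function h(x) = x, and all parameters below are illustrative choices, not taken from the chapter). Each particle carries an unnormalized weight accumulating the likelihood factor exp(h(X)/σ_1² dY − ½(h(X)/σ_1)² dt), and normalizing the weighted average reproduces the Kallianpur-Striebel relation π_t(f) = p_t(f)/p_t(1) of Eq. (33).

```python
import numpy as np

# Weighted-particle sketch of the unnormalized filter (continuous
# observations only, no resampling).
# Signal: OU process dX = -X dt + dW (illustrative choice).
# Observation: dY = h(X) dt + sig dB with h(x) = x (illustrative choice).
rng = np.random.default_rng(2)
n_part, n_steps, dt, sig = 2_000, 100, 0.01, 0.5

x_true = 1.0
X = rng.normal(0.0, 1.0, n_part)        # particles drawn from the prior
logw = np.zeros(n_part)                 # log of unnormalized weights
for _ in range(n_steps):
    x_true += -x_true * dt + np.sqrt(dt) * rng.normal()
    X += -X * dt + np.sqrt(dt) * rng.normal(size=n_part)
    dY = x_true * dt + sig * np.sqrt(dt) * rng.normal()
    # Zakai-type weight update: exp(h/sig^2 * dY - 0.5 * (h/sig)^2 * dt)
    logw += (X / sig**2) * dY - 0.5 * (X / sig) ** 2 * dt

w = np.exp(logw - logw.max())           # stabilized unnormalized weights
pi_mean = (w * X).sum() / w.sum()       # normalized filter mean p_t(x)/p_t(1)
print(f"filter mean {pi_mean:.3f}, true signal {x_true:.3f}")
assert np.isfinite(pi_mean) and (w >= 0).all()
```

Resampling steps and a jump-measure weight (driven by the η^0 term of Eq. (42)) would be added in a full implementation.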


#### 3.1. Uniqueness of the filtering equations

In this section, we show pathwise uniqueness for the solution of the Kushner-Stratonovich and the Zakai equations. The first result provides the equivalence of uniqueness of the solutions to the filtering Eqs. (21) and (42).

Theorem 15. Let Assumptions 4 and 10 and condition (38) hold. Then, strong uniqueness holds for the Kushner-Stratonovich equation (21) if and only if strong uniqueness holds for the Zakai equation (42).


Proof. The proof follows by Ref. [18, Theorems 4.5 and 4.6]. Here, note that Assumption 10 implies that the measures μ_{t-}(λφ(dz)) and π_{t-}(λφ(dz)) are equivalent.

Finally, strong uniqueness for the solution of both filtering equations is established in the subsequent theorems.

Theorem 16. Let (X, Y) be the partially observed system defined in Eq. (3), and assume, in addition to Assumptions 4 and 10 and condition (15), that

$$\sup_{t, x, y}\int_Z \left\{|K_0(t, x; \zeta)| + |K_1(t, x, y; \zeta)|\right\}\nu(\mathrm{d}\zeta) < \infty. \tag{43}$$

Let μ be a strong solution of the Kushner-Stratonovich equation. Then μ_t = π_t P-a.s. for every t ∈ [0, T].

Proof. See Ref. [17, Theorem 3.3].

Theorem 17. Let (X, Y) be the partially observed system in Eq. (3). Under Assumptions 4 and 10 and conditions (38) and (43), let ξ be a strong solution to the Zakai equation; then ξ_t = p_t P-a.s. for every t ∈ [0, T].

Proof. The proof follows by Ref. [18, Theorem 4.7], after noticing that under Assumption 10 the measures ξ_{t-}(λφ(dz)) and p_{t-}(λφ(dz)) are equivalent.

### 4. A financial application to risk minimization

In the current section, we focus on a financial application. We consider a simple financial market where agents may invest in a risky asset whose price is described by the process Y given in Eq. (3) and a riskless asset with price process B. Without loss of generality, we assume that B_t = 1 for every t ∈ [0, T]. We also assume throughout the section the following dynamics for the process Y:

$$\mathrm{d}Y_t = Y_t\left(\sigma(t, Y_t)\,\mathrm{d}W^1_t + \int_Z K(t, X_{t^-}, Y_{t^-}; \zeta)\left(N(\mathrm{d}t, \mathrm{d}\zeta) - \nu(\mathrm{d}\zeta)\,\mathrm{d}t\right)\right), \quad Y_0 = y_0 \in \mathbb{R}^+ \tag{44}$$

for some functions σ(t, y) and K(t, x, y; ζ) such that σ(t, y) > 0 and K(t, x, y; ζ) > −1.

This choice for the dynamics of Y has a double advantage. On the one hand, the geometric form, together with the condition K(t, x, y; ζ) > −1, guarantees nonnegativity, which is desirable when modeling prices. On the other hand, we are modeling Y directly under a martingale measure, and by Assumption 18, Y turns out to be a square integrable (F, P)-martingale.
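A short simulation illustrates both points for a constant-coefficient special case of Eq. (44) (σ, jump intensity and relative jump size below are illustrative assumptions, not the chapter's model). In that case the stochastic exponential has the closed form Y_t = y_0 exp(σW_t − ½σ²t − kλt)(1 + k)^{N_t}, so k > −1 makes every factor positive and E[Y_t] = y_0.

```python
import numpy as np

# Exact simulation of a constant-coefficient special case of Eq. (44):
#   Y_T = y0 * exp(sigma*W_T - 0.5*sigma^2*T - k*lam*T) * (1 + k)^{N_T},
# with illustrative sigma (volatility), lam (jump rate) and k (relative
# jump size, k > -1). Every factor is positive, so prices stay positive,
# and Y is a martingale: E[Y_T] = y0.
rng = np.random.default_rng(1)
n_paths, T = 100_000, 1.0
y0, sigma, lam, k = 1.0, 0.3, 2.0, -0.2   # k > -1

W_T = rng.normal(0.0, np.sqrt(T), size=n_paths)
N_T = rng.poisson(lam * T, size=n_paths)
Y_T = y0 * np.exp(sigma * W_T - 0.5 * sigma**2 * T - k * lam * T) * (1.0 + k) ** N_T

assert (Y_T > 0).all()              # K > -1 keeps prices positive
assert abs(Y_T.mean() - y0) < 0.01  # martingale property E[Y_T] = y0
```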

Considering Eq. (44) corresponds to taking, in system (3),

$$\begin{aligned} b_1(t, x, y) &= -y\int_Z K(t, x, y; \zeta)\,\nu(\mathrm{d}\zeta), \\ \sigma_1(t, y) &= y\,\sigma(t, y), \quad K_1(t, x, y; \zeta) = y\,K(t, x, y; \zeta). \end{aligned} \tag{45}$$

In addition, we make the following assumption.

#### Assumption 18.

$$0 < c_1 < \sigma(t, y) < c_2, \quad |K(t, x, y; \zeta)| < c_3, \quad \nu(D_t) < c_4, \tag{46}$$

for every (t, x, y) ∈ [0, T] × R × R^+, ζ ∈ Z, and for some positive constants c_1, c_2, c_3, c_4.

Remark 19. In the sequel, it might be useful to specify the dynamics of Y also in terms of the jump measure mðdt, dzÞ. Recalling Eqs. (6) and (14), we have

$$\mathrm{d}Y_t = Y_t\,\sigma(t, Y_t)\,\mathrm{d}W^1_t + \int_{\mathbb{R}} z\left(m(\mathrm{d}t, \mathrm{d}z) - \lambda(t, X_{t^-}, Y_{t^-})\,\phi(t, X_{t^-}, Y_{t^-}, \mathrm{d}z)\,\mathrm{d}t\right). \tag{47}$$

The stochastic factor X which affects intensity and jump size distribution of Y may represent the state of the economy and is not directly observable by market agents. This is a typical situation arising in real financial markets.

We model by F^Y the information available to investors. Since Y is F^Y-adapted, it is in particular an (F^Y, P)-martingale with the following decomposition:

$$Y\_t = y\_0 + \int\_0^t Y\_s \sigma(\mathbf{s}, Y\_s) \mathrm{d}I\_s + \int\_0^t \int\_{\mathbb{R}} z \left( m(\mathbf{ds}, \mathbf{dz}) - \pi\_{\mathbf{s}^-} (\lambda \phi(\mathbf{dz})) \mathrm{d}s \right), \quad t \in [0, T]. \tag{48}$$

By Eqs. (14) and (45), in this setting the first component of the innovation process I defined in Eq. (16) is given by

$$I_t = W^1_t + \int_0^t \frac{1}{Y_s\,\sigma(s, Y_s)}\int_{\mathbb{R}} z\left(\lambda(s, X_s, Y_s)\,\phi(s, X_s, Y_s, \mathrm{d}z) - \pi_s(\lambda\phi(\mathrm{d}z))\right)\mathrm{d}s.$$

Suppose that we are given a European-type contingent claim whose final payoff is a square integrable F^Y_T-measurable random variable ξ, that is, ξ ∈ L²(F^Y_T), where

$$L^2(\mathcal{F}^Y_T) := \left\{\mathcal{F}^Y_T\text{-measurable random variables } \Gamma : \mathbb{E}[\Gamma^2] < \infty\right\}. \tag{49}$$

The objective of the agent is to find the optimal hedging strategy for this derivative. Since the number of random sources exceeds the number of tradeable risky assets, the market is incomplete, and it is well known that in this setting perfect replication by self-financing strategies is not feasible. We therefore suppose that the investor pursues the risk-minimization approach. Risk minimization is a quadratic hedging method that determines a dynamic investment strategy replicating the claim perfectly with minimal cost. Let us properly introduce the objects of interest, starting with notation. For any pair of F-adapted (respectively, F^Y-adapted) processes Ψ^1, Ψ^2, we write 〈Ψ^1, Ψ^2〉^F for the predictable covariation computed with respect to the filtration F (respectively, 〈Ψ^1, Ψ^2〉^{F^Y} for the predictable covariation computed with respect to the filtration F^Y). Note that

$$\begin{aligned} \langle Y\rangle^{\mathbb{F}}_t &= \int_0^t Y_s^2\left(\sigma^2(s, Y_{s^-}) + \int_Z K^2(s, X_{s^-}, Y_{s^-}; \zeta)\,\nu(\mathrm{d}\zeta)\right)\mathrm{d}s \\ &= \int_0^t \left(Y_s^2\,\sigma^2(s, Y_{s^-}) + \int_{\mathbb{R}} z^2\,\lambda(s, X_{s^-}, Y_{s^-})\,\phi(s, X_{s^-}, Y_{s^-}, \mathrm{d}z)\right)\mathrm{d}s, \quad t \in [0, T] \end{aligned} \tag{50}$$

and since Y is also F^Y-adapted, we also have

$$\langle Y\rangle^{\mathbb{F}^Y}_t = \int_0^t \left(Y_s^2\,\sigma^2(s, Y_{s^-}) + \int_{\mathbb{R}} z^2\,\pi_{s^-}(\lambda\phi(\mathrm{d}z))\right)\mathrm{d}s, \quad t \in [0, T]. \tag{51}$$

We stress that, due to the presence of a jump component, the predictable quadratic variations of Y with respect to filtrations F and F<sup>Y</sup> are different.

Now we introduce, for technical purposes, two spaces, Θ(F) and Θ(F^Y).

Definition 20. The space Θ(F^Y) (respectively, Θ(F)) is the space of all F^Y-predictable (respectively, F-predictable) processes θ such that

$$\mathbb{E}\left[\int_0^T \theta_u^2\,\mathrm{d}\langle Y\rangle^{\mathbb{F}^Y}_u\right] < \infty \quad \left(\text{respectively} \quad \mathbb{E}\left[\int_0^T \theta_u^2\,\mathrm{d}\langle Y\rangle^{\mathbb{F}}_u\right] < \infty\right). \tag{52}$$

We observe that, for every θ ∈ Θ(F^Y), thanks to F^Y-predictability we have

$$\mathbb{E}\left[\int_0^T \theta_u^2\,\mathrm{d}\langle Y\rangle^{\mathbb{F}}_u\right] = \mathbb{E}\left[\int_0^T \theta_u^2\,\mathrm{d}\langle Y\rangle^{\mathbb{F}^Y}_u\right] < \infty, \tag{53}$$

which implies that Θ(F^Y) ⊆ Θ(F).

Since we have two different levels of information represented by the filtrations F and F<sup>Y</sup>, we may define two classes of admissible strategies.

Definition 21. An F^Y-strategy (respectively, F-strategy) is a pair ψ = (θ, η) of stochastic processes, where θ represents the amount invested in the risky asset and η the amount invested in the riskless asset, such that θ ∈ Θ(F^Y) (respectively, θ ∈ Θ(F)) and η is F^Y-adapted (respectively, F-adapted).

This definition reflects the fact that the investor's choices must be adapted to her/his knowledge of the market. The value of a strategy ψ = (θ, η) is given by

$$V_t(\psi) = \theta_t Y_t + \eta_t, \quad t \in [0, T], \tag{54}$$

and its cost is described by the process

$$C_t(\psi) = V_t(\psi) - \int_0^t \theta_u\,\mathrm{d}Y_u, \quad t \in [0, T]. \tag{55}$$

In other terms, the cost of a strategy is the difference between the value process and the gain process. For a self-financing strategy, the value and the gain processes coincide up to the initial wealth V_0, and therefore the cost is constant and equal to C_t = V_0 for every t ∈ [0, T]. We continue by defining the risk process in the partial information setting.
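Before that, the constant-cost property of self-financing strategies can be illustrated in discrete time (the price path, holdings and initial wealth below are toy numbers):

```python
import numpy as np

# Discrete-time illustration of Eqs. (54)-(55): for a self-financing
# strategy the cost C_t = V_t - sum_{u<t} theta_u * (Y_{u+1} - Y_u)
# is constant and equal to V_0, whatever the holdings theta are.
Y = np.array([100.0, 104.0, 99.0, 101.0])   # hypothetical price path
theta = np.array([1.0, 0.5, 2.0])           # shares held over (t, t+1]
V0 = 120.0                                  # initial wealth

gains = np.concatenate(([0.0], np.cumsum(theta * np.diff(Y))))
V = V0 + gains                              # self-financing wealth process
# cash account eta backed out from Eq. (54): V_t = theta_t*Y_t + eta_t
eta = V - np.concatenate((theta, theta[-1:])) * Y
C = V - gains                               # cost process, Eq. (55)

print(C)  # constant: no cash injections are needed after t = 0
assert np.allclose(C, V0)
```

Rebalancing with outside cash would make C move; risk minimization penalizes exactly those fluctuations.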

Definition 22. Given an F^Y-strategy (respectively, an F-strategy) ψ = (θ, η), we denote by R^{F^Y}(ψ) (respectively, R^F(ψ)) the associated risk process defined as

$$R^{\mathbb{F}^Y}_t(\psi) := \mathbb{E}\left[\left(C_T(\psi) - C_t(\psi)\right)^2 \,\middle|\, \mathcal{F}^Y_t\right] \quad \left(\text{respectively } R^{\mathbb{F}}_t(\psi) := \mathbb{E}\left[\left(C_T(\psi) - C_t(\psi)\right)^2 \,\middle|\, \mathcal{F}_t\right]\right), \tag{56}$$

for every t∈½0, T�.

Then, we have the following definition of risk-minimizing strategy under partial information.

Definition 23. An F^Y-strategy ψ is risk minimizing if

i. V_T(ψ) = ξ,

ii. for any other F^Y-strategy ψ̃ we have R^{F^Y}_t(ψ) ≤ R^{F^Y}_t(ψ̃), for every t ∈ [0, T].

The corresponding definitions of risk process and risk-minimizing strategy under full information are obtained by replacing F^Y and R^{F^Y}_t with F and R^F_t in Definition 23. To differentiate, when necessary, we use the terms F^Y-risk-minimizing strategy and F-risk-minimizing strategy. Criterion (ii) in Definition 23 can also be written as

$$\min_{\psi \in \Theta(\mathbb{F}^Y)} \mathbb{E}\left[\left(C_T(\psi) - C_t(\psi)\right)^2\right], \quad t \in [0, T], \tag{57}$$

which intuitively means that a strategy is risk minimizing if it minimizes the variance of the cost. This equivalent definition yields a nice property of risk-minimizing strategies: they turn out to be self-financing on average, that is, the cost process C is a martingale and therefore has constant expectation (see, e.g., Ref. [27, Lemma 2] or [28, Lemma 2.3]).

In the sequel, we aim to characterize the optimal hedging strategy for the contingent claim ξ under full and partial information, that is, the F- and the F^Y-risk-minimizing strategies. To this end, we introduce two orthogonal decompositions known as the Galtchouk-Kunita-Watanabe decompositions under full and partial information (see, e.g., [30]). To better understand the relevance of these decompositions, assume for a moment completeness of the market and full information. Then, it is well known that for every European-type contingent claim with final payoff ξ, there exists a self-financing strategy ψ = (θ, η) such that

$$\xi = V_0 + \int_0^T \theta_u\,\mathrm{d}Y_u \quad \mathbf{P}\text{-a.s.,} \tag{58}$$

that is, a replicating portfolio is uniquely determined by the initial wealth and the investment in the risky asset. When the market is incomplete, decomposition Eq. (58) does not hold in general. Intuitively, this implies that we might expect additional terms in Eq. (58), and according to the risk-minimization criterion, these additional terms need to be such that the final cost does not deviate too much from the average cost, in the quadratic sense. Specifically, we have the following decomposition of the random variable ξ:

$$\xi = V_0 + \int_0^T \theta_u\,\mathrm{d}Y_u + G_T \quad \mathbf{P}\text{-a.s.,} \tag{59}$$

where G_T is the value at time T of a suitable process G. The minimality criterion requires that G be a martingale orthogonal to Y. We refer the reader to Ref. [28] for a detailed survey. Under suitable hypotheses, the above decomposition takes the name of the Galtchouk-Kunita-Watanabe decomposition.
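A one-period toy model (illustrative numbers, with dY a martingale increment under the given probabilities) shows the mechanics of decomposition Eq. (59): the hedgeable part is the L²-projection of ξ onto dY, and the residual G is orthogonal to the traded gains.

```python
import numpy as np

# One-period sketch of Eq. (59) in a trinomial model (toy numbers).
# With dY a martingale increment, the projection coefficient is
#   theta = E[xi * dY] / E[dY^2],
# and the residual G = xi - V0 - theta*dY satisfies E[G] = 0 and
# E[G * dY] = 0, i.e. G is orthogonal to the gains from trading.
dY = np.array([-2.0, 0.0, 2.0])          # price moves, E[dY] = 0
p = np.array([0.25, 0.5, 0.25])          # state probabilities
xi = np.array([4.0, 1.0, 0.0])           # claim payoff in each state

theta = (p * xi * dY).sum() / (p * dY**2).sum()
V0 = (p * xi).sum() - theta * (p * dY).sum()   # = E[xi], since E[dY] = 0
G = xi - V0 - theta * dY

assert np.isclose((p * G).sum(), 0.0)        # G has mean zero
assert np.isclose((p * G * dY).sum(), 0.0)   # G is orthogonal to dY
```

In continuous time, the same projection is what Eq. (62) expresses through dual predictable projections.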

Now we wish to be more formal, and we introduce the following definitions:

Consider a random variable ξ ∈ L²(F^Y_T). Since F^Y_T ⊆ F_T, we can define the following decompositions for ξ.

Definition 24. a. The Galtchouk-Kunita-Watanabe decomposition of ξ ∈ L²(F^Y_T) with respect to Y and F is given by

$$\xi = U^{\mathbb{F}}_0 + \int_0^T \theta^{\mathbb{F}}_u\,\mathrm{d}Y_u + G^{\mathbb{F}}_T \quad \mathbf{P}\text{-a.s.,} \tag{60}$$

where U^F_0 ∈ L²(F_0), θ^F ∈ Θ(F) and G^F is a square integrable (F, P)-martingale, with G^F_0 = 0, orthogonal to Y, that is, 〈G^F, Y〉^F_t = 0 for every t ∈ [0, T].

b. The Galtchouk-Kunita-Watanabe decomposition of ξ ∈ L²(F^Y_T) with respect to Y and F^Y is given by

$$\xi = U^{\mathbb{F}^Y}_0 + \int_0^T \theta^{\mathbb{F}^Y}_u\,\mathrm{d}Y_u + G^{\mathbb{F}^Y}_T \quad \mathbf{P}\text{-a.s.,} \tag{61}$$

where U^{F^Y}_0 ∈ L²(F^Y_0), θ^{F^Y} ∈ Θ(F^Y) and G^{F^Y} is a square integrable (F^Y, P)-martingale, with G^{F^Y}_0 = 0, strongly orthogonal to Y, that is, 〈G^{F^Y}, Y〉^{F^Y}_t = 0 for every t ∈ [0, T].

In the sequel, we refer to Eqs. (60) and (61) as the Galtchouk-Kunita-Watanabe decompositions under full information and under partial information, respectively. Since Y is a square integrable martingale with respect to both filtrations F and F<sup>Y</sup>, decompositions Eqs. (60) and (61) exist.

The next proposition provides a relation between the integrands θ^F and θ^{F^Y} of decompositions Eqs. (60) and (61) in terms of predictable projections. For any (F, P)-predictable process A of finite variation, we denote by A^{p,F^Y} its (F^Y, P)-dual predictable projection.¹

Proposition 25. The integrands in decompositions Eqs. (60) and (61) satisfy the following relation:

$$\theta^{\mathbb{F}^Y}_t = \frac{\mathrm{d}\left(\int_0^t \theta^{\mathbb{F}}_u\,\mathrm{d}\langle Y\rangle^{\mathbb{F}}_u\right)^{p,\mathbb{F}^Y}}{\mathrm{d}\langle Y\rangle^{p,\mathbb{F}^Y}_t}, \quad t \in [0, T]. \tag{62}$$

Here, 〈Y〉^{p,F^Y} denotes the (F^Y, P)-dual predictable projection of 〈Y〉^F, and it is given by

¹ We call the (F^Y, P)-dual predictable projection of a process A the F^Y-predictable finite variation process A^{p,F^Y} such that, for any F^Y-predictable bounded process φ, we have

$$\mathbb{E}\left[\int_0^T \phi_s\,\mathrm{d}A_s\right] = \mathbb{E}\left[\int_0^T \phi_s\,\mathrm{d}A^{p,\mathbb{F}^Y}_s\right].$$

$$\langle Y\rangle^{p,\mathbb{F}^Y}_t = \langle Y\rangle^{\mathbb{F}^Y}_t = \int_0^t Y_s^2\,\sigma^2(s, Y_{s^-})\,\mathrm{d}s + \int_0^t\int_{\mathbb{R}} z^2\,\pi_{s^-}(\lambda\phi(\mathrm{d}z))\,\mathrm{d}s, \quad t \in [0, T]. \tag{63}$$

Proof. First note that the (F^Y, P)-dual predictable projection of the process 〈Y〉^F coincides with the predictable quadratic variation of Y itself computed with respect to its internal filtration, given in Eq. (51), since for any (F^Y, P)-predictable (bounded) process φ we have that

$$\mathbb{E}\left[\int_0^T \phi_t\,\mathrm{d}\langle Y\rangle^{\mathbb{F}}_t\right] = \mathbb{E}\left[\int_0^T \phi_t\,\mathrm{d}\langle Y\rangle^{\mathbb{F}^Y}_t\right].$$

This proves Eq. (63).

Let


Proof. Define the $\mathbb{F}^Y$-predictable process

$$\theta\_t := \frac{\mathrm{d}\left(\int\_0^t \theta\_u^{\mathcal{F}} \, \mathrm{d}\langle Y \rangle\_u^{\mathbb{F}}\right)^{p,\mathbb{F}^Y}}{\mathrm{d}\langle Y \rangle\_t^{p,\mathbb{F}^Y}}, \quad t \in [0, T]. \tag{64}$$

By the Galtchouk-Kunita-Watanabe decomposition Eq. (60), we can write

$$\xi = \mathcal{U}\_0^{\mathcal{F}} + \int\_0^T \theta\_\mathbf{u} \mathbf{d}Y\_u + \mathcal{G}\_T^{\mathcal{F}} + \tilde{G}\_T \quad \mathbf{P}-\text{ a.s.},\tag{65}$$

where $\tilde{G}\_t := \int\_0^t (\theta\_u^{\mathcal{F}} - \theta\_u) \, \mathrm{d}Y\_u$ for every $t \in [0, T]$. We observe that for every $\mathbb{F}^Y$-predictable process $\varphi$ the following holds:

$$\mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u \, \mathrm{d}\langle Y\rangle\_u^{p,\mathbb{F}^Y}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \, \mathrm{d}\left(\int\_0^u \theta\_r^{\mathcal{F}} \, \mathrm{d}\langle Y\rangle\_r^{\mathbb{F}}\right)^{p,\mathbb{F}^Y}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u^{\mathcal{F}} \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right]. \tag{66}$$

By choosing φ = θ and applying the Cauchy-Schwarz inequality, we obtain

$$\mathbb{E}\left[\int\_0^T (\theta\_u)^2 \, \mathrm{d}\langle Y\rangle\_u^{p,\mathbb{F}^Y}\right] \leq \mathbb{E}\left[\int\_0^T (\theta\_u^{\mathcal{F}})^2 \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right] < \infty. \tag{67}$$

This implies that $\theta \in \Theta(\mathbb{F}^Y) \subseteq \Theta(\mathbb{F})$ and that $\tilde{G}$ is an $(\mathbb{F}, \mathbf{P})$-martingale. Taking the conditional expectation with respect to $\mathcal{F}\_T^Y$ in Eq. (65) leads to

$$\xi = \mathbb{E}\left[U\_0^{\mathcal{F}} \middle| \mathcal{F}\_T^Y\right] + \int\_0^T \theta\_u \, \mathrm{d}Y\_u + \mathbb{E}\left[G\_T^{\mathcal{F}} \middle| \mathcal{F}\_T^Y\right] + \mathbb{E}\left[\tilde{G}\_T \middle| \mathcal{F}\_T^Y\right] = \mathbb{E}\left[U\_0^{\mathcal{F}} \middle| \mathcal{F}\_0^Y\right] + \int\_0^T \theta\_u \, \mathrm{d}Y\_u + \widehat{G}\_T^{\mathcal{F}^Y} \quad \mathbf{P}\text{-a.s.}, \tag{68}$$

where

$$\widehat{\mathbf{G}}\_{t}^{\mathcal{F}^{\mathcal{Y}}} := \mathbb{E}\left[\mathbf{U}\_{0}^{\mathcal{F}}|\mathcal{F}\_{t}^{\mathcal{Y}}\right] - \mathbb{E}\left[\mathbf{U}\_{0}^{\mathcal{F}}|\mathcal{F}\_{0}^{\mathcal{Y}}\right] + \mathbb{E}\left[\mathbf{G}\_{T}^{\mathcal{F}}|\mathcal{F}\_{t}^{\mathcal{Y}}\right] + \mathbb{E}\left[\widetilde{\mathbf{G}}\_{T}|\mathcal{F}\_{t}^{\mathcal{Y}}\right], \quad t \in [0, T]. \tag{69}$$

This provides the Galtchouk-Kunita-Watanabe decomposition Eq. (61) if we can show that the $(\mathbb{F}^Y, \mathbf{P})$-martingale $\widehat{G}^{\mathcal{F}^Y}$ is strongly orthogonal to $Y$, that is, if for any $\mathbb{F}^Y$-predictable (bounded) process $\varphi$ the following holds:

$$\mathbb{E}\left[\widehat{G}\_T^{\mathcal{F}^Y}\int\_0^T \varphi\_u \, \mathrm{d}Y\_u\right] = 0. \tag{70}$$

Note that the orthogonality of the term $\mathbb{E}\left[U\_0^{\mathcal{F}} \middle| \mathcal{F}\_t^Y\right] - \mathbb{E}\left[U\_0^{\mathcal{F}} \middle| \mathcal{F}\_0^Y\right] + \mathbb{E}\left[G\_T^{\mathcal{F}} \middle| \mathcal{F}\_t^Y\right]$ follows by the orthogonality of $G^{\mathcal{F}}$ and $Y$. Moreover, we have

$$\mathbb{E}\left[\mathbb{E}\left[\tilde{G}\_T \middle| \mathcal{F}\_T^Y\right]\int\_0^T \varphi\_u \, \mathrm{d}Y\_u\right] = \mathbb{E}\left[\tilde{G}\_T\int\_0^T \varphi\_u \, \mathrm{d}Y\_u\right] = \mathbb{E}\left[\int\_0^T \varphi\_u (\theta\_u^{\mathcal{F}} - \theta\_u) \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right], \tag{71}$$

and by Eq. (64)

$$\mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u \, \mathrm{d}\langle Y\rangle\_u^{p,\mathbb{F}^Y}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \, \mathrm{d}\left(\int\_0^u \theta\_r^{\mathcal{F}} \, \mathrm{d}\langle Y\rangle\_r\right)^{p,\mathbb{F}^Y}\right] = \mathbb{E}\left[\int\_0^T \varphi\_u \theta\_u^{\mathcal{F}} \, \mathrm{d}\langle Y\rangle\_u^{\mathbb{F}}\right], \tag{72}$$

which proves strong orthogonality. □
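The projection identity used in Eqs. (66) and (72) has a transparent discrete-time analogue that can be checked directly. In the toy model below (two coin flips; all ingredients are illustrative and not part of the chapter's setting), the dual predictable projection replaces each increment of $A$ by its conditional expectation given the coarser information available one step earlier, and expectations of predictable integrals are preserved:

```python
import itertools

# Discrete-time toy: two fair coin flips; full information sees both flips,
# partial information G sees only the first. A has increments dA_k that may
# depend on both flips; the dual predictable projection replaces dA_k by
# E[dA_k | G_{k-1}].
outcomes = list(itertools.product([0, 1], repeat=2))  # (flip1, flip2), prob 1/4 each

def dA(k, w):                      # increment of A at step k on outcome w
    return (1 + w[0]) if k == 1 else (w[0] + 2 * w[1])

def phi(k, w):                     # G-predictable integrand: at step 2 it may
    return 1.0 if k == 1 else (2.0 if w[0] == 1 else 0.5)  # depend on flip1 only

def dAp(k, w):                     # projected increment E[dA_k | G_{k-1}]
    if k == 1:                     # G_0 is trivial: average over everything
        return sum(dA(1, v) for v in outcomes) / len(outcomes)
    group = [v for v in outcomes if v[0] == w[0]]   # outcomes sharing flip1
    return sum(dA(2, v) for v in group) / len(group)

def expect(f):                     # expectation under the uniform law
    return sum(f(w) for w in outcomes) / len(outcomes)

lhs = expect(lambda w: sum(phi(k, w) * dAp(k, w) for k in (1, 2)))
rhs = expect(lambda w: sum(phi(k, w) * dA(k, w) for k in (1, 2)))
print(lhs, rhs)   # the two expectations coincide (both 3.75)
```

The same tower-property computation is what drives the continuous-time equalities above.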

Theorem 26 shows the relation between the Galtchouk-Kunita-Watanabe decompositions and the optimal strategies under full and partial information.

Theorem 26. i. Every contingent claim $\xi \in L^2(\mathcal{F}\_T^Y, \mathbf{P})$ admits a unique $\mathbb{F}$-risk-minimizing strategy $\psi^{\*,\mathcal{F}} = (\theta^{\*,\mathcal{F}}, \eta^{\*,\mathcal{F}})$, explicitly given by

$$
\theta^{\*,\mathcal{F}} = \theta^{\mathcal{F}}, \quad \eta^{\*,\mathcal{F}} = V(\psi^{\*,\mathcal{F}}) - \theta^{\*,\mathcal{F}}Y,\tag{73}
$$

where $V\_t(\psi^{\*,\mathcal{F}}) = \mathbb{E}\left[\xi \middle| \mathcal{F}\_t\right]$ for every $t \in [0, T]$, with minimal cost

$$\mathbb{C}\_t(\psi^{\*,\mathcal{F}}) = \mathcal{U}\_0^{\mathcal{F}} + \mathcal{G}\_t^{\mathcal{F}}, \quad t \in [0, T]. \tag{74}$$

Here, $\theta^{\mathcal{F}}$, $U\_0^{\mathcal{F}}$, and $G^{\mathcal{F}}$ are given in Definition 24 part a.

ii. Moreover, it also admits a unique $\mathbb{F}^Y$-risk-minimizing strategy $\psi^{\*,\mathcal{F}^Y} = (\theta^{\*,\mathcal{F}^Y}, \eta^{\*,\mathcal{F}^Y})$, explicitly given by

$$\theta^{\*,\mathcal{F}^Y} = \theta^{\mathcal{F}^Y}, \quad \eta^{\*,\mathcal{F}^Y} = V(\psi^{\*,\mathcal{F}^Y}) - \theta^{\*,\mathcal{F}^Y} Y, \tag{75}$$

where $V\_t(\psi^{\*,\mathcal{F}^Y}) = \mathbb{E}\left[\xi \middle| \mathcal{F}\_t^Y\right]$ for every $t \in [0, T]$, with minimal cost

$$C\_t(\psi^{\*,\mathcal{F}^Y}) = U\_0^{\mathcal{F}^Y} + G\_t^{\mathcal{F}^Y}, \quad t \in [0, T], \tag{76}$$

and $\theta^{\mathcal{F}^Y}$, $U\_0^{\mathcal{F}^Y}$, and $G^{\mathcal{F}^Y}$ are given in Definition 24 part b.


Proof. The proof of part i. is given, for example, in Ref. [28, Theorem 2.4]. For part ii., note that using the martingale representation of $Y$ with respect to its inner filtration given in Eq. (48) and the fact that $\xi \in L^2(\mathcal{F}\_T^Y)$, it is possible to reduce the partial information case to the full information one and apply again [28, Theorem 2.4]. □

Proposition 25 helps us in the computation of the optimal strategy under partial information. Indeed, it is sufficient to compute the corresponding strategy $\theta^{\*,\mathcal{F}}$ under full information and the Radon-Nikodym derivative given in Eq. (62). To get more explicit representations, we assume that the payoff of the contingent claim has the form $\xi = H(T, Y\_T)$ for some function $H : [0, T] \times \mathbb{R}^+ \to \mathbb{R}$. Let $\mathcal{L}^{X,Y}$ denote the Markov generator of the pair $(X, Y)$, that is,

$$\begin{split} \mathcal{L}^{X,Y} f(t, x, y) &= \frac{\partial f}{\partial t} + b\_0(t, x) \frac{\partial f}{\partial x} + b\_1(t, x, y) \frac{\partial f}{\partial y} + \frac{1}{2} \sigma\_0^2(t, x) \frac{\partial^2 f}{\partial x^2} + \rho y \sigma\_0(t, x) \sigma(t, y) \frac{\partial^2 f}{\partial x \partial y} \\ &\quad + \frac{1}{2} y^2 \sigma^2(t, y) \frac{\partial^2 f}{\partial y^2} + \int\_Z \Delta f(t, x, y; \zeta) \, \nu(\mathrm{d}\zeta) \end{split} \tag{77}$$

for every $f \in C\_b^{1,2,2}([0, T] \times \mathbb{R} \times \mathbb{R}^+)$, where

$$
\Delta f(t, \mathbf{x}, \mathbf{y}; \boldsymbol{\zeta}) := f(t, \mathbf{x} + K\_0(t, \mathbf{x}; \boldsymbol{\zeta}), \mathbf{y}(1 + K(t, \mathbf{x}, \mathbf{y}; \boldsymbol{\zeta}))) - f(t, \mathbf{x}, \mathbf{y}).\tag{78}
$$

By the Markov property, we have that for any $t \in [0, T]$ there exists a measurable function $h(t, x, y)$ such that

$$h(t, X\_t, \mathcal{Y}\_t) = \mathbb{E}[H(T, \mathcal{Y}\_T) | \mathcal{F}\_t]. \tag{79}$$
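Eq. (79) also suggests a direct numerical route to $h$: simulate the pair forward from time $t$ and average the payoff. The sketch below does this for a hypothetical driftless geometric Brownian motion for $Y$ (no jumps, no dependence on $X$) — a stand-in for illustration, not the chapter's model. For the payoff $H(T, y) = y$ the martingale property forces $h(t, y) = y$, which the estimate recovers:

```python
import math, random

random.seed(42)

def h_mc(t, y, T, H, sigma=0.2, n_paths=20000):
    """Monte Carlo estimate of h(t, y) = E[H(T, Y_T) | Y_t = y] for a toy
    driftless geometric Brownian motion dY = sigma * Y dW (no jumps)."""
    dt = T - t
    total = 0.0
    for _ in range(n_paths):
        z = random.gauss(0.0, 1.0)
        y_T = y * math.exp(-0.5 * sigma ** 2 * dt + sigma * math.sqrt(dt) * z)
        total += H(T, y_T)
    return total / n_paths

# For H(T, y) = y the estimate should be close to y itself.
est = h_mc(0.0, 2.0, 1.0, lambda T, y: y)
print(est)   # ~ 2.0 up to Monte Carlo error
```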

If the function $h$ is sufficiently regular, for instance $h \in C\_b^{1,2,2}([0, T] \times \mathbb{R} \times \mathbb{R}^+)$, we can apply Itô's formula and get

$$h(\mathbf{t}, \mathbf{X}\_{\mathbf{t}}, \mathbf{Y}\_{\mathbf{t}}) = h(\mathbf{0}, \mathbf{X}\_{\mathbf{0}}, \mathbf{Y}\_{\mathbf{0}}) + \int\_{0}^{t} \mathcal{L}^{\mathbf{X}, \mathbf{Y}} h(\mathbf{s}, \mathbf{X}\_{\mathbf{s}}, \mathbf{Y}\_{\mathbf{s}}) d\mathbf{s} + \mathcal{M}\_{\mathbf{t}}^{\mathbf{h}} \tag{80}$$

where $M^h$ is the $(\mathbb{F}, \mathbf{P})$-martingale given by

$$\begin{split} M\_t^h &= \int\_0^t \frac{\partial h}{\partial x}(s, X\_s, Y\_s) \, \sigma\_0(s, X\_s) \, \mathrm{d}W\_s^0 + \int\_0^t \frac{\partial h}{\partial y}(s, X\_s, Y\_s) \, Y\_s \, \sigma(s, Y\_s) \, \mathrm{d}W\_s^1 \\ &\quad + \int\_0^t \int\_Z \Delta h(s, X\_{s^-}, Y\_{s^-}; \zeta) \left(N(\mathrm{d}s, \mathrm{d}\zeta) - \nu(\mathrm{d}\zeta) \, \mathrm{d}s\right). \end{split} \tag{81}$$

By Eq. (79), the process $\{h(t, X\_t, Y\_t),\ t \in [0, T]\}$ is an $(\mathbb{F}, \mathbf{P})$-martingale. Then the finite variation term vanishes, which means that the function $h$ satisfies $\mathcal{L}^{X,Y} h(t, X\_t, Y\_t) = 0$, $\mathbf{P}$-a.s. and for almost every $t \in [0, T]$. The next proposition provides the risk-minimizing strategy under partial information.
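The condition $\mathcal{L}^{X,Y} h = 0$ can be sanity-checked in a degenerate special case of Eq. (77) — no jumps, no $y$-dependence, $b\_0 = 0$, $\sigma\_0 = 1$ — where the generator reduces to $\partial\_t + \frac{1}{2}\partial\_{xx}$ and $h(t, x) = x^2 - t$ is a classical solution (the step size and evaluation point below are arbitrary):

```python
# Degenerate case of the generator: no jumps, no y-dependence, b0 = 0,
# sigma0 = 1, so L h = dh/dt + 0.5 * d^2 h/dx^2. For h(t, x) = x^2 - t
# this vanishes identically.
def h(t, x):
    return x * x - t

def L(h, t, x, eps=1e-4):
    dt = (h(t + eps, x) - h(t - eps, x)) / (2 * eps)                 # central d/dt
    dxx = (h(t, x + eps) - 2 * h(t, x) + h(t, x - eps)) / eps ** 2   # d^2/dx^2
    return dt + 0.5 * dxx

print(L(h, 0.7, 1.3))   # ~ 0 up to finite-difference error
```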

Proposition 27. Assume $h \in C\_b^{1,2,2}([0, T] \times \mathbb{R} \times \mathbb{R}^+)$. Then the first components $\theta^{\*,\mathcal{F}}$ and $\theta^{\*,\mathcal{F}^Y}$ of the risk-minimizing strategies under full and partial information are given by

$$\theta\_t^{\*,\mathcal{F}} = \frac{g(t, X\_{t^-}, Y\_{t^-})}{Y\_{t^-}^2 \sigma^2(t, Y\_{t^-}) + \int\_{\mathbb{R}} z^2 \lambda(t, X\_{t^-}, Y\_{t^-}) \varphi(t, X\_{t^-}, Y\_{t^-}, \mathrm{d}z)}, \quad t \in [0, T], \tag{82}$$

$$\theta\_t^{\*,\mathcal{F}^Y} = \frac{\pi\_{t^-}(g)}{Y\_{t^-}^2 \sigma^2(t, Y\_{t^-}) + \int\_{\mathbb{R}} z^2 \pi\_{t^-}(\lambda\varphi(\mathrm{d}z))}, \quad t \in [0, T], \tag{83}$$

respectively, where the function g(t, x, y) is

$$g(t, x, y) = \rho \, \sigma\_0(t, x) y \sigma(t, y) \frac{\partial h}{\partial x} + y^2 \sigma^2(t, y) \frac{\partial h}{\partial y} + \int\_Z y K(t, x, y; \zeta) \Delta h(t, x, y; \zeta) \, \nu(\mathrm{d}\zeta). \tag{84}$$

Proof. Consider decomposition Eq. (60) for $\xi = H(T, Y\_T)$. Then, conditioning on $\mathcal{F}\_t$, we get

$$h(t, \mathbf{X}\_t, \mathbf{Y}\_t) = \mathcal{U}\_0 + \int\_0^t \boldsymbol{\theta}\_s^{\*, \mathcal{F}} \mathbf{d} \mathbf{Y}\_s + \mathbf{G}\_t^{\mathcal{F}}.\tag{85}$$

Taking the covariation with respect to Y and F, we obtain

$$\langle h(\cdot, X, Y), Y \rangle\_t^{\mathbb{F}} = \int\_0^t \theta\_s^{\*,\mathcal{F}} \, \mathrm{d}\langle Y \rangle\_s^{\mathbb{F}}. \tag{86}$$

On the other hand, $h(t, X\_t, Y\_t) = h(0, X\_0, Y\_0) + M\_t^h$; then, taking Eqs. (81) and (44) into account, we get that

$$\langle h(\cdot, X, Y), Y \rangle\_t^{\mathbb{F}} = \int\_0^t g(s, X\_s, Y\_s) \, \mathrm{d}s, \tag{87}$$

where $g(t, x, y)$ is given in Eq. (84). Hence, by Eqs. (50) and (87), we may represent $\theta^{\*,\mathcal{F}}$ as

$$\theta\_t^{\*,\mathcal{F}} = \frac{\mathrm{d}\langle h(\cdot, X, Y), Y\rangle\_t^{\mathbb{F}}}{\mathrm{d}\langle Y\rangle\_t^{\mathbb{F}}} = \frac{g(t, X\_{t^-}, Y\_{t^-})}{Y\_{t^-}^2 \sigma^2(t, Y\_{t^-}) + \int\_{\mathbb{R}} z^2 \lambda(t, X\_{t^-}, Y\_{t^-}) \varphi(t, X\_{t^-}, Y\_{t^-}, \mathrm{d}z)}. \tag{88}$$

Note that by Eq. (51) and

$$\left(\int\_0^t \theta\_u^{\*,\mathcal{F}} \, \mathrm{d}\langle Y \rangle\_u^{\mathbb{F}} \right)^{p,\mathbb{F}^Y} = \left(\int\_0^t g(s, X\_s, Y\_s) \, \mathrm{d}s\right)^{p,\mathbb{F}^Y} = \int\_0^t \pi\_s(g) \, \mathrm{d}s, \tag{89}$$

applying Eq. (62), we get representation Eq. (83). □
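In applications the filter $\pi\_{t^-}(\cdot)$ appearing in Eq. (83) is rarely known in closed form and is typically approximated, for instance by a weighted-particle representation of the conditional law of the unobserved state $X$. The sketch below is only illustrative: the functions `g` and `Sigma` are hypothetical stand-ins for the quantities in Eqs. (82) and (84) (with the jump term omitted), not the chapter's actual model.

```python
import math

# Hypothetical stand-ins for the ingredients of Eqs. (82)-(84), with the
# jump term omitted; none of these come from the chapter's actual model.
def g(t, x, y):
    return 0.3 * x * y * math.exp(-t)            # toy numerator g(t, x, y)

def Sigma(t, x, y):
    return y * y * (0.2 + 0.1 * abs(x)) ** 2     # toy Y^2 sigma^2 term

def pi(f, particles, weights, t, y):
    """Particle estimate of the filter: pi_t(f) ~ sum_i w_i f(t, x_i, y)."""
    return sum(w * f(t, x, y) for x, w in zip(particles, weights))

def theta_full(t, x, y):
    """theta^{*,F} of Eq. (82): full information plugs in the signal X."""
    return g(t, x, y) / Sigma(t, x, y)

def theta_partial(t, particles, weights, y):
    """theta^{*,F^Y} of Eq. (83): replace g and Sigma by filter estimates."""
    return pi(g, particles, weights, t, y) / pi(Sigma, particles, weights, t, y)

# Usage with a 3-particle approximation of the unobserved state X:
particles, weights = [0.8, 1.0, 1.3], [0.2, 0.5, 0.3]
print(theta_full(0.5, 1.0, 2.0), theta_partial(0.5, particles, weights, 2.0))
```

With a single particle sitting exactly at the true state the two strategies coincide, matching the intuition that a perfect filter restores full information.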

Our ultimate objective in this section is to investigate the relation between the costs of the $\mathbb{F}$-optimal strategy and the $\mathbb{F}^Y$-optimal strategy, or equivalently the associated risk processes.

It clearly holds that $\theta^{\*,\mathcal{F}^Y} \in \Theta(\mathbb{F})$, and then the $\mathbb{F}^Y$-risk-minimizing strategy is also an $\mathbb{F}$-strategy. Considering the corresponding risks, we have

$$\begin{split} \mathbb{E}\left[\left(C\_T(\psi^{\*,\mathcal{F}^Y}) - C\_t(\psi^{\*,\mathcal{F}^Y})\right)^2 \middle| \mathcal{F}\_t^Y\right] &= \mathbb{E}\left[\mathbb{E}\left[\left(C\_T(\psi^{\*,\mathcal{F}^Y}) - C\_t(\psi^{\*,\mathcal{F}^Y})\right)^2 \middle| \mathcal{F}\_t\right] \middle| \mathcal{F}\_t^Y\right] \\ &\geq \mathbb{E}\left[\mathbb{E}\left[\left(C\_T(\psi^{\*,\mathcal{F}}) - C\_t(\psi^{\*,\mathcal{F}})\right)^2 \middle| \mathcal{F}\_t\right] \middle| \mathcal{F}\_t^Y\right] = \mathbb{E}\left[\left(C\_T(\psi^{\*,\mathcal{F}}) - C\_t(\psi^{\*,\mathcal{F}})\right)^2 \middle| \mathcal{F}\_t^Y\right], \end{split} \tag{90}$$

and then $\mathbb{E}\left[R\_t^{\mathbb{F}}(\psi^{\*,\mathcal{F}})\right] \leq \mathbb{E}\left[R\_t^{\mathbb{F}^Y}(\psi^{\*,\mathcal{F}^Y})\right]$ for every $t \in [0, T]$. In the remaining part of the paper, we assume that $\mathcal{F}\_0^Y = \mathcal{F}\_0 = \{\Omega, \emptyset\}$, and we wish to measure the difference in the total risk taken by an informed investor, endowed with the filtration $\mathbb{F}$, and a partially informed investor, whose information is described by $\mathbb{F}^Y$. Precisely, we compute the difference $R\_0^{\mathbb{F}^Y}(\psi^{\*,\mathcal{F}^Y}) - R\_0^{\mathbb{F}}(\psi^{\*,\mathcal{F}})$. By decompositions Eqs. (60) and (61), we have that $C\_T(\psi^{\*,\mathcal{F}}) - C\_0(\psi^{\*,\mathcal{F}}) = G\_T^{\mathcal{F}}$ and $C\_T(\psi^{\*,\mathcal{F}^Y}) - C\_0(\psi^{\*,\mathcal{F}^Y}) = G\_T^{\mathcal{F}^Y}$, and also

$$G\_T^{\mathcal{F}^Y} = U\_0^{\mathcal{F}} - U\_0^{\mathcal{F}^Y} + \int\_0^T (\theta\_r^{\*,\mathcal{F}} - \theta\_r^{\*,\mathcal{F}^Y}) \, \mathrm{d}Y\_r + G\_T^{\mathcal{F}}, \tag{91}$$

since $\mathcal{F}\_0^Y = \mathcal{F}\_0 = \{\Omega, \emptyset\}$ implies $U\_0^{\mathcal{F}} = U\_0^{\mathcal{F}^Y}$. Then, computing the square of $G\_T^{\mathcal{F}^Y}$ and taking the expectation, we get

$$\mathbb{E}\left[\left(G\_T^{\mathcal{F}^Y}\right)^2\right] = \mathbb{E}\left[\left(G\_T^{\mathcal{F}}\right)^2\right] + \mathbb{E}\left[\left(\int\_0^T \left(\theta\_r^{\*,\mathcal{F}} - \theta\_r^{\*,\mathcal{F}^Y}\right) \mathrm{d}Y\_r\right)^2\right] + 2\,\mathbb{E}\left[G\_T^{\mathcal{F}}\int\_0^T \left(\theta\_r^{\*,\mathcal{F}} - \theta\_r^{\*,\mathcal{F}^Y}\right) \mathrm{d}Y\_r\right]. \tag{92}$$

It follows from the Itô isometry and the fact that $G^{\mathcal{F}}$ is orthogonal to $Y$ that

$$\mathbb{E}\left[\left(G\_T^{\mathcal{F}^Y}\right)^2\right] = \mathbb{E}\left[\left(G\_T^{\mathcal{F}}\right)^2\right] + \mathbb{E}\left[\int\_0^T \left(\theta\_r^{\*,\mathcal{F}} - \theta\_r^{\*,\mathcal{F}^Y}\right)^2 \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right]. \tag{93}$$

Then the difference that we want to evaluate becomes

$$\begin{split} R\_0^{\mathbb{F}^Y}(\psi^{\*,\mathcal{F}^Y}) - R\_0^{\mathbb{F}}(\psi^{\*,\mathcal{F}}) &= \mathbb{E}\left[\left(G\_T^{\mathcal{F}^Y}\right)^2\right] - \mathbb{E}\left[\left(G\_T^{\mathcal{F}}\right)^2\right] = \mathbb{E}\left[\int\_0^T (\theta\_r^{\*,\mathcal{F}} - \theta\_r^{\*,\mathcal{F}^Y})^2 \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right] \\ &= \mathbb{E}\left[\int\_0^T (\theta\_r^{\*,\mathcal{F}})^2 \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right] + \mathbb{E}\left[\int\_0^T (\theta\_r^{\*,\mathcal{F}^Y})^2 \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right] - 2\,\mathbb{E}\left[\int\_0^T \theta\_r^{\*,\mathcal{F}^Y} \theta\_r^{\*,\mathcal{F}} \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right]. \end{split} \tag{94}$$

Using Eq. (62) and the definition of the $\mathbb{F}^Y$-dual-predictable projection, we have that

$$\mathbb{E}\left[\int\_0^t \theta\_r^{\*,\mathcal{F}^Y} \theta\_r^{\*,\mathcal{F}} \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right] = \mathbb{E}\left[\int\_0^t (\theta\_r^{\*,\mathcal{F}^Y})^2 \, \mathrm{d}\langle Y \rangle\_r^{p,\mathbb{F}^Y} \right] = \mathbb{E}\left[\int\_0^t (\theta\_r^{\*,\mathcal{F}^Y})^2 \, \mathrm{d}\langle Y \rangle\_r^{\mathbb{F}} \right], \tag{95}$$

which implies


$$R\_0^{\mathbb{F}^Y}(\psi^{\*,\mathcal{F}^Y}) - R\_0^{\mathbb{F}}(\psi^{\*,\mathcal{F}}) = \mathbb{E}\left[\int\_0^T (\theta\_r^{\*,\mathcal{F}})^2 \, \mathrm{d}\langle Y\rangle\_r^{\mathbb{F}}\right] - \mathbb{E}\left[\int\_0^T (\theta\_r^{\*,\mathcal{F}^Y})^2 \, \mathrm{d}\langle Y\rangle\_r^{p,\mathbb{F}^Y}\right]. \tag{96}$$

Plugging in the expressions for the optimal strategies given in Eqs. (82) and (83), respectively, and denoting $\Sigma(t, X\_t, Y\_t) := Y\_t^2 \sigma^2(t, Y\_t) + \int\_{\mathbb{R}} z^2 \lambda(t, X\_t, Y\_t) \varphi(t, X\_t, Y\_t, \mathrm{d}z)$, we have

$$\begin{split} R\_0^{\mathbb{F}^Y}(\psi^{\*,\mathcal{F}^Y}) - R\_0^{\mathbb{F}}(\psi^{\*,\mathcal{F}}) &= \mathbb{E}\left[\int\_0^T \left(\frac{g^2(t, X\_t, Y\_t)}{\Sigma(t, X\_t, Y\_t)} - \frac{\pi\_t^2(g)}{\pi\_t(\Sigma)}\right) \mathrm{d}t\right] \\ &\leq C\,\mathbb{E}\left[\int\_0^T \left(g^2(t, X\_t, Y\_t) - \pi\_t^2(g)\right) \mathrm{d}t\right] = C\,\mathbb{E}\left[\int\_0^T \left(g(t, X\_t, Y\_t) - \pi\_t(g)\right)^2 \mathrm{d}t\right] \end{split} \tag{97}$$

for some $C > 0$, where the inequality follows by Assumption 18 and, in the last equality, we used $\mathbb{E}\left[\int\_0^T 2\,g(t, X\_t, Y\_t)\,\pi\_t(g) \, \mathrm{d}t\right] = \mathbb{E}\left[\int\_0^T 2\,\pi\_t(g)^2 \, \mathrm{d}t\right]$.

We can conclude that we have found an upper bound for the expected difference between the total risks taken by an informed investor and a partially informed one, which is directly proportional to the mean-squared error between the process $\{g(t, X\_t, Y\_t),\ t \in [0, T]\}$ and its filtered estimate $\pi(g) = \{\pi\_t(g),\ t \in [0, T]\}$.
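The last equality in Eq. (97) is nothing but the $L^2$-projection property of conditional expectation: since $\pi\_t(g) = \mathbb{E}[g(t, X\_t, Y\_t) \mid \mathcal{F}\_t^Y]$, the tower property gives $\mathbb{E}[g\,\pi\_t(g)] = \mathbb{E}[\pi\_t(g)^2]$, hence $\mathbb{E}[g^2] - \mathbb{E}[\pi\_t(g)^2] = \mathbb{E}[(g - \pi\_t(g))^2]$. A minimal finite-space check (the random variables below are arbitrary placeholders):

```python
from itertools import product

# Finite probability space: two independent fair coins (u, v); the observed
# information is u, so pi(g) = E[g | u] is a conditional average over v.
omega = list(product([0, 1], repeat=2))

def g(u, v):
    return 1.0 + 2.0 * u + 3.0 * v + u * v     # arbitrary square-integrable g

def E(f):
    return sum(f(u, v) for u, v in omega) / len(omega)

def pi_g(u, _v):
    """Conditional expectation of g given the observed component u."""
    vals = [g(u, v) for v in (0, 1)]
    return sum(vals) / len(vals)

lhs = E(lambda u, v: g(u, v) ** 2) - E(lambda u, v: pi_g(u, v) ** 2)
rhs = E(lambda u, v: (g(u, v) - pi_g(u, v)) ** 2)
print(lhs, rhs)   # both equal 3.125 on this space
```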

### Author details

Claudia Ceci<sup>1</sup>\* and Katia Colaneri<sup>2</sup>

\*Address all correspondence to: c.ceci@unich.it

1 Department of Economics, University of Chieti-Pescara, Pescara, Italy

2 Department of Economics, University of Perugia, Perugia, Italy

### References

[1] Kushner H. On the differential equations satisfied by conditional probability densities of Markov processes, with applications. Journal of the Society for Industrial and Applied Mathematics, Series A: Control. 1964;2(1):106-119

[2] Kushner H. Dynamical equations for optimal nonlinear filtering. Journal of Differential Equations. 1967;3(2):179-190

[3] Zakai M. On the optimal filtering of diffusion processes. Probability Theory and Related Fields. 1969;11(3):230-243


[21] Ceci C, Colaneri K, Cretarola A. Local risk-minimization under restricted information on asset prices. Electronic Journal of Probability. 2015;20(96):1-30

[22] Ceci C, Colaneri K, Cretarola A. Hedging of unit-linked life insurance contracts with unobservable mortality hazard rate via local risk-minimization. Insurance: Mathematics and Economics. 2015;60:47-60

[23] Ceci C, Gerardi A. Pricing for geometric marked point processes under partial information: entropy approach. International Journal of Theoretical and Applied Finance. 2009;12:179-207

[24] Frey R. Risk minimization with incomplete information in a model for high-frequency data. Mathematical Finance. 2000;10(2):215-222

[25] Nagai H, Peng S. Risk-sensitive dynamic portfolio optimization with partial information on infinite time horizon. Annals of Applied Probability. 2000;12:173-195

[26] Bäuerle N, Rieder U. Portfolio optimization with jumps and unobservable intensity process. Mathematical Finance. 2007;17(2):205-224

[27] Föllmer H, Sondermann D. Hedging of non-redundant contingent claims. In: Hildenbrand W, Mas-Colell A, editors. Contributions to Mathematical Economics. Amsterdam: North-Holland; 1986. pp. 205-223

[28] Schweizer M. A guided tour through quadratic hedging approaches. In: Jouini E, Cvitanic J, Musiela M, editors. Option Pricing, Interest Rate and Risk Management. Cambridge: Cambridge University Press; 2001. pp. 538-574

[29] Schweizer M. Risk minimizing hedging strategies under partial information. Mathematical Finance. 1994;4:327-342

[30] Ceci C, Cretarola A, Russo F. GKW representation theorem under restricted information. An application to risk-minimization. Stochastics and Dynamics. 2014;14(2):1350019 (23 pages)

[31] Jacod J, Shiryaev A. Limit Theorems for Stochastic Processes. 2nd ed. Berlin: Springer; 2003

[32] Protter P, Shimbo K. No arbitrage and general semimartingales. In: Ethier SN, Feng J, Stockbridge RH, editors. Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz. Beachwood, Ohio, USA: Institute of Mathematical Statistics; 2008. pp. 267-283

**Provisional chapter**

### **Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks**

DOI: 10.5772/intechopen.70131

Sien Chen, Wenqiang Huang, Mengxi Chen, Junjiang Zhong and Jie Cheng

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70131

#### **Abstract**

Faced with increasingly fierce competition in the aviation market, the strategy of consumer choice has gained increasing significance in both academia and practice. With ever-increasing travel choices and growing consumer heterogeneity, how do airline companies satisfy passengers' needs? With a vast amount of data, how do airline managers combine information to uncover the relationships between independent variables, gain insight into passengers' choices and value systems, and determine the best personalized contents for them? Using the real case of China Southern Airlines, this paper illustrates how a Bayesian belief network (BBN) can enable airlines to dynamically recommend relevant contents based on predicted passengers' choices in order to optimize loyalty. The findings of this study provide airline companies with useful insights to better understand passengers' choices and develop effective strategies for growing customer relationships.

**Keywords:** consumer choice, Bayesian belief network, recommendation system

### **1. Introduction**

In a world of increasingly global competition, companies have to compete on the effectiveness and efficiency of their marketing strategies to capture new opportunities to satisfy customers' needs. In other words, having the greatest product at the lowest price is not competitive enough. Choice behavior is affected by a consumer's own preference for entire product categories and particular brands, allowing companies to collect market and industry data, learn about consumer preferences, and adjust sales tactics. In general, companies must consider consumer choice and offer their customers a variety of differentiated products and different types of choices to meet consumer demand when they formulate revenue decisions and marketing strategies. For instance, most airlines have different fare classes (e.g., economy class versus first class) that differ in the level of services and facilities available to customers. Companies have to understand the choices that consumers make when facing such a product assortment and provide appropriate contents for each consumer. Once individual choice has been modeled, the choice prediction would be of great value to managers for estimating the impact of a change in product formulation [1, 2].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

What one cares most about is choice: the selection of suitable content from a set of available alternatives. Given the growing diversity of purchasing channels and information media, companies are increasingly interested in modeling and understanding the actual process through which consumers choose products, in addition to measuring consumers' future choices. A better understanding of consumer choices and the prediction of preferences are important for enterprises introducing new products and implementing target marketing. Preference prediction can also be used more extensively by companies to guide decision optimization [3]. Researchers and managers are mostly interested in understanding human choice behavior, particularly the underlying choice mechanisms, and in revealing and investigating the fundamental reasons behind it. Choice behavior is complex yet rational; as a result, decision makers seek to simplify the formulation of the choice process. In capturing consumers' choice decisions, a method that estimates direct and indirect effects, situation-specific variables, and clear causal relationships can better represent the mechanisms of choice behavior. Based on the choice mechanism, companies need to measure preferences and predict consumer decision making in order to conduct market research and design products. In this chapter, we aim to investigate the following concerns:

• How can we design a model that uses only current-period choices to infer consumers' inter-temporal preferences?

• How can we infer consumers' choice for content in the future?

• Is the process dynamic, and does it allow researchers to analyze the influences of effect changes?

With increasing awareness of transportation competition, if airline companies hope to survive and make a profit, they have to recognize that the customer resource is their most valuable competitive advantage and do their best to satisfy the needs of their customers. Moreover, Mowen (1988) emphasized that managers can reset promotional strategies to satisfy different types of consumers' desires most effectively, through the best channels [7]. Based on the above analysis, we decided to introduce a personalized content recommendation system to satisfy Chinese air passengers' desires. A good relationship with customers is crucial for airline companies to keep their competitive advantage and, furthermore, to make a profit in the long run. Using real historical data from China Southern Airlines, a Bayesian network can match the best personalized contents to each individual passenger.

### **2. Consumer choice behavior and Bayesian belief network**

This chapter introduces Bayesian belief networks (BBNs) for predicting air passengers' choices. On the basis of these choices, airlines can recommend the most relevant content to passengers, including products, services, tips, notices, feature introductions, and information sharing, to improve their travel experience, satisfaction, and loyalty. The remainder of this chapter is organized as follows. Section 2 briefly reviews consumer choice behavior, provides some definitions, and illustrates the advantages of Bayesian belief networks. In the next section, we establish BBN models using the case of China Southern Airlines with real transaction data, including passengers' basic information, historical decision options, and purchase characteristics, to predict the possible contents that a consumer will choose, followed by the model results and discussion.

### **2.1. Consumer choice behavior**

When consumers face multiple alternative products, brands, and services, they tend to repeat the same choices that proved satisfactory in similar situations [4]. Information integration theory offers a specific mechanism to describe how individuals integrate separate pieces of available information into an overall index of preference [5, 6]. The theory proposes that, in situations where information about products and brands is available in the marketplace, consumers tend to value and weight product attributes at the time of making a purchase decision. We formulated a comprehensive evaluation by combining consumers' values and weights under certain rules. Marketing managers should carefully study these decision-making processes and outcomes to understand where consumers collect relevant information, how consumers form beliefs, and what criteria consumers use to make product or service choices. As a result, companies can develop products that emphasize the appropriate attributes, and managers can reset promotional strategies to satisfy different types of consumers' desires most effectively, through the best channels [7]. Another issue of interest to academia is determining whether there are systematic differences in consumers' choice behavior. Identifying and understanding these differences is important for developing and formulating effective marketing strategies.

Consumer choice behavior has mainly been conceptualized as a combination of socio-demographics and the attributes of alternatives [8]. Constructs such as utility, attitude, or cognition are used to map these attributes into choice behavior. However, little recent research has considered choice process models that account for both consumers' socio-demographic characteristics and the attributes of their decision alternatives.

### *2.1.1. Consumer choice behavior in airline industry*

In recent years, the airline industry has faced economic challenges, coupled with volatile fuel prices and pressure for environmental protection. In addition, with increasing awareness of the competition caused by the development of other transportation alternatives, if airline companies hope to survive and make a profit, they have to recognize that the customer resource is their most valuable competitive advantage and do their best to satisfy the needs of their customers. Thus, a good relationship with customers is crucial for airline companies to keep their competitive advantage and make a profit in the long run. Domestic and international airline companies have long since shifted their attention to customer relationship management [9]. The relationship with passengers has been taken as one of the most important goals for every airline company seeking to maximize passengers' loyalty and revenues. Besides good performance in airline operation and business management, another important key to success is leveraging the power of customer relationships to attain superb performance. Airline companies that can correctly estimate trends and risks in the airline market and take the necessary actions to satisfy their customers can be much more successful in the industry.

Although understanding the changing needs of passengers is of great importance for airlines, passengers' decision-making processes have received relatively little managerial attention. Therefore, a further understanding of those decision-making processes is crucial for airline companies to improve their operations and business models [10].

With today's ever-increasing travel choices and growing consumer heterogeneity, numerous factors affect passengers' choices, for example, their socio-demographic status, decision-making patterns, cultural background, ticket cost, travel objectives, time schedule, and so on. The role of each factor is difficult to define, let alone the interactions between different factors. Leisure travelers are becoming more and more supply-oriented, selecting airlines with the most convenient schedules and the best service experience. A shift from first selecting a travel destination and then seeking appropriate transportation to one in which a desirable airline service is set first and the trip is arranged around it will very likely alter the dominant social-ideological trend of travel behavior.

### **2.2. Bayesian belief network**

Belief networks are probabilistic graphical representations of models that capture relationships between different variables. Belief networks use either directed or undirected graphs to represent a dependency model. The directed acyclic graph (DAG) is more flexible and expressive, and it can represent a wider range of probabilistic interdependencies than undirected graphs. For example, induced and transitive dependencies cannot be modeled accurately by undirected graphs but can easily be represented in DAGs.
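The notion of an induced dependency can be made concrete with a small collider A → C ← B, where A and B are independent a priori but become dependent once C is observed (the "explaining away" effect). The sketch below, with illustrative numbers only, enumerates the joint distribution to show this:

```python
# Collider A -> C <- B: A and B are marginally independent, but observing C
# induces a dependency between them. All numbers here are illustrative.
import itertools

p_a = {1: 0.5, 0: 0.5}
p_b = {1: 0.5, 0: 0.5}

def p_c_given(c, a, b):
    # C is the logical OR of its parents (a deterministic CPT, for clarity)
    return 1.0 if c == (a or b) else 0.0

def joint(a, b, c):
    # Chain-rule factorization of the DAG
    return p_a[a] * p_b[b] * p_c_given(c, a, b)

def p_a1(given_b=None, given_c=None):
    """P(A = 1 | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for a, b, c in itertools.product((0, 1), repeat=3):
        if given_b is not None and b != given_b:
            continue
        if given_c is not None and c != given_c:
            continue
        w = joint(a, b, c)
        den += w
        if a == 1:
            num += w
    return num / den

print(p_a1(given_b=1))                    # 0.5: knowing B alone says nothing about A
print(round(p_a1(given_c=1), 3))          # 0.667: observing C raises belief in A
print(p_a1(given_b=1, given_c=1))         # 0.5: B = 1 "explains away" C
```

An undirected graph over the same three variables cannot encode this pattern, since it has no way to state that A and B are independent only in the absence of evidence on C.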

A causal belief network is made up of various types of nodes. An arc between two nodes represents a causal relation; the originating node of the arc is the parent node, and the other is the child node. A root node has no parents, while a leaf node has no children. Each node has an underlying conditional probability table (CPT) that describes the option distribution for that node associated with each possible combination of the parent nodes. A Bayesian belief network (BBN) is a specific type of causal belief network, consisting of a set of nodes, where each node represents a variable in the dependency model and the connecting arcs represent the causal relationships among variables. **Figure 1** shows a simple BBN example about heart disease and heartache patients. The CPTs of the nodes are also illustrated in **Figure 1**. As in any causal belief network, the nodes represent stochastic variables and the arcs identify direct causal influences among the linked variables [11]. Each node or variable may take one of a number of possible states. The certainty of these states is determined from the belief in each possible state of all the nodes. The belief in each state of a node is updated whenever the belief in each state of any directly connected node changes. The difference between Bayesian belief networks and other causal belief networks is that BBNs use Bayesian calculus to compute the state probabilities of each node from the predetermined conditional and prior probabilities. The belief network is thus dynamic, and its probabilities are subject to change.

Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks http://dx.doi.org/10.5772/intechopen.70131 353

**Figure 1.** A Bayesian belief network depicting the relationship among heart disease and heartache patients.
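As an illustration of this belief-update step, the following sketch encodes a two-node version of the heart disease network in plain Python and applies Bayes' rule. The prior and CPT values are hypothetical placeholders, since the actual numbers from Figure 1 are not reproduced in the text:

```python
# Two-node BBN: a root node "disease" with a child node "ache".
# CPT values below are hypothetical, not the chapter's Figure 1 numbers.

# Prior belief for the root node, P(disease)
p_disease = {True: 0.15, False: 0.85}

# CPT for the child node, P(ache = true | disease)
p_ache_given = {True: 0.80, False: 0.10}

def posterior_disease(ache_observed: bool) -> float:
    """Update the belief in the root node after observing the child,
    using Bayes' rule: P(D | A) = P(A | D) P(D) / P(A)."""
    def p_ache(d: bool) -> float:
        return p_ache_given[d] if ache_observed else 1.0 - p_ache_given[d]

    joint = {d: p_ache(d) * p_disease[d] for d in (True, False)}
    evidence = sum(joint.values())  # P(A), the normalizing constant
    return joint[True] / evidence

print(round(posterior_disease(True), 3))   # 0.585: belief rises after observing heartache
```

Propagating beliefs through larger networks generalizes this same calculation over every directly connected node, which is exactly what tools such as Netica automate.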

A Bayesian belief network is a graphical representation of Bayesian probabilistic dependency within a knowledge domain [12], and it is particularly appropriate for target recognition problems, where the category, identity, and class of target groups are to be recognized [13]. Bayesian belief networks have proven to be very useful, being well suited to small and incomplete data collections. A Bayesian network can, for example, save a considerable amount of space, treat uncertainty explicitly, and support decision analysis, causal relationships, and fast responses. A Bayesian network is also suited to structural learning applications and to combining different sources of prior knowledge [14]. Moreover, the Bayesian approach finds the optimal model structure from data combined with a priori knowledge, whereas a constraint-based approach finds the optimal model structure from the conditional dependences between each pair of variables. Given the ascertained information, Bayesian belief networks are used to determine or infer the posterior probability distributions for the variables of interest [11]. As such, they do not include decisions or utilities that typify the preferences of the users; rather, the users make decisions based on these probability distributions [11]. The causal relationships in Bayesian belief networks allow the correlations between variables to be modeled and predictions to be made. Compared with classical statistical approaches, Bayesian belief networks have a distinct advantage [15]. The BBN has become a powerful tool not only for knowledge representation but also for reasoning under conditions of uncertainty [16], and for several decades it has been applied to real-world problems such as medical diagnostic systems, forecasting, and manufacturing process control [17]. Nowadays, the BBN has been extended to other applications, including software risk management [18], ecosystem and environmental management [19], and transportation [20]. Key events have a great impact on long-term transport mode choice decisions; a Bayesian belief network, more precisely a Bayesian decision network, has been used to explore this formalism in measuring, analyzing, and predicting dynamic travel mode choice in relation to key events and critical incidents [21]. However, few studies have applied the BBN to airline marketing management. This chapter introduces Bayesian belief networks that use relative and contextual variables to estimate a logical relationship, test the causal mechanisms of current passengers' choices, and predict their future preferences.

### **3. Case study: BBN in China Southern Airlines**

Air passengers make their choices using the prior information available to them as well as information they obtain from the internal and external environments. Passengers integrate all the information actually available to them (including prior information and any information that affects them) and turn it into preferences for a product. The basic aim is to support airline decision makers in analyzing the impact of variables on future passenger demand. Prediction of passenger choice for the distant future is critical to guide managers in specifying the marketing strategies to be used. Such distant-future predictions necessitate large-scale models of passenger choice, but that pressing need contrasts sharply with the capabilities of traditional forecasting and modeling techniques. In this study, both qualitative and quantitative approaches are applied. Developed as such, the BBN is expected to guide airline managers in their future product decisions, facilitating the analysis of specific decisions based on predicting the choice modes of passengers, highlighting the causal relationships among variables in the process, and finally showing the impact of changes. To represent the dynamic nature of the causal relationships and to draw inferences based on the uncertainty concerning the states of the variables, this part constructs a Bayesian belief network for an airline content recommendation mode using a case study of a Chinese airline. A basic assumption of a BBN is that when the conditional probabilities for each variable are multiplied, the joint probability distribution over all variables in the network is obtained [22]. The structure is determined based on experts' judgments on the content recommendation mode and a logical relationship.
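The factorization assumption can be illustrated with a toy three-node chain. The node names and probabilities below are invented for illustration and are not taken from the chapter's network:

```python
# Toy chain Upgrade -> Class -> Content: the joint distribution is the
# product of each node's conditional probability given its parents.
# All node names and numbers are hypothetical.
import itertools

p_upgrade = {"yes": 0.3, "no": 0.7}
p_class_given_upgrade = {
    ("first", "yes"): 0.6, ("economy", "yes"): 0.4,
    ("first", "no"): 0.1, ("economy", "no"): 0.9,
}
p_content_given_class = {
    ("lounge_tips", "first"): 0.7, ("promotions", "first"): 0.3,
    ("lounge_tips", "economy"): 0.2, ("promotions", "economy"): 0.8,
}

def joint(u, c, t):
    # P(u, c, t) = P(u) * P(c | u) * P(t | c)
    return p_upgrade[u] * p_class_given_upgrade[(c, u)] * p_content_given_class[(t, c)]

# A valid factorized joint must sum to 1 over all state combinations.
total = sum(joint(u, c, t)
            for u, c, t in itertools.product(
                p_upgrade, ("first", "economy"), ("lounge_tips", "promotions")))
print(round(total, 10))  # 1.0
```

For the full 35-variable network this product runs over every node's CPT; software such as Netica performs the multiplication and marginalization when the network is compiled.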

Three components of a belief network are important: the nodes representing variables, the links among nodes, and the states representing expected utilities or probabilities. Therefore, the first step of the process is the development of a causal network. For this purpose, the relevant variables and the logical relationships of this network should be determined. In the next stage, the belief network explores how changes in the states of the variables (nodes) influence consumers' future choices and content needs. In this way, the static causal model is transformed into a dynamic one through the calculation of the Bayesian belief network. The resulting network is subjected to scenario analysis to help airline decision makers in their analysis of future product designs.

### **3.1. Determination of the basic variables and causal relations**

To obtain a mutually exclusive and collectively exhaustive list of the basic variables for airline companies, interviews were conducted with airline domain experts, who were encouraged to identify the variables that might be relevant to the research. Thereafter, 35 variables were generated based on the situation in China, weighted by the experts' judgments and estimates. The decision variables are classified into four groups:

**1.** Personal characteristics

**2.** Experience and behavioral characteristics

**3.** Preference characteristics

**4.** Individual perception



Personal characteristics include airline passengers' demographic status and member information related to air travel. Experience and behavioral characteristics include passengers' purchase behavior, decisions in choosing products, and the attributes of particular experiences. Preference characteristics include consumer preferences and travel patterns. Individual perception describes the evaluation of passengers' loyalty, satisfaction, and comfort. After the identification of the variables, the next step is the determination of the causal relations among them. This network is proposed to capture the knowledge and assumptions and to understand the mechanism of consumer choice processes. The whole network is built using Netica. The changes in the network are subjected to field tests using real-world data from Chinese data sources.

### **3.2. Implementation of the BBN**

Content recommendation is a new attempt in airline companies' marketing strategies. After obtaining and integrating consumers' choice behavior, airline companies forecast and measure passengers' preferences to predict inter-temporal choices in the future. Based on the predicted choices, airline decision makers formulate relevant content and recommend it to target consumer groups. Passengers can then get the information they want, which improves their loyalty, satisfaction, and comfort with the airline. A better customer relationship means more market share in the fierce competition. Content recommendations include products, services, tips, notices, introductions, and information sharing. Products include popular routes, international and domestic hotels, duty-free gifts, etc. Services include special assistance, baggage inquiry, online check-in, pre-paid luggage, and so on. Tips include travel guides, entertainment activities, lounge locations, and flight delays. News and promotions, mileage redemption, and offers are included in notices. Introductions cover the frequent flyer program, activities, flights and hotels, boarding and arrival procedures, and so on. Information sharing is a new measure applied to web search with the popularity of social media. The airline industry is starting to realize that this platform can further improve the service experience. Passengers can link their Weibo (a popular Chinese social medium) or WeChat data to the flight reservation process, which makes it easy for them to know who is on the same flight [23]. The network in **Figure 2** shows the airline content recommendations for given passenger choices.

**Figure 2.** Network for content recommendation mode.

The first cluster illustrates personal information. The demographic data elements include gender, age, and education. Using these three attributes, one can infer individual occupation and time pressure. Distinguishing leisure travelers from business travelers depends on time sensitivity. The node 'feasibility' indicates the air travel feasibility of each combination of upgrade (yes or no), travel mode (leisure or business), and time pressure (yes or no). The second cluster depicts experience and behavioral characteristics. The experience characteristic describes passengers' trip and destination experience. The third cluster represents passengers' preferences, such as fare class preference, seat preference, flight time preference, and holiday preference. It is worth emphasizing that distance and upgrades may lead passengers to change their class selections. Passengers will choose more comfortable classes when they take longer-range flights. When it comes to membership upgrades, passengers are more likely to choose first class to accumulate qualifying miles. The fourth cluster describes individuals' perception, evaluating the effect of variable changes on passengers' loyalty, satisfaction, and degree of comfort. This cluster refers to benefit variables that are intended to cover the most significant perceptions. One of the benefit variables, namely loyalty, is affected by membership class: the higher the membership class, the higher the level of stickiness to an airline company. In this respect, the outcome should take the weights of the benefit nodes into account.
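A minimal sketch of how such benefit-node weights might be combined to rank candidate contents follows; the weights, content names, and conditional probabilities are purely hypothetical, not values read from the compiled network:

```python
# Hypothetical final step: combine the benefit nodes (loyalty, satisfaction,
# comfort) with weights to rank candidate contents. Illustrative numbers only.
weights = {"loyalty": 0.5, "satisfaction": 0.3, "comfort": 0.2}

# P(benefit = high | content recommended), e.g. read off a compiled network
benefit_given_content = {
    "lounge_tips": {"loyalty": 0.70, "satisfaction": 0.60, "comfort": 0.80},
    "promotions":  {"loyalty": 0.55, "satisfaction": 0.75, "comfort": 0.50},
}

def weighted_benefit(content: str) -> float:
    """Weighted sum of the benefit-node probabilities for one content."""
    probs = benefit_given_content[content]
    return sum(weights[b] * probs[b] for b in weights)

# Recommend the content with the highest weighted benefit
best = max(benefit_given_content, key=weighted_benefit)
print(best, round(weighted_benefit(best), 3))  # lounge_tips 0.69
```

In practice these probabilities would come from querying the compiled BBN with a passenger's evidence set, and the weights from the expert judgments described above.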

#### **3.3. Results and discussion**

The data from airlines are used to complete all conditional probability tables (CPTs) of the nature nodes. After completing the tables, we use the Netica software to compile the network and determine the probabilities of the six content options. **Figure 3** shows the compiled decision network.
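Filling a CPT and reading off a node's prior belief amounts to mixing the CPT rows by the parent's marginal, which is what compiling the network computes at every node. A minimal single-parent sketch with made-up probabilities (not the airline's CPT values):

```python
def marginal(child_cpt, parent_probs):
    """P(child) = sum over parent states p of P(child | parent=p) * P(parent=p)."""
    out = {}
    for pstate, pprob in parent_probs.items():
        for cstate, cprob in child_cpt[pstate].items():
            out[cstate] = out.get(cstate, 0.0) + cprob * pprob
    return out

# Illustrative CPT: travel mode conditioned on time pressure (made-up numbers).
cpt = {"yes": {"business": 0.8, "leisure": 0.2},
       "no":  {"business": 0.3, "leisure": 0.7}}
prior = {"yes": 0.4, "no": 0.6}
marginal(cpt, prior)   # ≈ {'business': 0.5, 'leisure': 0.5}
```

For nodes with several parents, the same sum runs over the joint parent states; tools such as Netica perform this propagation for the whole network at once.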

Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks http://dx.doi.org/10.5772/intechopen.70131 357

**Figure 3.** Compiled decision network (Cluster 1).


From **Figure 3**, tips have the highest probability among all the content decision options, reaching 18.5%. Because tips contain travel guides, entertainment activities, lounge locations, and delay alerts, these varied hints help passengers have a considerably better experience. The probabilities of products and notices are around 17%, while the other three contents sit close together just over the 15.4% average. The beliefs and probabilities are updated whenever the evidence for a nature node changes; we discuss some examples below.

After entering the evidence 'Yes' for the node 'Feasibility' (colored gray in **Figure 4**), the beliefs of the decision nodes are automatically recalculated. Only the probability of the 'Introduce' option changes under this new evidence. When air travel is entirely feasible for a passenger, whether the purpose is holiday or business, 'Introduce' is less useful, since the flight information and boarding and arrival procedures it provides are already known. Such passengers care more about services, delay alerts, popular routes, holiday destinations, and so on.
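The automatic belief updating that Netica performs when evidence is entered is, on a single edge, just Bayes' rule. A sketch with hypothetical numbers, not the chapter's compiled values:

```python
def posterior(prior, likelihood, observed):
    """P(parent | child = observed) by Bayes' rule on one parent-child edge."""
    unnorm = {s: prior[s] * likelihood[s][observed] for s in prior}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# Observing 'Feasibility = yes' revises the belief over travel mode.
prior = {"business": 0.5, "leisure": 0.5}
lik = {"business": {"yes": 0.9, "no": 0.1},
       "leisure":  {"yes": 0.6, "no": 0.4}}
posterior(prior, lik, "yes")   # 'business' rises from 0.50 to 0.60
```

In the full network this update is propagated through every node, which is why entering one finding can shift the probabilities of all six content options.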

**Figure 4.** Compiled decision network ('Feasibility' = 'yes').

**Figure 5** shows the influence of the evidence 'Yes' for the nature node 'Change' on the decision options. The beliefs and probabilities update automatically. The results illustrate that if passengers change their purchase behavior or trip mode, for example by taking a high-speed train, an effective way to retain them is for airline managers to recommend relevant tips and notices and to offer more comfortable services.

We then compiled the network with 'Long' for 'Distance', 'High' for 'TotalFlights', and 'High' for 'Frequency', respectively; the results appear in **Figure 6(a–c)**. The trends of the three results are similar: the probabilities of 'Products', 'Introduce', and 'Shares' rise markedly, while, surprisingly, the probability of 'Services' decreases sharply. This suggests to decision makers that passengers who travel frequently and over long distances need product recommendations, destination introductions, and web links to share, whereas services matter less than the other aspects.

Each passenger has his or her own preferences, and the information about preferences is too diverse, so we introduce a nature node 'Flexibility' to describe the overall variation in consumers' preferences. Controlling the states of the individual preference nodes (seat, class, flight time, and holiday) has no obvious effect on the decision options, so we set 'Flexibility' to 'Yes' in **Figure 7**. 'Shares' shows the biggest change across the content recommendation options, which means sharing is the most useful way to address flexibility, whose regular pattern is hard to capture. Facing this situation, airline managers can share links with their passengers on social media, releasing a "meeting & sitting in the same flight" service [23]. As the results in **Figure 8** show, membership class has a significant influence on customers' loyalty. Members with the highest qualification stick most strongly to their chosen airline: the probability of high loyalty rises from 35.7 to 85.2% when we set the state 'Gold' to 'Yes'. Moreover, more than half of silver card owners remain loyal to their airline. Airline managers should therefore serve loyal customers better, reduce customer churn, and mine for new customers.
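The loyalty shift described above is the forward reading of the membership-to-loyalty CPT: the prior belief mixes the CPT rows by the membership marginal, while entering 'Gold' as evidence selects a single row. The numbers below are hypothetical, chosen only to mimic the direction of the effect, not the chapter's 35.7% and 85.2% values:

```python
# Hypothetical CPT: loyalty given membership class (not the chapter's values).
loyalty_cpt = {"gold":   {"high": 0.85, "low": 0.15},
               "silver": {"high": 0.55, "low": 0.45},
               "basic":  {"high": 0.25, "low": 0.75}}
membership_prior = {"gold": 0.1, "silver": 0.3, "basic": 0.6}

# Prior belief: mixture of the CPT rows weighted by the membership marginal.
p_high = sum(membership_prior[m] * loyalty_cpt[m]["high"] for m in membership_prior)

# Evidence 'Gold' pins the membership node, so loyalty reads off a single row.
p_high_given_gold = loyalty_cpt["gold"]["high"]
print(round(p_high, 2), p_high_given_gold)   # prints 0.4 0.85
```

The jump from the mixed prior to the single 'gold' row is the same mechanism behind the belief change reported in Figure 8.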

In the first part, we raised three questions: How can we infer consumers' future content choices? How can a model using only current-period choices infer consumers' intertemporal preferences? Is the process dynamic, allowing researchers to analyze the influence of changing evidence?

**Figure 5.** (a) Compiled decision network (Cluster 2). (b) Compiled decision network ('Change' = 'Yes').


**Figure 6.** (a) Compiled decision network ('Distance' = 'Long'). (b) Compiled decision network ('TotalFlights' = 'High'). (c) Compiled decision network ('Frequency' = 'High').

This chapter uses the automatic updating process to explain the dynamics of belief networks. The BBN model represents a complex network that constructs and models the consumer choice process. The examples above show clearly how entering evidence for one state of a node changes the probabilities of the decision options. Based on real historical data from China Southern Airlines, we predict passengers' choices and help airline managers recommend relevant content to satisfy passengers' needs.

**Figure 7.** Compiled decision network ('Flexibility' = 'Yes').


**Figure 8.** (a–d) Compiled decision network of Cluster 4.

### **4. Conclusion and implications**

This article measures air passengers' preferences and predicts their future choices from current choice behavior using a Bayesian belief network. The network can represent complex choice behavior and the causal relationships among different variables, and the probabilities of the options capture passengers' dynamic decision-making processes. The greatest strength of the Bayesian network is that the probability obtained at each stage rests on mathematics and statistics: given enough information, the network infers reasonable results.

We illustrate this with a detailed empirical study of a data set from a Chinese airline. Our research demonstrates that understanding consumer choice behavior is beneficial for airline managers' strategic decision making.

To help formulate better marketing strategies, airline companies may consider adopting the following procedures.


A good strategy should analyze passengers' trip behavior and preferences to conduct cross-selling, filter out unnecessary information, present recommendations to consumers, and offer the most valuable product portfolio to customers.

We expect that, together with the need for more specific features, BBNs combined with artificial intelligence and deep learning will be of great value in addressing uncertainty and consumer choice behavior in the future.

### **Author details**

Sien Chen<sup>1,2</sup>\*, Wenqiang Huang<sup>3</sup>, Mengxi Chen<sup>4</sup>, Junjiang Zhong<sup>5</sup> and Jie Cheng<sup>6</sup>


2 Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China


### **References**



[4] Hansen F. Consumer choice behavior: An experimental approach. Journal of Marketing Research. 1969;**6**(4):436-443

[5] Anderson NH. Contributions to Information Integration Theory Volume II: Social. Lawrence Erlbaum Associates. Psychology Press, New York; 1991

[6] Bettman J, Capon N, Lutz RJ. Cognitive algebra in multi-attribute attitude models. Journal of Marketing Research. 1975;**12**(May):151-164

[7] Mowen JC. Beyond consumer decision making. Journal of Consumer Marketing. 1988;**5**(1):15-25

[8] Wierenga B, van Raaj WF. Consumentengedrag. Leiden; Stenfert Kroese BV; 1987

[9] Davenport TH. At the Big Data Crossroads: Turning Towards a Smarter Travel Experience. 2013. Available from: http://www.bigdata.amadeus.com/assets/pdf/Amadeus\_Big\_Data.pdf (Accessed: March 2, 2018)

[10] Buhalis D, Law R. Progress in information technology and tourism management: 20 years on and 10 years after the Internet—The state of eTourism research. Tourism Management. 2008;**28**(4):587-590

[11] Suermondt HJ. Explanation in Bayesian belief networks. PhD thesis, Palo Alto, California: Medical Information Sciences, Stanford University; March 1992

[12] Jensen FV. An Introduction to Bayesian Networks. London, UK: UCL Press; 1996

[13] Stewart L, McCarty Jr P. The use of Bayesian belief networks to fuse continuous and discrete information for target recognition, tracking and situation assessment. Proceedings of the SPIE. 1992;**1699**:177-185

[14] Uusitalo L. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling. 2007;**203**(3/4):312-318

[15] Heckerman D. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06. Redmond, WA: Microsoft Corporation; 1996

[16] Cheng J, Greiner R, Kelly J, Bell D, Liu W. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence. 2002;**137**(1/2):43-90

[17] Heckerman D, Mamdani A, Wellman MP. Real-world applications of Bayesian networks. Communications of the ACM. 1995;**38**(3):24-26

[18] Fan C, Yu Y. BBN-based software project risk management. The Journal of Systems and Software. 2004;**73**(2):193-203

[19] Uusitalo L. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling. 2007;**203**(3/4):312-318

[20] Ulegine F, Onsel S, Topcu YI, Aktas E, Kabak O. An integrated transportation decision support system for transportation policy decisions: The case of Turkey. Transportation Research Part A, Policy and Practice. 2007;**41**(1):40-97

[21] Verhoeven M, Arente TA, Timmermans HJP, van der Waerden PJHJ. Modeling the impact of key events on long-term transport mode choice decisions: A decision network approach using event history data. Transportation Research Record. 2005;**1926**:106-114. DOI: 10.3141/1926-13



### *Edited by Javier Prieto Tejedor*

The range of Bayesian inference algorithms and their different applications has been greatly expanded since the first implementation of a Kalman filter by Stanley F. Schmidt for the Apollo program. Extended Kalman filters or particle filters are just some examples of these algorithms that have been extensively applied to logistics, medical services, search and rescue operations, or automotive safety, among others. This book takes a look at both theoretical foundations of Bayesian inference and practical implementations in different fields. It is intended as an introductory guide for the application of Bayesian inference in the fields of life sciences, engineering, and economics, as well as a source document of fundamentals for intermediate Bayesian readers.

Photo by LV4260 / iStock

Bayesian Inference
