We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists.

### Interested in publishing with us? Contact book.department@intechopen.com

## Meet the editor

Niansheng Tang is Professor of Statistics and Dean of the School of Mathematics and Statistics, Yunnan University. He was elected a Yangtze River Scholars Distinguished Professor in 2013, a member of the International Statistical Institute (ISI) in 2016, and a member of the board of the International Chinese Statistical Association (ICSA) in 2018. He received the National Science Fund for Distinguished Young Scholars of China in 2012.

He serves as a member of the editorial board for *Statistics and Its Interface* and *Journal of Systems Science and Complexity*, and as an editor for *Communications in Mathematics and Statistics*. His research interests include biostatistics, Bayesian statistics, missing data analysis, statistical diagnosis, variable selection, and high-dimensional data analysis. He has published more than 170 research papers and authored four books.

Contents

**Preface XI**

**Section 1**
The Choice of the Prior **1**

**Chapter 1 3**
On the Impact of the Choice of the Prior in Bayesian Statistics
*by Fatemeh Ghaderinezhad and Christophe Ley*

**Section 2**
Some Advances on Sampling Methods **15**

**Chapter 2 17**
A Brief Tour of Bayesian Sampling Methods
*by Michelle Y. Wang and Trevor Park*

**Chapter 3 29**
A Review on the Exact Monte Carlo Simulation
*by Hongsheng Dai*

**Section 3**
Bayesian Inference for Complicated Data **49**

**Chapter 4 51**
Bayesian Analysis for Random Effects Models
*by Junshan Shen and Catherine C. Liu*

**Chapter 5 63**
Bayesian Inference of Gene Regulatory Network Using Constraint-Based Adaptive Boost Algorithm
*by Xi Chen and Jianhua Xuan*

**Chapter 6 79**
Patient Bayesian Inference: Cloud-Based Healthcare Data Analysis
*by Shahid Naseem*

**Chapter 7 89**
The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted Parameter Spaces
*by Ying-Ying Zhang*



## Preface

Over the years, Bayesian statistics has seen substantial development, driven by applications in fields such as social science, biomedicine, genomics, and signal processing, as well as by improvements in computing power. In particular, many novel Bayesian theories and methods have been developed, including novel sampling techniques, approaches to the selection of the prior, and new Bayesian estimation procedures. This book introduces key ideas of Bayesian sampling methods, Bayesian estimation, and the selection of the prior. It is structured around the impact of the choice of the prior on Bayesian statistics, advances in Bayesian sampling methods, and Bayesian inference for complicated data, including breast cancer data, cloud-based healthcare data, gene network data, and longitudinal data.

Fundamental statistical problems have changed with the move from continuous/discrete data to network and cloud-based data analyses, and traditional Bayesian sampling techniques face unprecedented challenges as a result. To this end, this book introduces some novel approaches to Bayesian inference on a few topics of interest, rather than giving a comprehensive overview.

This book includes three sections and seven chapters. Section I introduces the problem of the impact of the choice of the prior. It includes Chapter 1, in which Professor Christophe Ley investigates the impact of the choice of the prior on Bayesian statistics, including the conjugate prior and the Jeffreys prior. Section II focuses on advances in sampling methods. It contains Chapters 2 and 3, in which Professor Michelle Wang introduces the Gibbs sampler, slice sampler, Metropolis-Hastings sampling, Hamiltonian Monte Carlo, and cluster sampling, among others, and Professor Hongsheng Dai reviews exact Monte Carlo simulation techniques. Section III describes Bayesian inference for complicated data. It contains Chapters 4, 5, 6, and 7, in which Professor Catherine Liu introduces Bayesian analysis for random effects models, Professor Xi Chen studies Bayesian integration for gene network data, Professor Loc Nguyen discusses Bayesian inference for cloud-based healthcare data, and Dr. Ying-Ying Zhang considers Bayesian estimators under six loss functions.

I was invited to edit this book after the publication of "Bayesian analysis for hidden Markov factor analysis models," which I co-wrote with Yemao Xia, Xiaoqian Zeng, and Niansheng Tang. I am very grateful to Mr. Mateo Pulko for his kind invitation to edit this book and for providing me the chance to work with my aforementioned coauthors. I would also like to thank Professors Christophe Ley, Michelle Wang, Hongsheng Dai, Catherine Liu, Xi Chen, Loc Nguyen, and Ying-Ying Zhang for their contributions. I sincerely hope that this book will be of great interest to statisticians, engineers, doctors, and machine learning researchers.

> **Niansheng Tang** Yunnan University, China

Section 1

## The Choice of the Prior


#### **Chapter 1**

### On the Impact of the Choice of the Prior in Bayesian Statistics

*Fatemeh Ghaderinezhad and Christophe Ley*

#### **Abstract**

A key question in Bayesian analysis is the effect of the prior on the posterior, and how we can measure this effect. Will the posterior distributions derived with distinct priors become very similar if more and more data are gathered? It has been proved formally that, under certain regularity conditions, the impact of the prior is waning as the sample size increases. From a practical viewpoint it is more important to know what happens at finite sample size *n*. In this chapter, we shall explain how we tackle this crucial question from an innovative approach. To this end, we shall review some notions from probability theory such as the Wasserstein distance and the popular Stein's method, and explain how we use these a priori unrelated concepts in order to measure the impact of priors. Examples will illustrate our findings, including conjugate priors and the Jeffreys prior.

**Keywords:** conjugate prior, Jeffreys prior, prior distribution, posterior distribution, Stein's method, Wasserstein distance

#### **1. Introduction**

A key question in Bayesian analysis is the choice of the prior in a given situation. Numerous proposals and divergent opinions exist on this matter, but our aim is not to delve into a review or discussion; rather, we want to provide the reader with a description of a useful new tool that allows him/her to make a decision. More precisely, we explain how to effectively measure the effect of the choice of a given prior on the resulting posterior. How much do two posteriors, derived from two distinct priors, differ? Providing a quantitative answer to this question is important as it also informs us about the ensuing inferential procedures. It has been proved formally in [1, 2] that, under certain regularity conditions, the impact of the prior wanes as the sample size increases. From a practical viewpoint it is however more interesting to know what happens at finite sample size *n*, and this is precisely the situation we are considering in this chapter.

Recently, [3, 4] have devised a novel tool to answer this question. They measure the Wasserstein distance between the posterior distributions based on two distinct priors at fixed sample size *n*. The Wasserstein (more precisely, Wasserstein-1) distance is defined as

$$d_{\mathcal{W}}(P_1, P_2) = \sup_{h \in \mathcal{H}} |\mathrm{E}[h(X_1)] - \mathrm{E}[h(X_2)]|$$

for *X*<sub>1</sub> and *X*<sub>2</sub> random variables with respective distribution functions *P*<sub>1</sub> and *P*<sub>2</sub>, and where $\mathcal{H}$ stands for the class of Lipschitz-1 functions. It is a popular distance

between two distributions, related to optimal transport and therefore also known as the *earth mover distance* in computer science; see [5] for more information. The resulting distance thus gives us the desired measure of the difference between two posteriors. If one of the two priors is the flat uniform prior (leading to the posterior coinciding with the data likelihood), then this measure quantifies how much the other chosen prior has impacted the outcome as compared to a data-only posterior. Now, since the Wasserstein distance is mostly impossible to calculate exactly, it is necessary to obtain sharp upper and lower bounds, which will partially be achieved by using techniques from the so-called Stein method, a famous tool in probabilistic approximation theory. We opt for the Wasserstein metric instead of, e.g., the Kullback-Leibler divergence precisely because of its nice link with the Stein method; see [3].
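As a rough illustration of this idea (not taken from [3, 4]), the Wasserstein-1 distance between two posteriors can be estimated from samples via the quantile coupling: for equal-size samples it is the average absolute difference of the sorted samples. The binomial model, sample sizes, and prior choices below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: binomial data with s successes out of n trials.
# The flat uniform prior gives the posterior Beta(s+1, n-s+1), while the
# Jeffreys prior Beta(1/2, 1/2) gives the posterior Beta(s+1/2, n-s+1/2).
def posterior_wasserstein(s, n, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    th1 = rng.beta(s + 1, n - s + 1, n_samples)      # uniform-prior posterior
    th2 = rng.beta(s + 0.5, n - s + 0.5, n_samples)  # Jeffreys-prior posterior
    # Empirical Wasserstein-1 distance via the quantile (comonotone) coupling
    return float(np.mean(np.abs(np.sort(th1) - np.sort(th2))))

d_small = posterior_wasserstein(s=3, n=10)      # little data: the prior matters
d_large = posterior_wasserstein(s=300, n=1000)  # more data: the impact wanes
assert d_large < d_small
```

The shrinking distance as *n* grows is exactly the waning impact of the prior discussed above; the chapter's contribution is to quantify it at fixed *n* through bounds rather than simulation.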

*DOI: http://dx.doi.org/10.5772/intechopen.88994*
The chapter is organized as follows. In Section 2 we provide the notations and terminology used throughout the paper, provide the reader with the minimal necessary background knowledge on the Stein method, and state the main result regarding the measure of the impact of priors. Then in Section 3 we illustrate how this new measure works in practice, by first working out a completely new example, namely priors for the scale parameter of the inverse gamma distribution, and second giving new insights into an example first treated in both [3, 4], namely priors for the success parameter in the binomial distribution.

#### **2. The measure in its most general form**

In this section we provide the reader with the general form of the new measure of the impact of the choice of prior distributions. Before doing so, we however first give a very brief overview on Stein's method that is of independent interest.

#### **2.1 Stein's method in a nutshell**

Stein's method is a popular tool in applied and theoretical probability, typically used for Gaussian and Poisson approximation problems. The principal goal of the method is to provide quantitative assessments in distributional comparison statements of the form *W*≈*Z* where *Z* follows a known and well-understood probability distribution (typically normal or Poisson) and *W* is the object of interest. Charles Stein [6] in 1972 laid the foundation of what is now called "Stein's method" by aiming at normal approximations.

Stein's method consists of two distinct components, namely

**Part A**: a framework allowing one to convert the problem of bounding the error in the approximation of *W* by *Z* into a problem of bounding the expectation of a certain functional of *W*.

**Part B**: a collection of techniques to bound the expectation appearing in Part A; the details of these techniques are strongly dependent on the properties of *W* as well as on the form of the functional.

We refer the interested reader to [7, 8] for detailed recent accounts on this powerful method. The reader will understand in the next sections why Stein's method has been of use for quantifying the desired measure, even without formal proofs or mathematical details.
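The characterization behind Part A can be made concrete for the normal case: Stein's identity states that E[*f*′(*Z*) − *Z f*(*Z*)] = 0 for smooth *f* if and only if *Z* is standard normal. The following numeric check is our own illustration (the choice of test function and of the non-normal comparison law are assumptions, not from the chapter):

```python
import numpy as np

# Stein identity for the standard normal: E[f'(Z) - Z f(Z)] = 0.
# With the test function f(w) = w (so f'(w) = 1), this reads E[1 - W^2].
def stein_discrepancy(samples):
    return float(np.mean(1.0 - samples**2))

rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)   # standard normal: identity holds
w = rng.exponential(size=1_000_000)  # Exp(1): E[1 - W^2] = 1 - 2 = -1

assert abs(stein_discrepancy(z)) < 0.01  # approximately zero
assert stein_discrepancy(w) < -0.9       # clearly nonzero
```

Bounding how far such expectations are from zero for a general *W* is precisely the job of the Part B techniques.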

#### **2.2 Notation and formulation of the main goal**


*Bayesian Inference on Complicated Data*

We start by fixing our notations. We consider independent and identically distributed (discrete or absolutely continuous) observations *X*<sub>1</sub>, …, *X*<sub>n</sub> from a parametric model with parameter of interest *θ* ∈ Θ ⊆ ℝ. We denote the likelihood of *X*<sub>1</sub>, …, *X*<sub>n</sub> by ℓ(*x*; *θ*), where *x* = (*x*<sub>1</sub>, …, *x*<sub>n</sub>) are the observed values. Take two different (possibly improper) prior densities *p*<sub>1</sub>(*θ*) and *p*<sub>2</sub>(*θ*) for our parameter *θ*; the famous Bayes' theorem then readily yields the respective posterior densities

$$p_i(\theta; x) = \kappa_i(x)\, p_i(\theta)\, \ell(x; \theta), \quad i = 1, 2,$$

where *κ*<sub>1</sub>(*x*), *κ*<sub>2</sub>(*x*) are normalizing constants that depend only on the observed values. We denote by (Θ<sub>1</sub>, *P*<sub>1</sub>) and (Θ<sub>2</sub>, *P*<sub>2</sub>) the couples of random variables and cumulative distribution functions associated with the densities *p*<sub>1</sub>(*θ*; *x*) and *p*<sub>2</sub>(*θ*; *x*).

These notations allow us to formulate the main goal: measure the Wasserstein distance between *p*<sub>1</sub>(*θ*; *x*) and *p*<sub>2</sub>(*θ*; *x*), as this will exactly correspond to the difference between the posteriors resulting from the two priors *p*<sub>1</sub> and *p*<sub>2</sub>. Sharp upper and lower bounds have been provided for this Wasserstein distance, first in [3] for the special case of one prior being flat uniform, then in all generality in [4]. The determination of the upper bound has been achieved by means of Stein's method: first a relevant Stein operator has been found (Part A), and then a new technique designed in [3] has been put to use for Part B. The reader is referred to these two papers for details about the calculations; since this chapter is part of a book on Bayesian inference, we prefer to keep out those rather probabilistic manipulations.
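The posterior construction above is easy to reproduce numerically. The sketch below (our own illustration; the binomial likelihood, the data *s*, *n*, and the two prior choices are assumptions) computes *p*<sub>1</sub>(*θ*; *x*) and *p*<sub>2</sub>(*θ*; *x*) on a grid, with the normalizing constants *κ*<sub>i</sub>(*x*) obtained by numerical integration:

```python
import numpy as np

# Grid-based Bayes: p_i(theta; x) = kappa_i(x) * p_i(theta) * ell(x; theta),
# with kappa_i(x) chosen so the posterior integrates to one.
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
dtheta = theta[1] - theta[0]
s, n = 7, 20                                  # illustrative binomial data
lik = theta**s * (1 - theta)**(n - s)         # ell(x; theta), constants dropped

prior1 = np.ones_like(theta)                  # flat uniform prior p_1
prior2 = theta**-0.5 * (1 - theta)**-0.5      # Jeffreys prior p_2

def posterior(prior):
    unnorm = prior * lik                      # p_i(theta) * ell(x; theta)
    kappa = 1.0 / (unnorm.sum() * dtheta)     # normalizing constant kappa_i(x)
    return kappa * unnorm

p1, p2 = posterior(prior1), posterior(prior2)
mean1 = (theta * p1).sum() * dtheta
mean2 = (theta * p2).sum() * dtheta
assert abs(mean1 - (s + 1) / (n + 2)) < 1e-3    # Beta(s+1, n-s+1) mean
assert abs(mean2 - (s + 0.5) / (n + 1)) < 1e-3  # Beta(s+1/2, n-s+1/2) mean
```

The two posterior densities differ slightly; the Wasserstein distance between them is exactly the quantity the bounds of the next subsection control.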

#### **2.3 The general result**

The key element in the mathematical developments underlying the present problem is that the densities *p*<sub>1</sub>(*θ*; *x*) and *p*<sub>2</sub>(*θ*; *x*) are *nested*, meaning that one support is included in the other. Without loss of generality we here suppose that *I*<sub>2</sub> ⊆ *I*<sub>1</sub>, allowing us to express *p*<sub>2</sub>(*θ*; *x*) as $\frac{\kappa_2(x)}{\kappa_1(x)} \rho(\theta)\, p_1(\theta; x)$ with

$$\rho(\theta) = \frac{p_2(\theta)}{p_1(\theta)}.$$

The following general result has been obtained in [4], to which we refer the reader for a proof.

**Theorem 1.1** Consider $\mathcal{H}$ the set of Lipschitz-1 functions on ℝ and define

$$\tau_i(\theta; x) = \frac{1}{p_i(\theta; x)} \int_{a_i}^{\theta} (\mu_i - y)\, p_i(y; x)\, dy, \quad i = 1, 2, \tag{1}$$

where *a*<sub>i</sub> is the lower bound of the support *I*<sub>i</sub> = (*a*<sub>i</sub>, *b*<sub>i</sub>) of *p*<sub>i</sub>. Suppose that both posterior distributions have finite means *μ*<sub>1</sub> and *μ*<sub>2</sub>, respectively. Assume that *θ* ↦ *ρ*(*θ*) is differentiable on *I*<sub>2</sub> and satisfies (i) $\mathrm{E}[|\Theta_1 - \mu_1|\, \rho(\Theta_1)] < \infty$, (ii) $\rho(\theta) \int_{a_1}^{\theta} \left( h(y) - \mathrm{E}[h(\Theta_1)] \right) p_1(y; x)\, dy$ is integrable for all $h \in \mathcal{H}$, and (iii) $\lim_{\theta \to a_2, b_2} \rho(\theta) \int_{a_1}^{\theta} \left( h(y) - \mathrm{E}[h(\Theta_1)] \right) p_1(y; x)\, dy = 0$ for all $h \in \mathcal{H}$. Then

$$|\mu_1 - \mu_2| = \frac{\left| \mathrm{E}[\tau_1(\Theta_1; x)\, \rho'(\Theta_1)] \right|}{\mathrm{E}[\rho(\Theta_1)]} \le d_{\mathcal{W}}(P_1, P_2) \le \frac{\mathrm{E}[\tau_1(\Theta_1; x)\, |\rho'(\Theta_1)|]}{\mathrm{E}[\rho(\Theta_1)]}$$

and, if the variance of Θ<sub>1</sub> exists,

$$|\mu_1 - \mu_2| \le d_{\mathcal{W}}(P_1, P_2) \le \|\rho'\|_\infty \frac{\mathrm{Var}[\Theta_1]}{\mathrm{E}[\rho(\Theta_1)]},$$

where $\|\cdot\|_\infty$ stands for the infinity norm.

This result quantifies in all generality the measure of the difference between two priors *p*<sub>1</sub> and *p*<sub>2</sub>, and of course comprises the special case where one prior is the flat uniform. Quite nicely, if *ρ* is a monotone increasing or decreasing function, the bounds coincide, leading to

$$d_{\mathcal{W}}(P_1, P_2) = \frac{\mathrm{E}[\tau_1(\Theta_1; x)\, |\rho'(\Theta_1)|]}{\mathrm{E}[\rho(\Theta_1)]}, \tag{2}$$

*<sup>p</sup>*1ð Þ *<sup>β</sup>*j*<sup>x</sup>* <sup>∝</sup> <sup>1</sup>

*DOI: http://dx.doi.org/10.5772/intechopen.88994*

gamma distribution with density *β*↦ *<sup>κ</sup><sup>η</sup>*

*<sup>p</sup>*2ð Þ *<sup>β</sup>*j*<sup>x</sup>* <sup>∝</sup>*β<sup>η</sup>*�<sup>1</sup> exp f g �*κβ* � *<sup>β</sup>n<sup>α</sup>* exp �*<sup>β</sup>*

*ρ β*ð Þ¼ *<sup>p</sup>*2ð Þ *<sup>β</sup>*

*<sup>p</sup>*1ð Þ *<sup>β</sup>* <sup>∝</sup>

<sup>¼</sup> *<sup>n</sup><sup>α</sup>*

*ρ*0

� �ð Þ *β* the related density, we get

ð<sup>∞</sup> 0

*κη* Γð Þ*η*

> ð<sup>∞</sup> 0

� � � � � �

P*<sup>n</sup> i*¼1 1 *xi*

then the density

*f*

**7**

*Gamma n<sup>α</sup>,*

P*<sup>n</sup> i*¼1 1 *xi*

½ �¼ *ρ*ð Þ Θ <sup>1</sup>

P*<sup>n</sup> i*¼1 1 *xi* � �*<sup>n</sup><sup>α</sup>*

Γð Þ *nα*

<sup>¼</sup> *<sup>κ</sup><sup>η</sup>* Γð Þ*η* *<sup>β</sup> <sup>β</sup>n<sup>α</sup>* exp �*<sup>β</sup>*

*On the Impact of the Choice of the Prior in Bayesian Statistics*

X*n i*¼1

( )

which is none other than a gamma distribution with parameters *nα,*

parameter of an IG distribution. We consider thus as second prior a general

which is a gamma distribution with updated parameters *nα* þ *η,*

*κη*

1 *xi*

Now, the gamma distribution happens to be the conjugate prior for the scale

parameters *η* and *κ* are strictly positive. The ensuing posterior distribution *P*<sup>2</sup> has

$$|\mu\_1 - \mu\_2| \le d\_{\mathcal{W}}(P\_1, P\_2) \le \frac{\mathbb{E}\left[\tau\_1(\Theta\_1; x)\,|\rho'(\Theta\_1)|\right]}{\mathbb{E}\left[\rho(\Theta\_1)\right]}$$

and, if the variance of $\Theta\_1$ exists,

$$|\mu\_1 - \mu\_2| \le d\_{\mathcal{W}}(P\_1, P\_2) \le \lVert\rho'\rVert\_{\infty} \frac{\mathrm{Var}[\Theta\_1]}{\mathbb{E}\left[\rho(\Theta\_1)\right]},$$

where $\lVert\cdot\rVert\_{\infty}$ stands for the infinity norm.

This result quantifies in all generality the measure of the difference between two priors $p\_1$ and $p\_2$, and comprises of course the special case where one prior is flat uniform. Quite nicely, if $\rho$ is a monotone increasing or decreasing function, the bounds do coincide, leading to

$$d\_{\mathcal{W}}(P\_1, P\_2) = \frac{\mathbb{E}\left[\tau\_1(\Theta\_1; x)\,|\rho'(\Theta\_1)|\right]}{\mathbb{E}\left[\rho(\Theta\_1)\right]}, \tag{2}$$
hence an exact result. The reader will notice the sharpness of these bounds, given that the upper and lower bounds contain the same quantities; this fact is further underpinned by the equality Eq. (2). Finally we wish to stress that the functions $\tau\_i(\theta; x)$, $i = 1, 2$, from Eq. (1) are called Stein kernels in the Stein method literature and that these functions are always positive and vanish at the boundaries of the support.

#### **3. Applications and illustrations**

Numerous examples have been treated in [3, 4], such as priors for the location parameter of a normal distribution, the scale parameter of a normal distribution, the success parameter of a binomial or the event-enumerating parameter of the Poisson distribution, to cite but these. In this section we will, on the one hand, investigate a new example, namely the scale parameter of an inverse gamma distribution, and, on the other hand, revisit the binomial case. Besides providing the bounds, we will also for the first time plot numerical values for the bounds and hence shed new intuitive light on this measure of the impact of the choice of the prior.

#### **3.1 Priors for the scale parameter of the inverse gamma (IG) distribution**

The inverse gamma (IG) distribution has the probability density function

$$x \mapsto \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha-1} \exp\left\{-\frac{\beta}{x}\right\}, \quad x > 0,$$

where *α* and *β* are the positive shape and scale parameters, respectively. This distribution corresponds to the reciprocal of a gamma distribution (if $X \sim Gamma(\alpha, \beta)$ then $1/X \sim IG(\alpha, \beta)$) and is frequently encountered in domains such as machine learning, survival analysis and reliability theory. Within Bayesian inference, it is a popular choice of prior for the scale parameter of a normal distribution. In the present setting, we consider $\theta = \beta$ as the parameter of interest and *α* is fixed. The observations sampled from this distribution are written $x\_1, \ldots, x\_n$.
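This reciprocal link also gives a direct way to simulate IG variates with standard tools. The following sketch (using NumPy; the parameter values are purely illustrative) draws gamma variates and inverts them, checking against the IG mean formula $\beta/(\alpha-1)$, which holds for $\alpha > 1$:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 3.0, 2.0  # shape and scale of the target IG distribution

# If X ~ Gamma(alpha, rate beta), then 1/X ~ IG(alpha, beta).
# NumPy's gamma sampler takes shape and scale, so scale = 1/beta.
gamma_draws = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)
ig_draws = 1.0 / gamma_draws

# Sanity check: for alpha > 1 the IG mean is beta / (alpha - 1), here 2 / 2 = 1.
print(ig_draws.mean())
```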

The first prior is the popular noninformative Jeffreys prior. It is invariant under reparameterization and is proportional to the square root of the Fisher information associated with the parameter of interest. In the present setting simple calculations show that it is proportional to $1/\beta$. The resulting posterior $P\_1$ then has a density of the form

*On the Impact of the Choice of the Prior in Bayesian Statistics DOI: http://dx.doi.org/10.5772/intechopen.88994*


*Bayesian Inference on Complicated Data*


$$p\_1(\beta|\mathbf{x}) \propto \frac{1}{\beta} \beta^{n\alpha} \exp\left\{-\beta \sum\_{i=1}^n \frac{1}{x\_i}\right\} = \beta^{n\alpha-1} \exp\left\{-\beta \sum\_{i=1}^n \frac{1}{x\_i}\right\}$$

which is none other than a gamma distribution with parameters $\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)$. Now, the gamma distribution happens to be the conjugate prior for the scale parameter of an IG distribution. We consider thus as second prior a general gamma distribution with density $\beta \mapsto \frac{\kappa^{\eta}}{\Gamma(\eta)} \beta^{\eta-1} \exp\{-\kappa\beta\}$, where the shape and scale parameters *η* and *κ* are strictly positive. The ensuing posterior distribution $P\_2$ has then the density

$$p\_2(\beta|\mathbf{x}) \propto \beta^{\eta-1} \exp\left\{-\kappa\beta\right\} \times \beta^{n\alpha} \exp\left\{-\beta \sum\_{i=1}^n \frac{1}{x\_i}\right\} = \beta^{n\alpha+\eta-1} \exp\left\{-\beta\left(\sum\_{i=1}^n \frac{1}{x\_i} + \kappa\right)\right\}$$

which is a gamma distribution with updated parameters $\left(n\alpha + \eta, \sum\_{i=1}^n \frac{1}{x\_i} + \kappa\right)$. Considering Jeffreys prior as $p\_1$ and the gamma prior as $p\_2$ leads to the ratio

$$\rho(\beta) = \frac{p\_2(\beta)}{p\_1(\beta)} \propto \frac{\frac{\kappa^\eta}{\Gamma(\eta)} \beta^{\eta - 1} \exp\left\{-\kappa \beta\right\}}{\frac{1}{\beta}} = \frac{\kappa^\eta}{\Gamma(\eta)} \beta^\eta \exp\left\{-\kappa \beta\right\}.$$

One can easily check that all conditions of Theorem 1.1 are fulfilled, hence we can calculate the bounds. The lower bound is directly obtained as follows:

$$d\_{\mathcal{W}}(P\_1, P\_2) \ge |\mu\_1 - \mu\_2| = \left| \frac{n\alpha}{\sum\_{i=1}^{n} \frac{1}{x\_i}} - \frac{n\alpha + \eta}{\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa} \right| \tag{3}$$

$$= \left| \frac{n\alpha\sum\_{i=1}^{n} \frac{1}{x\_i} + n\alpha\kappa - n\alpha\sum\_{i=1}^{n} \frac{1}{x\_i} - \eta \sum\_{i=1}^{n} \frac{1}{x\_i}}{\sum\_{i=1}^{n} \frac{1}{x\_i} \left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)} \right| \tag{4}$$

$$= \left| \frac{n\alpha\kappa - \eta \sum\_{i=1}^{n} \frac{1}{x\_i}}{\sum\_{i=1}^{n} \frac{1}{x\_i} \left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)} \right|. \tag{5}$$

In order to acquire the upper bound we need to calculate

$$\rho'(\beta) = \frac{\kappa^{\eta}}{\Gamma(\eta)} \beta^{\eta - 1} \exp\left\{-\kappa \beta\right\} (\eta - \kappa \beta)$$

and, writing $\Theta\_1$ the random variable associated with $Gamma\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)$ and $f\_{Gamma\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)}(\beta)$ the related density, we get

$$\mathbb{E}[\rho(\Theta\_1)] = \int\_0^{\infty} \frac{\kappa^{\eta}}{\Gamma(\eta)} \beta^{\eta} \exp\left\{-\kappa\beta\right\} \times f\_{Gamma\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)}(\beta)\, d\beta \tag{6}$$

$$= \frac{\kappa^{\eta}}{\Gamma(\eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \int\_0^{\infty} \beta^{\eta} \exp\left\{-\kappa\beta\right\} \beta^{n\alpha-1} \exp\left\{-\beta\sum\_{i=1}^{n} \frac{1}{x\_i}\right\} d\beta \tag{7}$$


$$= \frac{\kappa^{\eta}}{\Gamma(\eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \int\_0^{\infty} \beta^{n\alpha+\eta-1} \exp\left\{-\beta\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)\right\} d\beta \tag{8}$$

$$= \frac{\kappa^{\eta}}{\Gamma(\eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \frac{\Gamma(n\alpha + \eta)}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)^{n\alpha+\eta}} \tag{9}$$



$$= \frac{\kappa^{\eta}}{\mathrm{Beta}(n\alpha, \eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)^{n\alpha+\eta}}. \tag{10}$$

From the Stein literature we know that the Stein kernel for the gamma distribution with parameters $\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)$ corresponds to $\tau(\beta; x) = \beta / \sum\_{i=1}^n \frac{1}{x\_i}$. Employing the triangle inequality we thus have

$$\mathbb{E}[\tau(\Theta\_1; x)|\rho'(\Theta\_1)|] = \mathbb{E}\left[\frac{\Theta\_1}{\sum\_{i=1}^{n} \frac{1}{x\_i}} \frac{\kappa^{\eta}}{\Gamma(\eta)} \Theta\_1^{\eta-1} \exp\left\{-\kappa\Theta\_1\right\} |\eta - \kappa\Theta\_1|\right] \tag{11}$$

$$\le \frac{\kappa^{\eta}}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)\Gamma(\eta)} \mathbb{E}\left[\Theta\_1^{\eta} \exp\left\{-\kappa\Theta\_1\right\}(\eta + \kappa\Theta\_1)\right]. \tag{12}$$

Now we need to calculate the expectation

$$\mathbb{E}\left[\boldsymbol{\Theta}\_{1}^{\eta}\exp\left\{-\kappa\boldsymbol{\Theta}\_{1}\right\}(\eta+\kappa\boldsymbol{\Theta}\_{1})\right] \tag{13}$$

$$= \int\_0^{\infty} \beta^{\eta} \exp\left\{-\kappa\beta\right\}(\eta + \kappa\beta) \times f\_{Gamma\left(n\alpha, \sum\_{i=1}^n \frac{1}{x\_i}\right)}(\beta)\, d\beta \tag{14}$$

$$= \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \int\_0^{\infty} \eta\beta^{n\alpha+\eta-1} \exp\left\{-\beta\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)\right\} d\beta \tag{15}$$

$$+ \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \int\_0^{\infty} \kappa\beta^{n\alpha+\eta} \exp\left\{-\beta\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)\right\} d\beta \tag{16}$$

$$= \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \left(\eta \frac{\Gamma(n\alpha+\eta)}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)^{n\alpha+\eta}} + \kappa \frac{\Gamma(n\alpha+\eta+1)}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i} + \kappa\right)^{n\alpha+\eta+1}}\right). \tag{17}$$

The final expression for the upper bound then corresponds to

$$d\_{\mathcal{W}}(P\_1, P\_2) \le \frac{\frac{\kappa^{\eta}}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)\Gamma(\eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\Gamma(n\alpha)} \left(\eta \frac{\Gamma(n\alpha+\eta)}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa\right)^{n\alpha+\eta}} + \kappa \frac{\Gamma(n\alpha+\eta+1)}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa\right)^{n\alpha+\eta+1}}\right)}{\frac{\kappa^{\eta}}{\mathrm{Beta}(n\alpha, \eta)} \frac{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)^{n\alpha}}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa\right)^{n\alpha+\eta}}} \tag{18}$$

$$= \frac{\mathrm{Beta}(n\alpha, \eta)\left(\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa\right)^{n\alpha+\eta}}{\frac{\Gamma(n\alpha)\Gamma(\eta)}{\Gamma(n\alpha+\eta)}\left(\sum\_{i=1}^{n} \frac{1}{x\_i}\right)} \times \frac{1}{\left(\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa\right)^{n\alpha+\eta}}\left(\eta + \kappa \frac{n\alpha+\eta}{\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa}\right) \tag{19}$$


$$= \frac{1}{\sum\_{i=1}^{n} \frac{1}{x\_i}}\left(\eta + \kappa \frac{n\alpha+\eta}{\sum\_{i=1}^{n} \frac{1}{x\_i}+\kappa}\right). \tag{20}$$

The Wasserstein distance between the posteriors based on the Jeffreys prior and conjugate gamma prior for the scale parameter *β* of the IG distribution is thus bounded as

$$\left| \frac{n\alpha\kappa - \eta \sum\_{i=1}^n \frac{1}{x\_i}}{\left(\sum\_{i=1}^n \frac{1}{x\_i}\right)\left(\sum\_{i=1}^n \frac{1}{x\_i} + \kappa\right)} \right| \le d\_{\mathcal{W}}(P\_1, P\_2) \le \frac{1}{\sum\_{i=1}^n \frac{1}{x\_i}}\left(\eta + \kappa \frac{n\alpha+\eta}{\kappa + \sum\_{i=1}^n \frac{1}{x\_i}}\right).$$

It can be seen that both the lower and upper bound are of the order of $O(n^{-1})$. In addition, it is noticeable that for larger observations the convergence becomes slower.
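Since the chain of simplifications leading from Eq. (18) to Eq. (20) is easy to get wrong by hand, it can be cross-checked numerically. The sketch below (with hypothetical helper names; `s` stands in for the sum of the $1/x\_i$) evaluates the unsimplified ratio built from Eqs. (12), (17) and (10) and the closed form of Eq. (20):

```python
import math

def upper_unsimplified(s, n, alpha, kappa, eta):
    """Upper bound assembled directly from Eqs. (12), (17) and (10),
    before any simplification; s plays the role of sum(1/x_i)."""
    g = math.gamma
    na = n * alpha
    numerator = (kappa**eta / (s * g(eta))) * (s**na / g(na)) * (
        eta * g(na + eta) / (s + kappa) ** (na + eta)
        + kappa * g(na + eta + 1) / (s + kappa) ** (na + eta + 1)
    )
    # Beta(na, eta) = Gamma(na) * Gamma(eta) / Gamma(na + eta)
    denominator = (kappa**eta * g(na + eta) / (g(na) * g(eta))) * (
        s**na / (s + kappa) ** (na + eta)
    )
    return numerator / denominator

def upper_closed_form(s, n, alpha, kappa, eta):
    """Closed form of Eq. (20)."""
    return (eta + kappa * (n * alpha + eta) / (s + kappa)) / s

print(upper_unsimplified(5.0, 10, 0.5, 0.2, 2.0))
print(upper_closed_form(5.0, 10, 0.5, 0.2, 2.0))
```

Both functions return the same value up to floating-point error, confirming the cancellations behind Eqs. (19) and (20).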

In order to show the performance of the methodology that leads to the lower and upper bounds, we have conducted a simulation study comprising two parts. First we simulate *N* = 100 samples for each sample size *n* = 10, 11, ⋯, 100 from the inverse gamma distribution with parameters (*α*, *β*) = (0.5, 1) in each iteration. For each of these samples we calculate the lower and upper bounds of the Wasserstein distance and calculate the average over all *N* replications, together with the difference between the bounds. Finally we plot these values for each sample size in **Figure 1**. We repeat the same process for *N* = 1000 samples with the same sizes. The hyperparameters from the prior gamma distribution are (*κ*, *η*) = (0.2, 2). We clearly observe how fast these values decrease with the sample size. Of course, augmenting the number of replications does not increase the speed of convergence; however, the curves become noticeably smoother.
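A condensed version of the first part of this study can be sketched as follows (NumPy-based; `ig_bounds` is an illustrative helper implementing the lower bound of Eq. (5) and the upper bound of Eq. (20)):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.5, 1.0   # parameters of the simulated IG data
kappa, eta = 0.2, 2.0    # hyperparameters of the gamma prior

def ig_bounds(x):
    """Lower bound (Eq. (5)) and upper bound (Eq. (20)) on d_W(P1, P2)."""
    n, s = len(x), np.sum(1.0 / x)
    lower = abs(n * alpha * kappa - eta * s) / (s * (s + kappa))
    upper = (eta + kappa * (n * alpha + eta) / (s + kappa)) / s
    return lower, upper

N = 100  # replications per sample size
for n in (10, 50, 100):
    # IG(alpha, beta) samples as reciprocals of Gamma(alpha, rate beta) draws
    draws = 1.0 / rng.gamma(alpha, 1.0 / beta, size=(N, n))
    lo, up = np.mean([ig_bounds(x) for x in draws], axis=0)
    print(n, lo, up, up - lo)
```

Averaging over a finer grid of sample sizes and plotting the three printed columns should reproduce the qualitative behavior of **Figure 1**: both bounds and their difference shrink with *n*.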

This methodology not only helps practitioners make a decision between existing priors in theory, but also tells them from what sample size on the effect of choosing one prior becomes less important, especially in situations where cost and time matter. This can be particularly useful when the hesitation is between a simple, closed-form prior and a more complicated one. It is advisable to use the simpler one when there is no considerable difference between the effects of the two priors.

#### **Figure 1.**

*(a) Shows the bounds and the distances between the bounds for N = 100 iterations for each sample size 10–100 by steps of 1, and (b) illustrates the same situation for N = 1000. The hyperparameters are κ = 0.2 and η = 2, while the fixed parameter α equals 0.5.*

#### **3.2 The impact of priors for the success parameter of the binomial model**

The probability mass function of a binomial distribution is given by

$$x \mapsto \binom{n}{x} \theta^x (1 - \theta)^{n - x}$$

where $x \in \{0, 1, \cdots, n\}$ is the number of observed successes, the natural number *n* indicates the number of binary trials and $\theta \in (0, 1)$ stands for the success parameter. In this setting we suppose *n* is fixed and the underlying parameter of interest is *θ*.

A comprehensive comparison of various priors for the binomial distribution including a beta prior, the Haldane prior and Jeffreys prior, has been done in [9], based on the methodology described above. Therefore, since there is a complete reference for the reader in this case, we use the binomial distribution as a second example to show numerical results.

The theoretical lower and upper bounds between a $Beta(\alpha, \beta)$ prior and the flat uniform prior are given by

$$\left| \frac{x+1}{n+2}\left(\frac{\alpha+\beta-2}{n+\alpha+\beta}\right) - \frac{\alpha-1}{n+\alpha+\beta} \right| \le d\_{\mathcal{W}}(P\_1, P\_2) \le \frac{1}{n+2}\left\{|\alpha-1| + \frac{x+\alpha}{n+\alpha+\beta}\left(|\beta-1| - |\alpha-1|\right)\right\},$$

#### **Table 1.**

*The summary of upper and lower bounds for different hyperparameters, with p = 0.2 and for N = 50 iterations.*

| **Hyperparameters (α, β)** | **Average of the lower bounds** | **Average of the upper bounds** |
| --- | --- | --- |
| (0.2, 0.4) | 0.002561383 | 0.003726728 |
| (0.2, 0.8) | 0.00296002 | 0.003344393 |
| (2, 2) | 0.002699325 | 0.00490119 |
| (2, 5) | 0.0008115384 | 0.007984289 |
| (2, 10) | 0.004506271 | 0.01241273 |
| (2, 15) | 0.008208887 | 0.01626326 |
| (2, 30) | 0.01750177 | 0.02581062 |
| (2, 50) | 0.02739205 | 0.0359027 |
| (2, 100) | 0.04592235 | 0.05470826 |
| (2, 200) | 0.07071766 | 0.07976386 |
| (2, 500) | 0.1103048 | 0.1196464 |
| (2, 1000) | 0.1399961 | 0.1495087 |
| (10, 2) | 0.02813367 | 0.03132908 |
| (35, 2) | 0.08571115 | 0.09033568 |
| (50, 2) | 0.1127136 | 0.1178113 |
| (100, 2) | 0.1830272 | 0.189071 |
| (200, 2) | 0.2783722 | 0.2853418 |
| (400, 2) | 0.3933338 | 0.401145 |
| (700, 2) | 0.4901209 | 0.4985089 |
| (1000, 2) | 0.5482869 | 0.5569829 |

where *x* is the observed number of successes. We see that both lower and upper bounds are of the order of $O(n^{-1})$. This rate of convergence remains even in the extreme cases *x* = 0 and *x* = *n*. We invite the reader to see [3, 9] for more details.

In order to illustrate the behavior of the lower and upper bounds and the distances between them, we have conducted a two-part simulation study for the binomial distribution. First, we consider 100 sample sizes (numbers of trials in the binomial distribution) varying from 10 to 1000 by steps of 10, and generate binomial data exactly once for every sample size (with *θ* = 0.2). The results of the bounds, obtained for hyperparameters (*α*, *β*) = (2, 4) from the beta prior, are reported in **Figure 2a**, and we can see that, even with only one iteration, when the number of trials (the sample size) increases the lower and upper bound become closer, which is a numerical quantification of the fact that the influence of the choice of the prior wanes asymptotically. This also becomes visible from the distance between the two bounds. Sampling only once for each sample size leads to slightly unpleasant variations in the lower bounds (non-monotone behavior), which however nearly disappear in the second considered scenario. Indeed, in **Figure 2b** we increased the number of iterations to 50 for the same sample sizes and took averages. A better smoothness is the consequence. This simulation study not only provides the reader with numerical values for the bounds, to which he/she can compare bounds obtained for real data, but also gives a nice visualization of the impact of the choice of the prior at fixed sample size. The main conclusion is that the impact drops fast at small sample sizes, and the bounds become very close for medium-to-large sample sizes.
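The first part of this study can be condensed into a short sketch (NumPy-based; `binom_bounds` is an illustrative helper implementing the two displayed bounds):

```python
import numpy as np

def binom_bounds(x, n, a, b):
    """Lower/upper bounds on d_W(P1, P2) for x successes in n trials,
    comparing a Beta(a, b) prior with the flat uniform prior."""
    lower = abs((x + 1) / (n + 2) * (a + b - 2) / (n + a + b)
                - (a - 1) / (n + a + b))
    upper = (abs(a - 1)
             + (x + a) / (n + a + b) * (abs(b - 1) - abs(a - 1))) / (n + 2)
    return lower, upper

rng = np.random.default_rng(1)
theta, a, b = 0.2, 2, 4
for n in (10, 100, 1000):
    x = rng.binomial(n, theta)  # one binomial draw per number of trials
    lo, up = binom_bounds(x, n, a, b)
    print(n, lo, up, up - lo)
```

The printed gap between the bounds shrinks as *n* grows, matching the $O(n^{-1})$ rate noted above.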

Finally, we investigate the impact of the hyperparameters on the upper and lower bounds. To this end, we varied both *α* and *β* in **Table 1**. The situation with *α* fixed to two and relatively small *β* corresponds well with *p* = 0.2, which explains why the upper and lower bounds, and hence the Wasserstein distance and thus the impact of the prior, are the smallest. Increasing *β* further augments the distance.

#### **Figure 2.**


*(a) Shows the lower and upper bounds and the distances for the number of trials {n = 10, …, 1000} for one iteration. (b) Shows the same situation, however this time based on averages obtained for 50 iterations. In both situations the hyperparameters from the beta prior are α = 2 and β = 4.*




*On the Impact of the Choice of the Prior in Bayesian Statistics*

*DOI: http://dx.doi.org/10.5772/intechopen.88994*



#### **Figure 3.**

*Plots of the beta prior densities together with the average lower and upper bounds (and their difference) on the Wasserstein distance between the data-based posterior and the posterior resulting from each beta prior.*

On the contrary, fixing *β* = 2 yields priors rather centered around large values of *p* and hence bigger distances. Moreover, the more *α* is increased, the more the distance grows, as the prior sits further away from the data and hence has a stronger impact on the posterior at a fixed sample size. For the sake of illustration, we present three choices of hyperparameters together with the bounds and the related prior densities in **Figure 3**. This will help in understanding our conclusions.

### **4. Conclusions**

In this chapter we have presented a recently developed measure of the impact of the choice of the prior distribution in Bayesian statistics. We have presented the general theoretical result, explained how to use it in a particular example, and provided some graphics to illustrate it numerically. This study is of practical importance when practitioners hesitate between two proposed priors in a given situation. For instance, Kavetski et al. [10] considered a storm depth multiplier model to represent rainfall uncertainty, where the errors appear in multiplicative form and are assumed to be normal. They fix the mean, but state that "less is understood about the degree of rainfall uncertainty," i.e., the multiplier variance, and therefore studied various priors for the variance. Knowledge of the tools presented in this chapter would have simplified the decision process.

In the case of missing data, the present methodology can still be used: either the data are imputed, in which case nothing changes, or the missing data are simply left out of the calculation of the upper and lower bounds, whose expressions of course do not change.

Further developments of this new measure might lead to a more concrete quantification of terms such as "informative," "weakly informative," and "noninformative" priors, and we hope to have stimulated interest in this promising new line of research within Bayesian inference.

### **Acknowledgements**

This research is supported by a BOF Starting Grant of Ghent University.

#### **Author details**

Fatemeh Ghaderinezhad and Christophe Ley\*

Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium

\*Address all correspondence to: christophe.ley@ugent.be

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Diaconis P, Freedman D. On the consistency of Bayes estimates (with discussion and rejoinder by the authors). The Annals of Statistics. 1986;**14**:1-67

[2] Diaconis P, Freedman D. On inconsistent Bayes estimates of location. The Annals of Statistics. 1986;**14**:68-87

[3] Ley C, Reinert G, Swan Y. Distances between nested densities and a measure of the impact of the prior in Bayesian statistics. Annals of Applied Probability. 2017;**27**:216-241

[4] Ghaderinezhad F, Ley C. Quantification of the impact of priors in Bayesian statistics via Stein's method. Statistics & Probability Letters. 2019; **146**:206-212

[5] Rüschendorf L. Wasserstein metric. In: Michiel H, editor. Encyclopedia of Mathematics. Netherlands: Springer Science+Business Media B.V./Kluwer Academic Publishers; 2001

[6] Stein C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Univ. California, Berkeley, CA, 1970/ 1971. 1972. pp. 583-602

[7] Ross N. Fundamentals of Stein's method. Probability Surveys. 2011;**8**: 210-293

[8] Ley C, Reinert G, Swan Y. Stein's method for comparison of univariate distributions. Probability Surveys. 2017; **14**:1-52

[9] Ghaderinezhad F. New insights into the impact of the choice of the prior for the success parameter of binomial distributions. Journal of Mathematics, Statistics and Operations Research, forthcoming

[10] Kavetski D, Kuczera G, Franks SW. Bayesian analysis of input uncertainty in hydrological modeling: 1. Theory. Water Resources Research. 2006;**42**:W03407


Section 2

## Some Advances on Sampling Methods



**Chapter 2**

## A Brief Tour of Bayesian Sampling Methods

*Michelle Y. Wang and Trevor Park*

#### **Abstract**

Unlike in the past, the modern Bayesian analyst has many options for approximating intractable posterior distributions. This chapter briefly summarizes the class of posterior sampling methods known as Markov chain Monte Carlo, a type of dependent sampling strategy. Varieties of algorithms exist for constructing chains, and we review some of them here. Such methods are quite flexible and are now used routinely, even for relatively complicated statistical models. In addition, extensions of the algorithms have been developed for various goals. General-purpose software is currently also available to automate the construction of samplers, freeing the analyst to focus on model formulation and inference.

**Keywords:** Markov chain Monte Carlo, Gibbs sampler, slice sampler, Metropolis-Hastings, Hamiltonian Monte Carlo, cluster sampling, JAGS, Stan

#### **1. Introduction**

Modern Bayesian data analysis is enabled by specialized computational tools. Except in relatively simple models, explicit solutions for quantities relevant to Bayesian inference are not available. This limitation has sparked the development of many different approximation methods.

Some approximation methods, such as Laplace approximation [1] and variational Bayes [2], are based on replacing the Bayesian posterior density with a computationally convenient approximation. Such methods may have the advantage of relatively quick computation and scalability, but they leave open the question of how much the resulting approximate Bayesian inference can be trusted to reflect the actual Bayesian inference. There is an inherent bias in the approximation that generally cannot be reduced by applying more intensive computation.

When accuracy is important, simulation-based (stochastic) methods offer an attractive alternative. The goal of these methods is to produce a simulation sample (though not necessarily an independent one) from the (joint) posterior distribution. A simulation sample can be used to approximate almost any quantity relevant to Bayesian inference, including posterior expectations, variances, quantiles, and marginal densities. Since the approximations become more exact as more samples are used, accuracy tends to be limited only by the computational resources available.

Random variates from a general probability distribution that has a known density may be simulated using many classical methods, such as accept/reject and importance sampling. However, such methods tend to be efficient only in special cases and often require analytical insight to improve efficiency. The past three

decades have seen interest dramatically increase in the category of Markov chain Monte Carlo (MCMC) methods. Unlike most classical methods, MCMC can often be efficiently automated, even for moderately complicated models. A variety of MCMC methods are available, giving the analyst flexibility in implementation. Moreover, general software is now available that automates most computational details, allowing the analyst to focus on model formulation and inference.

The purpose of this chapter is to offer an introduction to Bayesian simulation methods, with emphasis on MCMC. The motivation and popularity of posterior sampling are illustrated in Section 2. Section 3 describes MCMC and the associated issues including convergence monitoring, mixing, and thinning. Varieties of specific sampling methods are provided in Sections 4 and 5, with the general-purpose software implementing them described in Section 6.

#### **2. Posterior sampling**

Bayesian inference requires access to the posterior distribution. Let *y* denote all of the data to be modeled, and suppose its sampling distribution is in a parametric family with density *π*(*y*|*θ*), where *θ* represents the parameter (usually a vector), including any hyperparameters. If the prior on *θ* has density *π*(*θ*) then, according to Bayes' rule, the posterior distribution has density

$$
\pi(\theta|y) = \frac{\pi(y|\theta)\,\pi(\theta)}{\pi(y)} \propto\_{\theta} \pi(y|\theta)\,\pi(\theta) \tag{1}
$$


*A Brief Tour of Bayesian Sampling Methods DOI: http://dx.doi.org/10.5772/intechopen.91451*


where the proportionality is in *θ* (not *y*). (An improper prior can be used, provided the posterior is proper.) The normalizing constant *π*(*y*) is notoriously difficult to compute, so methods that avoid using it are preferred.

Since *π*(*y*|*θ*) and *π*(*θ*) are typically specified by the analyst, the (unnormalized) posterior density is readily available and typically easy to evaluate. Nonetheless, most quantities used in Bayesian inference (posterior expected values, quantiles, marginal densities, etc.) are defined by integrals involving the posterior density, which are usually intractable and are difficult to deterministically approximate when *θ* has more than a few components.

This explains the popularity of posterior sampling. Given a sample from the posterior of sufficient effective size, posterior expected values can be approximated by sample means, posterior quantiles by sample quantiles, posterior marginal densities by sample-based density estimates, and so forth. Most posterior inference is readily accomplished if an efficient method of sampling from the posterior is available.
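As a minimal illustration of this idea (an illustrative sketch, not an example from the chapter), consider a conjugate binomial model, where the posterior is a beta distribution that can be sampled directly; sample summaries then stand in for the posterior quantities:

```python
import numpy as np

# Hypothetical conjugate setup: x successes in n Bernoulli trials with a
# Beta(2, 2) prior on theta give a Beta(2 + x, 2 + n - x) posterior.
rng = np.random.default_rng(42)
n, x = 50, 18
samples = rng.beta(2 + x, 2 + n - x, size=100_000)

post_mean = samples.mean()                  # approximates E[theta | data]
ci = np.quantile(samples, [0.025, 0.975])   # central 95% posterior interval
prob = (samples < 0.5).mean()               # approximates P(theta < 0.5 | data)
print(post_mean, ci, prob)
```

Here the exact posterior mean is (2 + x)/(4 + n) = 20/54 ≈ 0.370, so the accuracy of the sample-mean approximation can be verified directly; in realistic models no such closed form exists, which is what motivates general sampling methods.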

*Independent* sampling from the posterior is seemingly ideal, since relatively few samples are required to obtain a good approximation in most cases, and the approximation error is relatively easy to characterize. Unfortunately, methods for independent sampling have proven difficult to implement in a general way that efficiently scales with the dimension of *θ*. For example, rejection sampling (accept/ reject) is efficient only if the posterior is tightly bounded by a known function proportional to a density that is easy to sample. Finding such a function is generally difficult, and even adaptive variants struggle in high-dimensional situations.
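To make the efficiency issue concrete, here is a small accept/reject sketch for a toy one-dimensional target (a Beta(2, 4) density, chosen purely for illustration; not an example from the chapter). The acceptance rate equals the ratio of the area under the target to the area under the envelope, and a comparably tight envelope is generally hard to find in high dimensions:

```python
import numpy as np

def target(x):
    """Unnormalized Beta(2, 4) density on [0, 1]."""
    return x * (1 - x) ** 3

M = target(0.25)                     # maximum of the target (attained at x = 1/4)
rng = np.random.default_rng(0)
xs = rng.uniform(size=100_000)       # proposals from the uniform envelope
keep = rng.uniform(size=xs.size) * M < target(xs)
samples = xs[keep]
# Acceptance rate is about 0.47 here; the retained draws follow the
# Beta(2, 4) distribution, whose mean is 1/3.
print(keep.mean(), samples.mean())
```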

Currently, the most efficient generally adaptable methods use *dependent* sampling. Dependent sampling usually incurs a computational cost of acquiring a larger number of samples to attain a given accuracy, but the flexibility of these methods and their scalability to higher dimensions offset this disadvantage.

#### **3. Sampling with Markov chains**


MCMC is a type of dependent sampling in which the samples are obtained from successive states of a discrete-time Markov chain [3]. The Markov chain is designed to be easy to simulate and intended to (eventually) produce samples that have a distribution arbitrarily close to the posterior distribution.

Specifically, the Markov chain is designed to have a particular *stationary distribution*: a distribution on the state space of the chain that is preserved by the transition kernel. If the chain is started in the stationary distribution, all successive states will have the stationary distribution. In the most basic case, the state space will be the range of *θ*, and the stationary distribution will be the posterior distribution. A collection of successive states can then be regarded as a (dependent) sample from the posterior.

Since starting the chain in the stationary distribution is difficult, MCMC relies on the stationary distribution also being the (unique) *limiting distribution*: the distribution to which the states converge (in law) as the time index increases. Conditions under which the chain converges are technical (e.g., [4]) and can be difficult to verify analytically in complicated models. Thus, though convergence properties may benefit from following some general guidelines in specifying the MCMC method, convergence is usually checked empirically.

General convergence monitoring tools and techniques are available to determine by what time point convergence has been practically achieved, so that accurate samples can be collected thereafter. See [5] for an overview. Some tools rely on simulating the chain several times, independently, from different starting points.

Running the chain(s) until declaring convergence is called *burn-in*, or sometimes *warm up*. All values generated during burn-in are discarded, except for the final state, which becomes the starting point for sampling.
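One widely used diagnostic built on several independently started chains is the potential scale reduction factor (often written R-hat), which compares between-chain and within-chain variability; values near 1 suggest the chains have forgotten their starting points. The chapter does not spell out a formula, so the sketch below, with synthetic AR(1) chains standing in for MCMC output, is only illustrative:

```python
import numpy as np

def r_hat(chains):
    """Basic potential scale reduction factor for an (m, n) array of m chains."""
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    within = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))

# Four AR(1) chains targeting N(0, 1), started from overdispersed points:
rng = np.random.default_rng(3)
m, n, rho = 4, 5_000, 0.9
chains = np.empty((m, n))
chains[:, 0] = np.array([-10.0, -3.0, 3.0, 10.0])
for t in range(1, n):
    chains[:, t] = rho * chains[:, t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(m)

# Early segment reflects the starting points (R-hat well above 1);
# a post-burn-in segment gives a value near 1.
print(r_hat(chains[:, :20]), r_hat(chains[:, 1_000:]))
```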

The degree of dependence within a Markov chain determines the number of samples needed for a given level of approximation. Most MCMC methods produce chains with positive dependence, requiring a larger number of samples to be taken than if independent sampling were used. Chains that are highly dependent exhibit slow *mixing*: the decay rate of dependence between the states of the chain at two time points as the time lag increases. In extreme cases, slow mixing makes MCMC computationally prohibitive, since an enormous number of samples may be needed to achieve a reasonable approximation. Methods with fast mixing are typically preferred.

When sampling is highly dependent, using only a regularly spaced subsample of the generated values may be almost as accurate as using all of the values. Retaining only the regularly spaced subsample is called *thinning*. Although it does not reduce the amount of computation required, it can dramatically reduce the time and space required for storage of the values.
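The effect of thinning is easy to see on synthetic output. Below, an AR(1) sequence with lag-one correlation 0.95 stands in for a slowly mixing chain (an illustrative stand-in, not one of the chapter's samplers); keeping every 20th draw removes most of the dependence:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.95, 200_000
chain = np.empty(n)
chain[0] = rng.standard_normal()
for t in range(1, n):                   # AR(1) with stationary N(0, 1) target
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

thinned = chain[::20]                   # retain every 20th value

def lag1_autocorr(x):
    """Lag-one sample autocorrelation."""
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

# Dependence drops from about 0.95 to about 0.95**20 (roughly 0.36).
print(lag1_autocorr(chain), lag1_autocorr(thinned))
```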

Characterizing Monte Carlo error in approximations from an MCMC sample is more difficult than from an independent sample. However, effective methods are available for most cases. See [6].

#### **4. Constructing Markov chains for sampling**

This section briefly summarizes the most practical and frequently used methods for forming a Markov chain appropriate for sampling from a posterior distribution. All of them need only a function proportional to the posterior density of *θ*, as in Eq. (1). For brevity, we denote it as

$$f(\theta) \propto\_{\theta} \pi(y|\theta)\pi(\theta) \tag{2}$$

where the proportionality is in *θ* only, and the dependence on *y* has been suppressed in the notation.

#### **4.1 Gibbs sampling**

Consider a partition of *θ* into *K* pieces (which may themselves be vectors):

$$\theta = (\theta\_1, \dots, \theta\_K) \tag{3}$$


where the proportionality is in *θ* only, and the dependence on *y* has been suppressed in the notation.

#### **4.1 Gibbs sampling**

Consider a partition of *θ* into *K* pieces (which may themselves be vectors):

$$
\theta = (\theta_1, \ldots, \theta_K) \tag{3}
$$

The *full conditional* (or *conditional posterior*) distribution of *θ<sub>k</sub>* is its posterior distribution conditional on all the other pieces *θ<sub>−k</sub>*, i.e., the distribution with density

$$
\pi(\theta_k|\theta_{-k}, y) \tag{4}
$$

Gibbs sampling, in its purest form, is sequential sampling from the full conditional distributions of *θ<sub>k</sub>*, *k* = 1, …, *K*, each time conditioning upon the most recently sampled value for each component of *θ<sub>−k</sub>*. Each complete cycle of this process produces a single sampled value of *θ*, and these successive values form a Markov chain whose stationary distribution (if unique) is the posterior distribution (since each step in the cycle preserves the posterior distribution of *θ*).

Essentially, Gibbs sampling reduces the problem of sampling *θ* to the problem of conditionally sampling each of its pieces. It relies on each full conditional being easy to sample. Because the pieces are of lower dimension (perhaps even one-dimensional), they may be easier to sample by conventional methods. Moreover, it is often possible to choose a prior distribution such that many of the full conditionals are easy to sample. For example, when conditional priors are chosen from easily sampled families that are *partially conjugate* to the sampling model (see, e.g., [7]), the Gibbs sampler is easy to construct. Even if a full conditional cannot be directly sampled, its density is proportional to *f*(*θ*), since

$$
\pi(\theta\_k | \theta\_{-k}, y) = \frac{\pi(\theta | y)}{\pi(\theta\_{-k} | y)} \propto\_{\theta\_k} f(\theta) \tag{5}
$$

where the proportionality is in *θ<sub>k</sub>* only (for fixed *θ<sub>−k</sub>*). The density of the full conditional is therefore known (up to a constant scaling), so the techniques described in the following subsections may be used.

Performance of Gibbs sampling can sometimes be improved by modifying the algorithm. For example, the order in which the pieces are sampled can affect the mixing rate (e.g., [8]). Also, replacing some of the full conditional distributions with (partial) posterior marginals results in a *partially collapsed* Gibbs sampler, which may have better sampling properties [9], though it must be implemented carefully to preserve the stationary distribution (e.g., [10]).

Even when a Gibbs sampler is easy to implement, its mixing can be arbitrarily slow. This happens especially when there is a high degree of posterior dependence among the pieces of *θ*, such as when some pieces are highly correlated, or when the posterior density exhibits multiple modes offset "diagonally" from each other. Mixing may be improved by alternating Gibbs sampler cycles with steps of some other kind of MCMC, or by special modifications described in the following subsections.
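As a concrete illustration (not taken from the text), consider a bivariate normal posterior with zero means, unit variances, and correlation 0.9. Each full conditional is itself normal, so the two-block Gibbs cycle is immediate; the target, function name, and tuning values below are illustrative choices:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, seed=0):
    """Two-block Gibbs sampler for a bivariate normal target with
    zero means, unit variances, and correlation rho.  Each full
    conditional is N(rho * other, 1 - rho**2)."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    sd = np.sqrt(1.0 - rho**2)
    for i in range(n_samples):
        # Sample theta1 from its full conditional given theta2.
        theta1 = rng.normal(rho * theta2, sd)
        # Sample theta2 from its full conditional given the new theta1.
        theta2 = rng.normal(rho * theta1, sd)
        samples[i] = theta1, theta2
    return samples

draws = gibbs_bivariate_normal(rho=0.9)
print(np.corrcoef(draws[1000:].T)[0, 1])  # close to 0.9 after burn-in
```

With a correlation this high, successive scans move slowly across the posterior, illustrating the slow mixing under strong posterior dependence described above.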

#### **4.2 Auxiliary variables**

Gibbs sampling can be facilitated by techniques that involve sampling more than just the parameter *θ*. *Data augmentation* involves adding latent variables, usually as

intermediaries in a hierarchical structure, that make full conditionals easier to sample. *Parameter expansion* involves creating extra dimensions in the parameter space that do not affect the Bayesian model, but allow a faster-mixing Markov chain to be constructed.

Data augmentation is natural in models that are defined using random effects. The random effects simply become latent variables to be sampled with the parameters. But it can also be used to add purely artificial latent variables designed to make full conditionals easy to sample. For example, data modeled with a location-scale *t*-distribution lacks any direct partial conjugacy properties. Nonetheless, a *t*-distribution can be represented as a scale mixture of normal distributions, with an inverse gamma distribution for the scale (variance). The scale variables then become the latent variables. Both the normal and inverse gamma distributions enjoy partial conjugacy properties that make full conditionals easy to sample. See [7], Section 12.1, for details.
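A minimal sketch of this augmentation, assuming unit scale and a flat prior on the location *μ* (simplifying choices made here for brevity, not taken from [7]): given the latent variances, *μ* has a normal full conditional, and each latent variance has an inverse gamma full conditional.

```python
import numpy as np

def t_location_gibbs(y, nu=4.0, n_iter=3000, seed=1):
    """Gibbs sampler for the location mu of y_i ~ t_nu(mu, 1),
    via the scale-mixture representation y_i | lam_i ~ N(mu, lam_i),
    lam_i ~ InvGamma(nu/2, nu/2), with a flat prior on mu."""
    rng = np.random.default_rng(seed)
    mu = np.median(y)
    mus = np.empty(n_iter)
    for i in range(n_iter):
        # Full conditional of each latent variance lam_i:
        # InvGamma((nu + 1)/2, (nu + (y_i - mu)^2)/2).
        lam = 1.0 / rng.gamma((nu + 1) / 2, 2.0 / (nu + (y - mu) ** 2))
        # Full conditional of mu: normal with precision-weighted mean.
        w = 1.0 / lam
        mu = rng.normal((w * y).sum() / w.sum(), np.sqrt(1.0 / w.sum()))
        mus[i] = mu
    return mus

y = np.random.default_rng(0).standard_t(4, size=200) + 3.0
mus = t_location_gibbs(y)
print(mus[500:].mean())  # posterior mean near the true location 3.0
```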

Parameter expansion involves defining a redundant parameter *ρ* unrelated to the model itself and supplying it with an arbitrary prior density. The expanded parameterization (*θ*, *ρ*) is then reparameterized in a way specially chosen to improve Gibbs sampler performance. A basic example can be found in Section 12.1 of [7]. It is sometimes possible to use an improper prior on *ρ*. This leads to a Gibbs sampler that lacks a stationary distribution, but may still be able to produce valid posterior samples (see [11]).

Parameter expansion is typically used in conjunction with data augmentation, whence it is known as parameter expansion-data augmentation (PX-DA) [12].

#### **4.3 Slice sampling**

One general-purpose method to sample from an arbitrary univariate continuous density is to first sample uniformly from the bivariate (unit area) region beneath its graph and then retain only the horizontal coordinate. The uniform sampling could be performed by a simple two-step Gibbs sampler, alternating between vertical and horizontal sampling. This general approach is called *slice sampling* [13]. It can be interpreted as a special auxiliary variables method, with the vertical coordinate representing the auxiliary variable.

For a multivariate *θ*, slice sampling can be performed on one univariate piece at a time, as in a Gibbs sampler. Specifically, if *θ<sub>k</sub>* is continuous and univariate, then the slice sampler first samples *v* uniformly from the interval (0, *f*(*θ*)), then samples *θ<sub>k</sub>* uniformly from {*θ<sub>k</sub>* : *f*(*θ*) > *v*}. Sampling is simplest when the latter set is an interval with easily computed endpoints, but adaptive methods are available for when this is not the case [13].

Though multivariate versions of slice sampling exist (e.g., [14]), practical implementations are often univariate and implemented as a single step within a Gibbs sampler framework, for continuous pieces that would otherwise be difficult to sample.
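A univariate slice-sampling step with an adaptive "stepping-out and shrinkage" search for the slice, in the style of [13], might be sketched as follows (the target density and initial width *w* are illustrative; the density is needed only up to a constant, so we work on the log scale):

```python
import numpy as np

def slice_sample(logf, x0, n_samples=2000, w=1.0, seed=2):
    """Univariate slice sampler: sample a vertical level under f,
    then sample horizontally from the slice {x : f(x) > v},
    locating the slice by stepping out and shrinking.
    logf is the log density, known only up to a constant."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n_samples)
    for i in range(n_samples):
        # Vertical step: v uniform on (0, f(x)), kept on the log scale.
        log_v = logf(x) + np.log(rng.uniform())
        # Step out an interval [l, r] of width w until it brackets the slice.
        l = x - w * rng.uniform()
        r = l + w
        while logf(l) > log_v:
            l -= w
        while logf(r) > log_v:
            r += w
        # Shrinkage: propose uniformly in [l, r], shrinking on rejection.
        while True:
            x_new = rng.uniform(l, r)
            if logf(x_new) > log_v:
                x = x_new
                break
            if x_new < x:
                l = x_new
            else:
                r = x_new
        out[i] = x
    return out

# Example: standard normal target, specified only up to a constant.
draws = slice_sample(lambda x: -0.5 * x * x, x0=0.0)
print(draws.mean(), draws.std())  # roughly 0 and 1
```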

#### **4.4 Metropolis-Hastings**

A general approach to posterior sampling is to perform a carefully controlled random walk over the parameter space. The steps are chosen such that the resulting Markov chain has the posterior as its stationary distribution. This is accomplished by the *Metropolis-Hastings algorithm*.

In one popular version, the properties of the algorithm are determined by the choice of a random walk. The choice is arbitrary, but it is often such that each step is easy to simulate and can transition from any point in the parameter space to any other point. Let *T*(*θ*′|*θ*) be its *transition kernel* for a step from *θ* to *θ*′. For example, if *θ* is chosen according to some continuous distribution with density *π̃*(*θ*), then taking one step of the random walk from *θ* to *θ*′ will result in *θ*′ having density

$$\int T(\theta'|\theta)\,\tilde{\pi}(\theta)d\theta \tag{6}$$


(We assume *T* is time-invariant, although this is not necessary, provided the time dependence does not depend on the history of the Markov chain.) The density *T*(·|*θ*) defines the *proposal distribution* when the current state is *θ*. Values of this density (up to a constant factor that does not depend on *θ*) must be computable.

The transitions of the Markov chain are determined by the following algorithm: let *θ*<sup>old</sup> be the current state of the chain. Then

1. Sample a *proposal* *θ*′ from the proposal distribution at *θ*<sup>old</sup>.

2. Compute

$$\alpha = \frac{f(\theta') / T(\theta'|\theta^{\text{old}})}{f(\theta^{\text{old}}) / T(\theta^{\text{old}}|\theta')} \tag{7}$$

3. Set the next state of the chain to be

$$\theta^{\text{new}} = \begin{cases} \theta' & \text{with probability } \min\left(\alpha, 1\right) \\ \theta^{\text{old}} & \text{otherwise} \end{cases} \tag{8}$$

Note the possibility that the next state of the chain will be identical to the previous state, even if *θ* is continuous under the posterior. If *θ*′ actually becomes the next state of the chain, we say that the proposal is *accepted*. The long-run fraction of times the proposal is accepted is the *acceptance rate*.

General proof that this algorithm produces a Markov chain with the posterior as its stationary distribution can be found in, for example, [15]. Convergence properties have been extensively studied [4].

One important special case is the *Metropolis algorithm*, in which the transition kernel is symmetric: *T*(*θ*′|*θ*) = *T*(*θ*|*θ*′). In this case, *T* cancels from Eq. (7), so there is no need to compute its values. If the parameter *θ* is continuous on an open subset of a space of real vectors, a typical example is a multivariate normal proposal distribution centered at the current value *θ*<sup>old</sup>. The covariance matrix is arbitrary and can be chosen to make the sampling more efficient.

Proposal distributions often admit a choice of scaling that can be tuned to improve sampling efficiency. Setting the scale too large leads to a low acceptance rate, hence slow mixing due to many repeated values. Setting the scale too small leads to a high acceptance rate, but each proposal will be close to the current value, and hence the mixing will also be slow. In some cases, theoretical results are available to guide the choice of scale. For example, for the Metropolis algorithm, research suggests that the optimal acceptance rate is about 0.44 for a one-dimensional *θ* and quickly falls to about 0.23 as the dimension of *θ* increases [16].

In addition, the shape of the proposal distribution can often be tuned. Perhaps the best shapes are ones that approximate the shape of the posterior distribution, since then proposals will tend to be in directions in which the posterior is wider. While the exact shape of the posterior may not be obvious, it may still be possible to choose a proposal that has a similar covariance structure.
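The algorithm above can be sketched for a univariate target (the proposal scale of 2.4 below is an illustrative choice guided by the acceptance-rate heuristics just discussed, not a value from the text). Because the normal proposal is symmetric, *T* cancels from Eq. (7):

```python
import numpy as np

def metropolis(logf, x0, scale=2.4, n_samples=5000, seed=3):
    """Random-walk Metropolis with a symmetric normal proposal.
    Since T(x'|x) = T(x|x'), the kernel cancels from Eq. (7)
    and alpha reduces to f(x')/f(x)."""
    rng = np.random.default_rng(seed)
    x = x0
    logf_x = logf(x)
    out = np.empty(n_samples)
    accepted = 0
    for i in range(n_samples):
        x_prop = x + scale * rng.normal()          # symmetric proposal
        logf_prop = logf(x_prop)
        # Accept with probability min(alpha, 1), alpha = f(x')/f(x).
        if np.log(rng.uniform()) < logf_prop - logf_x:
            x, logf_x = x_prop, logf_prop
            accepted += 1
        out[i] = x                                  # repeat x if rejected
    return out, accepted / n_samples

draws, rate = metropolis(lambda x: -0.5 * x * x, x0=0.0)
print(rate)  # for a 1-d normal target, in the vicinity of the ~0.44 optimum
```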


The scale and shape of the proposal can be tuned in an automated manner, by making a preliminary run of the algorithm during which features of the proposal are modified adaptively to improve efficiency. This stage of *adaptation* occurs prior to burn-in: The algorithm is not a Markov chain when the proposal distribution is changed based on the sampling history, so it may not be converging to the posterior. Once adaptation is declared complete, the proposal distribution is kept fixed for burn-in and for sampling.

Although it may not be obvious, exact Gibbs sampling can actually be viewed as a special case of Metropolis-Hastings (e.g., [3]). The acceptance ratio *α* turns out to always equal 1 in this situation, so no tuning is needed. Also, in a Gibbs sampler context, when a piece of *θ* cannot be easily simulated using conventional methods, its Gibbs step may be replaced with an easier Metropolis step for the full conditional of that piece.

Since the posterior density is analytically available (up to a constant factor), its local properties may suggest an efficient choice of proposal distribution. For a continuous posterior, *Langevin* methods use the gradient of the log posterior density at the current point to adaptively choose the proposal distribution (e.g., [16]). This provides higher optimal acceptance rates and better scaling properties than pure Metropolis, though at the expense of more computation for each step. While this is an important improvement, modern practice has evolved even further to use more global properties of the posterior density, as detailed in the next subsection.
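A hedged sketch of a Metropolis-adjusted Langevin step for a univariate target shows how the gradient enters the proposal and why the now-asymmetric kernel must remain in the acceptance ratio of Eq. (7). The step size `eps` and target are illustrative choices:

```python
import numpy as np

def mala(logf, grad_logf, x0, eps=0.9, n_samples=5000, seed=4):
    """Metropolis-adjusted Langevin (MALA) sketch: the proposal
    drifts along the gradient of the log posterior, and the
    asymmetric kernel T stays in the acceptance ratio."""
    rng = np.random.default_rng(seed)

    def log_q(x_to, x_from):
        # log T(x_to | x_from), up to a constant, for the Langevin proposal.
        mean = x_from + 0.5 * eps**2 * grad_logf(x_from)
        return -((x_to - mean) ** 2) / (2 * eps**2)

    x = x0
    out = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + 0.5 * eps**2 * grad_logf(x) + eps * rng.normal()
        log_alpha = (logf(x_prop) + log_q(x, x_prop)
                     - logf(x) - log_q(x_prop, x))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        out[i] = x
    return out

draws = mala(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0)
print(draws.mean(), draws.std())  # roughly 0 and 1
```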

#### **4.5 Hamiltonian Monte Carlo**

Hamiltonian Monte Carlo (HMC), also called *hybrid* Monte Carlo, can be regarded as a special case of Metropolis-Hastings that uses a proposal involving a special set of auxiliary variables and the path of a carefully devised differential equation [17]. Computing each proposal is complicated, and perhaps expensive, but this is often compensated by achieving a high acceptance rate even when the step size is large. This results in a sequence of samples that are less dependent, and hence fewer are needed to achieve high approximation accuracy.

HMC can be applied directly to *θ* if the posterior is continuous and its density is continuously differentiable. Let *p* represent a vector of auxiliary variables having the same size as *θ*, but independent of *θ*. Specify for *p* an easily sampled continuous distribution (often multivariate normal) with a continuously differentiable density proportional to *g*(*p*). Define

$$H(\theta, p) = -\ln f(\theta) - \ln g(p) \tag{9}$$

Then apply the Metropolis algorithm to sample (*θ*, *p*) jointly, with proposals generated as follows:

1. Directly generate *p* (independently of *θ*).

2. Starting from (*θ*<sup>old</sup>, *p*), follow the path (*θ*(*t*), *p*(*t*)) of the differential equation system defined by

$$\frac{d\theta\_k}{dt} = \frac{\partial H}{\partial p\_k} \qquad\qquad \frac{dp\_k}{dt} = -\frac{\partial H}{\partial \theta\_k} \tag{10}$$

for each element *θ<sub>k</sub>* of *θ* and corresponding element *p<sub>k</sub>* of *p*, up to a predetermined point *t<sub>L</sub>*.

3. Let (*θ*(*t<sub>L</sub>*), *p*(*t<sub>L</sub>*)) be the new proposed value.

In the Metropolis acceptance step, we use

$$\alpha = \exp\left\{ H(\theta^{\text{old}}, p) - H(\theta(t\_L), p(t\_L)) \right\} \tag{11}$$
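For a standard normal momentum, *g*(*p*) ∝ exp(−*p*²/2), Eq. (10) reduces to *dθ*/*dt* = *p* and *dp*/*dt* = ∇ ln *f*(*θ*). A minimal univariate sketch, using a simple symplectic discretization of the path and the acceptance rule of Eq. (11) (step size and path length are illustrative, not tuned):

```python
import numpy as np

def hmc(logf, grad_logf, x0, eps=0.1, L=20, n_samples=2000, seed=5):
    """Hamiltonian Monte Carlo sketch for a 1-d continuous target.
    H(x, p) = -log f(x) + p^2/2 with p ~ N(0, 1); Eq. (11)
    corrects the error of discretizing the path of Eq. (10)."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n_samples)
    for i in range(n_samples):
        p = rng.normal()                       # step 1: fresh momentum
        x_new, p_new = x, p
        # Step 2: L leapfrog-style steps of size eps along Eq. (10).
        p_new += 0.5 * eps * grad_logf(x_new)
        for _ in range(L):
            x_new += eps * p_new
            p_new += eps * grad_logf(x_new)
        p_new -= 0.5 * eps * grad_logf(x_new)
        # Step 3 and Eq. (11): accept with prob min(1, exp(H_old - H_new)).
        h_old = -logf(x) + 0.5 * p**2
        h_new = -logf(x_new) + 0.5 * p_new**2
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        out[i] = x
    return out

draws = hmc(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0)
print(draws.mean(), draws.std())  # roughly 0 and 1
```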

1. If *σ<sup>i</sup>* ¼ *σ <sup>j</sup>*, a bond *nij* ¼ 1 is created stochastically between neighbor sites *i* and *j* with a probability of 1 � ð Þ*<sup>e</sup>* �*K*. Otherwise, no bond will be present and the

2.Clusters are identified as sets of sites connected by bonds (otherwise isolated sites). If there is a connected path of bonds joining two sites, they are said to be in the same cluster. A new Potts value is assigned to each cluster, chosen with equal probability among 1 to *q*. The new Potts variable *σ*<sup>0</sup> is determined as the

With this approach, every state can be reached from any other state in one move with a non-zero probability. The two steps leave the probability distribution invari-

One variation of the SW method is generalizing it to arbitrary sampling probabilities defined on graph partitions, which is achieved through considering it as a Metropolis-Hastings algorithm and computing the acceptance probability of the proposed Monte Carlo move [20]. The new inference algorithm begins by calculating graph edge weights using local image features and then is followed by two iterative steps: *Cluster Graph*: cutting the edges probability using their weights, to form connected components; *Relabel Graph*: selecting one connected component, and simultaneously flipping the partition of all its vertices in a probabilistic way. Accordingly, instead of flipping a single vertex as in Gibbs sampler, the split, merge, and re-grouping of a chunk of the graph are realized with this strategy.

The generalized cluster sampling implements ergodic and reversible Markov chain jumps on graph partitions. It is applicable to arbitrary posterior probabilities or energy functions in the space of graphs. Examples in image analysis (e.g., image segmentation) demonstrate that the cluster Monte Carlo is more efficient than the classical Gibbs sampler and performs better than the graph cuts and belief propagation.

In the statistics community, the first development of practical general-purpose software for MCMC was the BUGS (Bayesian inference using Gibbs sampling) project, starting in 1989. The original implementation, designed for the Windows operating system, was WinBUGS, which included a graphical interface. When development of WinBUGS ended, the OpenBUGS project was created as a successor. This software uses a special model specification language, the "BUGS language," that is remarkably flexible. Usually, the analyst only needs to specify the model in the BUGS language and then leave the construction of appropriate samplers to the software. The basic structure is a Gibbs sampler, but the pieces may be

Inspired by BUGS, a parallel effort called JAGS (Just another Gibbs sampler) was developed. Like BUGS, it is based on Gibbs sampling and, in principle, requires the analyst to specify only a model (written in a variant of the BUGS language), leaving the construction of samplers to an automated engine. It tends to be faster than OpenBUGS, is more actively developed, and features better integration with the R language. It also incorporates efficient slice samplers in some of its steps. JAGS is

entirely open-source and has versions for many operating systems.

PyMC's development was an effort to generalize the process of building Metropolis-Hastings samplers, making MCMC more accessible to non-statisticians. It is now a Python package helping users define stochastic models and construct

ant and the method generates an equilibrium distribution Eq. (12).

bond variable is set to *nij* ¼ 0.

*A Brief Tour of Bayesian Sampling Methods DOI: http://dx.doi.org/10.5772/intechopen.91451*

value of the cluster it belongs to.

**6. Software implementation**

sampled using specialized methods.

**25**

Actually, if the path is followed exactly, the acceptance probability will always be 1, since value of *H* is constant along any differential equation path [17]. The Metropolis step is needed only because, in practice, a numerical approximation is used to solve the differential equation.

To follow the differential equation numerically, we use the *leapfrog* method [17]. This method has a number of advantages over competing methods, including stability (better preservation of *H*) and volume preservation, which makes Metropolis valid (i.e., makes the joint transition kernel defined by this process symmetric).

If *θ* is not entirely continuous, HMC may still be applicable to the continuous pieces of *θ*, for example, when used as part of a Gibbs sampler. Also, if the posterior density is nonzero only over a certain region, HMC can be adapted for that situation. For example, it is possible to place lower and upper bounds on the elements of *θ* [17].

The differential equation path of an HMC proposal has a tendency to loop back on itself, making the efficiency sensitive to the length of the path (i.e., the choice of *tL*). The no-U-turn sampler (NUTS) [18] is a modification of HMC designed to avoid this behavior. Essentially, it allows for adaptive choice of the leapfrog algorithm's step size and number of steps.

In theory, the computational cost of HMC scales better with the dimension of *θ* than does the computational cost of ordinary (random-walk) Metropolis methods. An extensive theoretical comparison can be found in [17].

#### **5. Cluster sampling and variation**

The first non-local or cluster sampling for Monte Carlo simulation for large systems is the *Swendsen-Wang* (SW) *algorithm* [19]. It was designed for the Ising and Potts models and was later generalized to other systems. The main component was the random cluster model, represented via percolation models of connecting bonds. Let us start with a spin configuration f g*σ* and generate a percolation configuration based on the spin configuration. Next, the old spin configuration is forgotten and a new spin configuration *σ*<sup>0</sup> f g is generated according to percolation. The rule for the process is defined in order for the detailed balance condition to be satisfied. In this way, the transition leaves the equilibrium probability invariant.

Consider a Potts model with probability distribution

$$\log(\sigma) = \frac{1}{Z} \exp\left(K \sum\_{} \left(\delta\_{\sigma i, \sigma j} - \mathbf{1}\right)\right) \tag{12}$$

#### *A Brief Tour of Bayesian Sampling Methods DOI: http://dx.doi.org/10.5772/intechopen.91451*

3. Let $(\theta(t_L), p(t_L))$ be the new proposed value. In the Metropolis acceptance step, we use

$$
\alpha = \exp\left( H(\theta_{\mathrm{old}}, p) - H(\theta(t_L), p(t_L)) \right). \tag{11}
$$

Actually, if the path is followed exactly, the acceptance probability will always be 1, since the value of $H$ is constant along any differential equation path [17]. The Metropolis step is needed only because, in practice, a numerical approximation is used to solve the differential equation.

To follow the differential equation numerically, we use the *leapfrog* method [17]. This method has a number of advantages over competing methods, including stability (better preservation of $H$) and volume preservation, which makes Metropolis valid (i.e., makes the joint transition kernel defined by this process symmetric). If $\theta$ is not entirely continuous, HMC may still be applicable to the continuous pieces of $\theta$, for example, when used as part of a Gibbs sampler. Also, if the posterior density is nonzero only over a certain region, HMC can be adapted for that situation. For example, it is possible to place lower and upper bounds on the elements of $\theta$ [17].

The differential equation path of an HMC proposal has a tendency to loop back on itself, making the efficiency sensitive to the length of the path (i.e., the choice of $t_L$). The no-U-turn sampler (NUTS) [18] is a modification of HMC designed to avoid this behavior. Essentially, it allows for adaptive choice of the leapfrog algorithm's step size and number of steps.

In theory, the computational cost of HMC scales better with the dimension of $\theta$ than does the computational cost of ordinary (random-walk) Metropolis methods. An extensive theoretical comparison can be found in [17].

#### **5. Cluster sampling and variation**

The first non-local or cluster sampling method for Monte Carlo simulation of large systems is the *Swendsen-Wang* (SW) *algorithm* [19]. It was designed for the Ising and Potts models and was later generalized to other systems. The main component is the random cluster model, represented via percolation models of connecting bonds. We start with a spin configuration $\{\sigma\}$ and generate a percolation configuration based on it. Next, the old spin configuration is forgotten and a new spin configuration $\{\sigma'\}$ is generated according to the percolation. The rule for this process is defined so that the detailed balance condition is satisfied. In this way, the transition leaves the equilibrium probability invariant.

Consider a Potts model with probability distribution

$$
g(\sigma) = \frac{1}{Z} \exp\left( K \sum_{\langle i, j \rangle} \left( \delta_{\sigma_i, \sigma_j} - 1 \right) \right), \tag{12}
$$

where $K$ is the coupling strength; the spins take on the values $1, 2, \ldots, q$, i.e., $\sigma_i = 1, 2, \ldots, q$; $\delta_{\sigma_i, \sigma_j}$ is the Kronecker delta, which equals one whenever $\sigma_i = \sigma_j$ and zero otherwise; the summation goes over nearest-neighbor pairs; and $Z$ is the partition function.

A SW Monte Carlo move is based on the following two steps: the first step transforms a Potts configuration to a bond configuration, and the second transforms back from bond to a new Potts configuration.
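In code, one SW move can be sketched as follows. This is a minimal illustration for the Potts model of Eq. (12); `sw_move` and its union-find helper are our own illustrative names, not from [19].

```python
import math
import random

def sw_move(spins, neighbors, K, q, rng=random):
    """One Swendsen-Wang move for a q-state Potts model.

    spins: list of spin values in 1..q; neighbors: nearest-neighbor pairs (i, j).
    """
    n = len(spins)
    parent = list(range(n))  # union-find structure for percolation clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 1 (Potts -> bonds): place a bond between equal nearest-neighbor
    # spins with probability 1 - exp(-K); bonds define percolation clusters.
    p_bond = 1.0 - math.exp(-K)
    for i, j in neighbors:
        if spins[i] == spins[j] and rng.random() < p_bond:
            parent[find(i)] = find(j)

    # Step 2 (bonds -> Potts): forget the old spins; each cluster is
    # assigned a fresh spin value drawn uniformly from {1, ..., q}.
    new_spin = {}
    for i in range(n):
        r = find(i)
        if r not in new_spin:
            new_spin[r] = rng.randrange(1, q + 1)
        spins[i] = new_spin[r]
    return spins
```

Because whole clusters are flipped at once, a single move can change a large fraction of the lattice, which is what makes the method effective near criticality.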


With this approach, every state can be reached from any other state in one move with non-zero probability. The two steps leave the probability distribution invariant, and the method generates draws from the equilibrium distribution of Eq. (12).

One variation of the SW method generalizes it to arbitrary sampling probabilities defined on graph partitions, which is achieved by viewing it as a Metropolis-Hastings algorithm and computing the acceptance probability of the proposed Monte Carlo move [20]. The new inference algorithm begins by computing graph edge weights from local image features and then iterates two steps: *Cluster Graph*: probabilistically cutting edges according to their weights, to form connected components; *Relabel Graph*: selecting one connected component and simultaneously flipping the partition label of all its vertices in a probabilistic way. Accordingly, instead of flipping a single vertex as in the Gibbs sampler, this strategy realizes the split, merge, and regrouping of a whole chunk of the graph.
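As a rough sketch of the two iterative steps (our simplification: the edge "on" probabilities are taken directly as the given weights, and the Metropolis-Hastings acceptance step of [20] is omitted):

```python
import random

def cluster_relabel_step(n, edges, weights, labels, q, rng=random):
    """One simplified Cluster Graph / Relabel Graph iteration.

    edges: list of vertex pairs (i, j); weights: per-edge 'on' probabilities;
    labels: current partition label (0..q-1) of each vertex.
    """
    parent = list(range(n))  # union-find over the kept edges

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Cluster Graph: keep an edge (probabilistically, by its weight) only if
    # its endpoints currently share a label; kept edges form components.
    for (i, j), w in zip(edges, weights):
        if labels[i] == labels[j] and rng.random() < w:
            parent[find(i)] = find(j)

    # Relabel Graph: pick one connected component and flip the label of all
    # its vertices simultaneously.
    root = find(rng.randrange(n))
    new_label = rng.randrange(q)
    for i in range(n):
        if find(i) == root:
            labels[i] = new_label
    return labels
```

A full implementation would also evaluate the acceptance probability of each proposed relabeling, as described in [20].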

The generalized cluster sampling implements ergodic and reversible Markov chain jumps on graph partitions. It is applicable to arbitrary posterior probabilities or energy functions in the space of graphs. Examples in image analysis (e.g., image segmentation) demonstrate that the cluster Monte Carlo is more efficient than the classical Gibbs sampler and performs better than graph cuts and belief propagation.

#### **6. Software implementation**

In the statistics community, the first development of practical general-purpose software for MCMC was the BUGS (Bayesian inference using Gibbs sampling) project, starting in 1989. The original implementation, designed for the Windows operating system, was WinBUGS, which included a graphical interface. When development of WinBUGS ended, the OpenBUGS project was created as a successor. This software uses a special model specification language, the "BUGS language," that is remarkably flexible. Usually, the analyst only needs to specify the model in the BUGS language and then leave the construction of appropriate samplers to the software. The basic structure is a Gibbs sampler, but the pieces may be sampled using specialized methods.
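For illustration, a simple normal model with unknown mean and precision might be written in the BUGS language roughly as follows (the variable names are ours, and the vague priors shown are typical defaults rather than recommendations):

```
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)     # likelihood, parameterized by precision tau
  }
  mu ~ dnorm(0, 1.0E-6)       # vague prior on the mean
  tau ~ dgamma(0.001, 0.001)  # vague prior on the precision
}
```

Given such a specification and data for `y` and `N`, the software chooses and runs the samplers itself.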

Inspired by BUGS, a parallel effort called JAGS (Just another Gibbs sampler) was developed. Like BUGS, it is based on Gibbs sampling and, in principle, requires the analyst to specify only a model (written in a variant of the BUGS language), leaving the construction of samplers to an automated engine. It tends to be faster than OpenBUGS, is more actively developed, and features better integration with the R language. It also incorporates efficient slice samplers in some of its steps. JAGS is entirely open-source and has versions for many operating systems.

PyMC's development was an effort to generalize the process of building Metropolis-Hastings samplers, making MCMC more accessible to non-statisticians. It is now a Python package that helps users define stochastic models and construct Bayesian posterior samples. A large number of problems are suitable for PyMC due to its flexibility and extensibility. Key features include the ability to fit Bayesian statistical models via MCMC and other algorithms, a large set of well-documented statistical distributions, a module for modeling Gaussian processes, and sampling loops that can be manipulated manually.
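What these packages automate can be seen in miniature in a hand-rolled random-walk Metropolis sampler (a generic sketch, not PyMC code; the target below is our own toy example):

```python
import math
import random

def random_walk_metropolis(log_post, x0, n_samples, scale=1.0, rng=random):
    """Random-walk Metropolis: the kind of sampler that BUGS, JAGS, and
    PyMC construct automatically from a model specification."""
    chain, x, lp = [], x0, log_post(x0)
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, scale)  # symmetric proposal
        lp_y = log_post(y)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_y - lp)):
            x, lp = y, lp_y
        chain.append(x)
    return chain

# Example: standard normal "posterior" (log density up to a constant).
rng = random.Random(0)
chain = random_walk_metropolis(lambda t: -0.5 * t * t, x0=0.0,
                               n_samples=5000, scale=2.0, rng=rng)
```

The automation matters because, in realistic models, writing and tuning such update steps by hand for every parameter quickly becomes burdensome.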


A more recently introduced tool (since 2012) is the language Stan, which remains under active development (as of this writing). Stan allows model specification in an inherently more flexible way than BUGS or its variants. Software for compiling Stan includes the option of MCMC using HMC and NUTS, so it tends to produce more nearly independent samples than software based on Gibbs sampling. (There are also options for inference not based on sampling, such as variational methods.) The Stan software integrates with R, Python, MATLAB, Julia, and Stata.
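For illustration, a simple normal model with unknown mean and scale might be specified in Stan roughly as follows (an illustrative fragment with our own variable names, not taken from the chapter):

```
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);  // vectorized likelihood; default flat priors
}
```

The explicit `data` and `parameters` blocks, and constraints such as `<lower=0>`, are part of what makes Stan's specification language more flexible than the BUGS family.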

#### **7. Conclusion**

This chapter has merely touched upon the important concepts and methods of modern MCMC. Routine-use software automating the construction of samplers is also introduced. There are many good references that provide more detailed theoretical or practical treatment and further extensions, based on which future research can be developed.

#### **Author details**

Michelle Y. Wang1,2,3\* and Trevor Park1

1 Department of Statistics, University of Illinois at Urbana-Champaign, USA

2 Department of Psychology, University of Illinois at Urbana-Champaign, USA

3 Department of Bioengineering, University of Illinois at Urbana-Champaign, USA

\*Address all correspondence to: ymw@illinois.edu

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**


[1] Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association. 1986; **81**(393):82-86

[2] Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine Learning. 1999;**37**: 183-233

[3] Geyer CJ. Introduction to Markov chain Monte Carlo. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Boca Raton: CRC Press; 2011. pp. 3-48

[4] Tierney L. Markov chains for exploring posterior distributions. Ann. Statist. 1994;**22**(4):1701-1728

[5] Gelman A, Shirley K. Inference from simulations and monitoring convergence. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Boca Raton: CRC Press; 2011. pp. 163-174

[6] Flegal JM, Jones GL. Implementing MCMC: Estimating with confidence. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Boca Raton: CRC Press; 2011. pp. 175-197

[7] Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton: CRC Press; 2013. p. 667

[8] Roberts GO, Sahu SK. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B: Methodological. 1997; **59**(2):291-317

[9] Van Dyk DA, Park T. Partially collapsed Gibbs samplers: Theory and methods. Journal of the American Statistical Association. 2008;**103**(482): 790-796

[10] Van Dyk DA, Jiao X. Metropolis-Hastings within partially collapsed Gibbs samplers. Journal of Computational and Graphical Statistics. 2015;**24**(2):301-327

[11] Hobert JP. Stability relationships among the Gibbs sampler and its subchains. Journal of Computational and Graphical Statistics. 2001;**10**(2): 185-205

[12] Liu JS, Wu YN. Parameter expansion for data augmentation. Journal of the American Statistical Association. 1999;**94**:1264-1274

[13] Neal RM. Slice sampling. Ann. Statist. 2003;**31**(3):705-741

[14] Damien P, Wakefield J, Walker S. Gibbs sampling for Bayesian nonconjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 1999;**61**(2): 331-344

[15] Tierney L. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab. 1998;**8**(1):1-9

[16] Roberts GO, Rosenthal JS. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science. 2001; **16**(4):351-367

[17] Neal RM. MCMC using Hamiltonian dynamics. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Boca Raton: Chapman & Hall; 2011. pp. 113-162

[18] Hoffman MD, Gelman A. The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014;**15**:1593-1623


[19] Swendsen RH, Wang JS. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters. 1987;**58**(2):86-88

[20] Barbu A, Zhu SC. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;**27**(8): 1239-1253

#### **Chapter 3**


### A Review on the Exact Monte Carlo Simulation

*Hongsheng Dai*

#### **Abstract**

Perfect Monte Carlo sampling refers to sampling random realizations exactly from the target distributions (without any statistical error). Although many different methods have been developed and various applications implemented in the area of perfect Monte Carlo sampling, researchers most often use the term to refer to coupling from the past (CFTP), which can remove the statistical errors of the Monte Carlo samples generated by Markov chain Monte Carlo (MCMC) algorithms. This paper provides a brief review of recent developments and applications in CFTP and other perfect Monte Carlo sampling methods.

**Keywords:** coupling from the past, diffusion, Monte Carlo, perfect sampling

#### **1. Introduction**

In the past 30 years, substantial progress has been made in popularizing Bayesian methods for the statistical analysis of complex data sets. An important driving force has been the availability of different types of Bayesian computational methods, such as Markov chain Monte Carlo (MCMC), sequential Monte Carlo (SMC), approximate Bayesian computation (ABC) and so on. For many practical examples, these computational methods can provide rapid and reliable approximations to the posterior distributions for unknown parameters.

The basic idea that lies behind these methods is to obtain Monte Carlo samples from the distribution of interest, in particular the posterior distribution. In Bayesian analysis of complex statistical models, the calculation of posterior normalizing constants and the evaluation of posterior estimates are typically infeasible either analytically or by numerical quadrature. Monte Carlo simulation provides an alternative. One of the most popular Bayesian computational methods is MCMC, which is based on the idea of constructing a Markov chain with the desired distribution as its stationary distribution.

By running a Markov chain, MCMC methods generate statistically dependent and approximate realizations from the limiting (target) distribution. A potential weakness of these methods is that the simulated trajectory of a Markov chain will depend on its initial state. A common practical recommendation is to ignore the early stages, the so-called *burn-in* phase, before collecting realizations of the state of the chain. How to choose the length of the *burn-in* phase is an active research area. Many methods have been proposed for *convergence diagnostics*; [1] gave a comparative review. Rigorous application of diagnostic methods requires either substantial empirical analysis of the Markov chain or complex mathematical analysis.

In practice, judgments about convergence are often made by visual inspection of the realized chain or the application of simple rules of thumb.

*Mg x*ð Þ, where *f x*ð Þ≤ *Mg x*ð Þ and *g x*ð Þ are a probability density function from which samples can be readily simulated. The basic rejection sampling algorithm is as

Sample *x* from *g x*ð Þ and *U* from Unif 0, 1 ½ �. 01

*Mg x*ð Þ, <sup>02</sup> accept *x* as a realisation of *f x*ð Þ and stop; 03 else 04 reject the value of *x* and go back to step 01. 05

Many other perfect sampling methods are actually equivalent to rejection sam-

The efficiency of rejection sampling depends on the acceptance probability, which is 1*=M*. To perform rejection sampling efficiently, it is very important to find hat functions which provides higher acceptance probabilities. In other words, we shall choose *M* as small as possible [17]; however for many complicated problems, there is no easy way to find *M* small enough to guarantee high acceptance proba-

log *h*ð Þ *λx* þ ð Þ 1 � *λ y* ≥*λ* log *h x*ð Þþ ð Þ 1 � *λ* log *h y* ð Þ,

for all *x*, *y* and *λ*∈½ � 0, 1 . For the special class of log-concave densities, Gilks and wild [3] developed the adaptive rejection sampling (ARS) method. The method constructs an envelope function for the logarithm of the target density, *f x*ð Þ, by using tangents to log *f x*ð Þ at an increasing number, *n*, of points. The envelope function *un*ð Þ *x* is the piecewise linear *upper hull* formed from the tangents. Note that, the envelope can be easily constructed due to the concavity of log *f x*ð Þ. The method also constructs a squeeze function *ln*ð Þ *x* which is formed from the chords of the tangent points. The sampling steps of the ARS algorithm are as

Initialise *un*ð Þ *x* and *ln*ð Þ *x* by using tangents at several points. 01 Sample *<sup>x</sup>*<sup>∗</sup> from density <sup>∝</sup> exp ð Þ *un*ð Þ *<sup>x</sup>* and *<sup>U</sup>* � Unif 0, 1 ð Þ <sup>02</sup> If *<sup>U</sup>* <sup>≤</sup> exp *ln <sup>x</sup>*<sup>∗</sup> ð Þ� *un <sup>x</sup>*<sup>∗</sup> ð Þ ð Þ , Output *<sup>x</sup>*<sup>∗</sup> ; <sup>03</sup> else if *<sup>U</sup>* <sup>≤</sup>*f x*<sup>∗</sup> ð Þ*<sup>=</sup>* exp *un <sup>x</sup>*<sup>∗</sup> ð Þ ð Þ , <sup>04</sup> Output *<sup>x</sup>*<sup>∗</sup> ; Update ð Þ *un*, *ln* to ð Þ *un*þ1, *ln*þ<sup>1</sup> using *<sup>x</sup>*<sup>∗</sup> ; <sup>05</sup> Goto 02 06

The ARS algorithm is adaptive and the sampling density is modified whenever *f x*<sup>∗</sup> ð Þ is evaluated. In this way, the method becomes more efficient as the sampling

continues. Leydold [18] extends ARS to log-concave multivariate densities.

pling. For example, ratio-of-uniform (RoU) method [16] may have to be

bility. A number of sophisticated rejection sampling methods have been suggested for dealing with complex Bayesian posterior densities, which we

follows:

If *U* ≤ *f x*ð Þ

discuss below.

follows.

**31**

**2.1 Log-concave densities**

**Algorithm 2.1 (Basic rejection sampling)**

*A Review on the Exact Monte Carlo Simulation DOI: http://dx.doi.org/10.5772/intechopen.88619*

implemented via a rejection sampling approach.

A function *h x*ð Þ is called log-concave if

**Algorithm 2.2 (Adaptive rejection sampling).** Outputs a stream of perfect samples from *f x*ð Þ.

Concerns about the *quality* of the sampled realizations of the simulated Markov chains have motivated the search for Monte Carlo methods that can be guaranteed to provide samples from the target distribution. This is usually referred to as *perfect sampling* or *exact sampling*. In some cases, for example, the multivariate normal, perfect samples are readily available. For more complicated distributions, perfect sampling can be achieved, in principle, by the *rejection method*. This involves sampling from a density that bounds a suitable multiple of the target density, followed by acceptance or rejection of the sampled value. The difficulty here is in finding a bounding density that is amenable to rapid sampling while at the same time providing sample values that are accepted with high probability. In general this is a challenging task, although efficient rejection sampling methods have been developed for the special class of log-concave densities; see, for example, [2, 3].

A surprising breakthrough in the search for perfect sampling methods was made by [4]. The method is known as *coupling from the past* (CFTP). This is a sophisticated MCMC-based algorithm that produces realizations exactly from the target distribution. CFTP transfers the difficulty of running the Markov chain for extensive periods (to ensure convergence) to the difficulty of establishing whether a potentially large number of coupled Markov chains have coalesced. An efficient CFTP algorithm depends on finding an appropriate Markov chain construction that will ensure fast coalescence. There have been a few further novel theoretical developments following the breakthrough of CFTP, including [2, 5, 6]. Since then, perfect sampling methods have attracted great attention in various Bayesian computational problems and applied probability areas.

Apart from coupling from the past, many other perfect sampling methods were proposed for specific problems, for example, perfect sampling for random spanning trees [7, 8] and path-space rejection sampler for diffusion processes [9–11]. Very recently, a type of divide-and-conquer method has been developed in [12, 13]. These methods use the technique for the exact simulation of diffusions and samples from simple sub-densities to obtain perfect samples from the target distribution.

Perfect samples are useful in Bayesian applications either as a complete replacement for MCMC-generated values or as a source of initial values that will guarantee that a conventional MCMC algorithm runs in equilibrium. Perfect samples can also be used as a means of quality control in judging a proposed MCMC implementation when there are questions about the speed of convergence of the MCMC algorithm or whether the chain is capable of exploring all parts of the sample space. Of course, when perfect samples can be obtained speedily, they will be preferred to MCMC values, since they eliminate such doubts. In addition, sophisticated perfect sampling methodology often motivates efficient approximate algorithms and computational techniques. For example, [14] uses regenerated Markov chains to obtain simple standard error estimates for importance sampling under MCMC context. The condition required there will allow us to carry out perfect sampling via multigamma coupling approach [15].

This paper will present a brief review for perfect Monte Carlo sampling and explain the advantages and drawbacks of different types of methods. Section 2 will present rejection sampling techniques, and then CFTP will be covered in Section 3. Recent divide-and-conquer methods will be reviewed in Section 4. The paper ends with a discussion in Section 5.

#### **2. Rejection sampling techniques**

Rejection sampling, also known as 'acceptance-rejection sampling', generates realizations from a target probability density function *f x*ð Þ by using a hat function

#### *A Review on the Exact Monte Carlo Simulation DOI: http://dx.doi.org/10.5772/intechopen.88619*

In practice, judgments about convergence are often made by visual inspection of the

Concerns about the *quality* of the sampled realizations of the simulated Markov chains have motivated the search for Monte Carlo methods that can be guaranteed to provide samples from the target distribution. This is usually referred to as *perfect sampling* or *exact sampling*. In some cases, for example, the multivariate normal, perfect samples are readily available. For more complicated distributions, perfect sampling can be achieved, in principle, by the *rejection method*. This involves sampling from a density that bounds a suitable multiple of the target density, followed by acceptance or rejection of the sampled value. The difficulty here is in finding a bounding density that is amenable to rapid sampling while at the same time providing sample values that are accepted with high probability. In general this is a challenging task, although efficient rejection sampling methods have been developed for the special class of log-concave densities; see, for example, [2, 3].

A surprising breakthrough in the search for perfect sampling methods was made by [4]. The method is known as *coupling from the past* (CFTP). This is a sophisticated MCMC-based algorithm that produces realizations exactly from the target distribution. CFTP transfers the difficulty of running the Markov chain for extensive periods (to ensure convergence) to the difficulty of establishing whether a potentially large number of coupled Markov chains have coalesced. An efficient CFTP algorithm depends on finding an appropriate Markov chain construction that will ensure fast coalescence. There have been a few further novel theoretical developments following the breakthrough of CFTP, including [2, 5, 6]. Since then, perfect sampling methods have attracted great attention in various Bayesian com-

Apart from coupling from the past, many other perfect sampling methods were proposed for specific problems, for example, perfect sampling for random spanning trees [7, 8] and path-space rejection sampler for diffusion processes [9–11]. Very recently, a type of divide-and-conquer method has been developed in [12, 13]. These methods use the technique for the exact simulation of diffusions and samples from simple sub-densities to obtain perfect samples from the target distribution. Perfect samples are useful in Bayesian applications either as a complete replacement for MCMC-generated values or as a source of initial values that will guarantee that a conventional MCMC algorithm runs in equilibrium. Perfect samples can also be used as a means of quality control in judging a proposed MCMC implementation when there are questions about the speed of convergence of the MCMC algorithm or whether the chain is capable of exploring all parts of the sample space. Of course, when perfect samples can be obtained speedily, they will be preferred to MCMC values, since they eliminate such doubts. In addition, sophisticated perfect sampling methodology often motivates efficient approximate algorithms and computational techniques. For example, [14] uses regenerated Markov chains to obtain simple standard error estimates for importance sampling under MCMC context. The condition required there will allow us

to carry out perfect sampling via multigamma coupling approach [15].

This paper will present a brief review of perfect Monte Carlo sampling and explain the advantages and drawbacks of different types of methods. Section 2 will present rejection sampling techniques, and then CFTP will be covered in Section 3. Recent divide-and-conquer methods will be reviewed in Section 4. The paper ends with a discussion in Section 5.

**2. Rejection sampling techniques**

Rejection sampling, also known as 'acceptance-rejection sampling', generates realizations from a target probability density function $f(x)$ by using a hat function $Mg(x)$, where $f(x) \le Mg(x)$ and $g(x)$ is a probability density function from which samples can be readily simulated. The basic rejection sampling algorithm is as follows: draw $X \sim g$ and $U \sim \text{Unif}[0, 1]$; if $U \le f(X)/(Mg(X))$, accept $X$ as a draw from $f$; otherwise repeat.


Many other perfect sampling methods are actually equivalent to rejection sampling. For example, the ratio-of-uniforms (RoU) method [16] may have to be implemented via a rejection sampling approach.

The efficiency of rejection sampling depends on the acceptance probability, which is $1/M$. To perform rejection sampling efficiently, it is very important to find hat functions which provide high acceptance probabilities. In other words, we shall choose $M$ as small as possible [17]; however, for many complicated problems there is no easy way to find an $M$ small enough to guarantee a high acceptance probability. A number of sophisticated rejection sampling methods have been suggested for dealing with complex Bayesian posterior densities, which we discuss below.
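To make the accept-reject loop concrete, here is a minimal sketch; the target, proposal, and bound below are illustrative choices, not from the chapter:

```python
import random

def rejection_sample(f, g_sample, g_pdf, M):
    """Draw one realization from f using the hat function M*g (requires f(x) <= M*g(x))."""
    while True:
        x = g_sample()                      # propose from g
        u = random.random()                 # U ~ Unif(0, 1)
        if u <= f(x) / (M * g_pdf(x)):      # accept with probability f(x) / (M g(x))
            return x

# Example: target Beta(2, 2) density f(x) = 6x(1-x) on [0, 1],
# proposal g = Unif(0, 1), bound M = 1.5 since max f = 1.5.
f = lambda x: 6.0 * x * (1.0 - x)
random.seed(1)
draws = [rejection_sample(f, random.random, lambda x: 1.0, 1.5) for _ in range(20000)]
```

With these choices the acceptance probability is $1/M = 2/3$, illustrating why a small $M$ matters.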

#### **2.1 Log-concave densities**

A function $h(x)$ is called log-concave if

$$
\log h(\lambda x + (1 - \lambda)y) \ge \lambda \log h(x) + (1 - \lambda)\log h(y),
$$

for all $x$, $y$ and $\lambda \in [0, 1]$. For the special class of log-concave densities, Gilks and Wild [3] developed the adaptive rejection sampling (ARS) method. The method constructs an envelope function for the logarithm of the target density, $f(x)$, by using tangents to $\log f(x)$ at an increasing number, $n$, of points. The envelope function $u_n(x)$ is the piecewise linear *upper hull* formed from the tangents. Note that the envelope can be easily constructed due to the concavity of $\log f(x)$. The method also constructs a squeeze function $l_n(x)$ which is formed from the chords of the tangent points. The sampling steps of the ARS algorithm are as follows.


The ARS algorithm is adaptive, and the sampling density is modified whenever $f(x^*)$ is evaluated. In this way, the method becomes more efficient as the sampling continues. Leydold [18] extends ARS to log-concave multivariate densities.

Leydold's algorithm is based on decomposing the domain of the density into cones and then computing tangent hyperplanes for these cones. Generic computer code for sampling from a multivariate log-concave density is available on the author's website; it is only necessary to code a subroutine for the target density. Leydold's algorithm works well for simple low-dimensional densities. The drawback of the ARS algorithm is that it only works for log-concave densities, which form a very small class of posteriors in practice. It is also computationally very challenging to implement the ARS algorithm for high-dimensional densities [19].
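The tangent-envelope construction behind ARS can be illustrated with a deliberately simplified, non-adaptive sketch: tangents to the standard normal log-density are fixed at $x = \pm 1$ (rather than refined adaptively as in ARS proper), giving the piecewise-linear upper hull $u(x) = 1/2 - |x|$, whose exponential is a Laplace envelope:

```python
import math, random

def log_f(x):            # log of the target (standard normal, up to a constant)
    return -0.5 * x * x

def envelope_log(x):     # upper hull from tangents at p = -1 and p = 1:
    return 0.5 - abs(x)  # tangent at p is log_f(p) - p*(x - p); the hull is 1/2 - |x|

def sample_envelope():   # exp(envelope_log) is proportional to a Laplace(0, 1) density
    x = -math.log(1.0 - random.random())          # Exp(1) magnitude
    return x if random.random() < 0.5 else -x

def ars_style_draw():
    while True:
        x = sample_envelope()
        # accept with probability f(x) / exp(u(x)) = exp(-(|x| - 1)^2 / 2) <= 1
        if math.log(1.0 - random.random()) <= log_f(x) - envelope_log(x):
            return x

random.seed(2)
draws = [ars_style_draw() for _ in range(20000)]
```

The full ARS algorithm adds each rejected abscissa as a new tangent point, so the hull tightens and the acceptance rate improves as sampling continues.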

#### **2.2 Fill's rejection sampling algorithm**

We consider a discrete Markov chain with transition probability $\mathbf{P}(x, y)$ and stationary distribution $\pi(x)$, $x \in S$. Let $\tilde{\mathbf{P}}(x, y) = \pi(y)\mathbf{P}(y, x)/\pi(x)$ be the transition probability for the time-reversed chain. Suppose that there is a partial ordering on the states $S$, denoted by $x \preccurlyeq y$. Let $\hat{0}$ and $\hat{1}$ be unique extremal points of the partial ordering.

To demonstrate the algorithm given by [2], we will assume that there are update functions $\phi$ and $\tilde{\phi}$, both mapping $S \times [0, 1]$ to $S$, such that

$$\mathbf{P}(\mathbf{x}, \mathbf{y}) = P(\phi(\mathbf{x}, U) = \mathbf{y}), \tag{1}$$


$$
\tilde{\mathbf{P}}(\mathbf{x}, \mathbf{y}) = P(\tilde{\phi}(\mathbf{x}, U) = \mathbf{y}),
\tag{2}
$$

where $U \sim \text{Unif}[0, 1]$ and

$$
x \preccurlyeq y \Rightarrow \tilde{\phi}(x, u) \preccurlyeq \tilde{\phi}(y, u) \quad a.e. \quad u \in [0, 1].
$$

The algorithm is as follows:

**Algorithm 2.3 (Fill's algorithm)**

1. Choose an integer $t > 0$. Fix $x_0 = \hat{0}$ and $y_0 = \hat{1}$.

2. Generate $x_k = \phi(x_{k-1}, U_k)$, $k = 1, \cdots, t$, where $\{U_k\}$ are i.i.d. $\text{Unif}[0, 1]$.

3. Generate $\tilde{U}_k$ from the conditional distribution of $U$ given $\tilde{\phi}(x_{t-k+1}, U) = x_{t-k}$, $k = 1, \cdots, t$.

4. Generate $y_k = \tilde{\phi}(y_{k-1}, \tilde{U}_k)$, $k = 1, \cdots, t$.

5. If $y_t = x_0$ then accept $x_t$; else double $t$ and repeat from step 2.

In Algorithm 2.3 (step 2), $z := x_t$ is sampled from $\mathbf{P}^t(\hat{0}, \cdot)$. If we find an upper bound $M$ for $\pi(z)/\mathbf{P}^t(\hat{0}, z)$, then we can use rejection sampling. Fill [2] finds a bounding constant given by $M = \pi(\hat{0})/\tilde{\mathbf{P}}^t(\hat{1}, \hat{0})$ and proves that steps 3 to 5 of Algorithm 2.3 accept $z$ with probability $\pi(z)/[M\,\mathbf{P}^t(\hat{0}, z)]$. The output of this algorithm is indeed a perfect sample from $\pi$.

From Algorithm 2.3, we can see that rejection sampling can still be feasible even if we do not have a closed form for the hat function. The first limitation of Algorithm 2.3 is that it works only if the time-reversed chain is monotone, although [5] has extended the algorithm theoretically to general chains. The second limitation is that step 3 of Algorithm 2.3 is difficult to perform [2]. For these reasons, Fill's algorithm is not practical for realistic problems.

*A Review on the Exact Monte Carlo Simulation DOI: http://dx.doi.org/10.5772/intechopen.88619*

#### **3. Coupling from the past**


Coupling from the past was introduced in the landmark paper [4], which showed how to provide perfect samples from the limiting distribution of a Markov chain.

#### **3.1 Basic CFTP algorithms**

Let $\{X_t\}$ be an ergodic Markov chain with state space $\mathcal{X} = \{1, \cdots, n\}$, where the probability of going from $i$ to $j$ is $p_{ij}$ and the stationary distribution is $\pi$. Suppose we design an updating function $\phi(\cdot, U)$, which satisfies $P[\phi(i, U) = j] = p_{ij}$, where $\phi$ is a deterministic function and $U$ is a random variable. To simulate the next state $Y$ of the Markov chain, currently in state $i$, we draw a random variable $U$ and let $Y = \phi(i, U)$.

Let $f_t(i) = \phi(i, U_t)$, and define the composition

$$F_{t_1}^{t_2} = f_{t_2-1} \circ f_{t_2-2} \circ \cdots \circ f_{t_1+1} \circ f_{t_1}, \tag{3}$$

for $t_1 < t_2$.

Proposition 3.1 [4] Suppose there exists a time $t = -T$, the backward coupling time, such that chains starting from any state in $\mathcal{X} = \{1, \cdots, n\}$ at time $t = -T$, with the same sequence $\{U_t\}$, $t = -T, \cdots, -1$, arrive at the same state $X_0^*$. Then it must follow that $X_0^*$ comes from $\pi$.

This proposition is easy to prove. If we run an ergodic Markov chain from time $t = -\infty$ with the sequence $\{U_t\}$, $t = -T, \cdots, -1$ after $-T$, the Markov chain will arrive at $X_0^*$. Then $X_0^*$ comes exactly from $\pi$, since it is collected at time 0 and the Markov chain started from $-\infty$. The algorithm is as follows:

**Algorithm 3.1 (Basic CFTP)**

    t = 0
    repeat
        t = t - 1
        generate U_t
        F_t^0 = F_{t+1}^0 ∘ φ(·, U_t)
    until F_t^0(·) is a constant
    return F_t^0(·)


Propp and Wilson [4] also proved that this algorithm is certain to terminate. The idea of simulating from the past is important. Note that if we collect $F_0^T(\cdot)$ as the result, where $T$ is the smallest value that makes $F_0^T(\cdot)$ a constant, we will get a biased sample. This is because $T$ is a stopping time, which is not independent of the Markov chain.
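A minimal sketch of the basic CFTP idea for a toy two-state chain (the transition matrix and the inverse-CDF update $\phi$ are illustrative choices); note how the same $U_t$ values are reused each time the starting point is pushed further into the past:

```python
import random

# Basic CFTP for a small ergodic chain on {0, ..., n-1}, using the
# inverse-CDF update phi(i, u): move to the state j whose cumulative
# probability interval in row i contains u.
P = [[0.5, 0.5],
     [0.2, 0.8]]          # toy transition matrix; stationary dist is (2/7, 5/7)

def phi(i, u):
    c = 0.0
    for j, p in enumerate(P[i]):
        c += p
        if u < c:
            return j
    return len(P[i]) - 1

def cftp(rng):
    us = []                                   # us[k] stores U_{-(k+1)}; reused on restarts
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())           # extend the sequence further into the past
        states = list(range(len(P)))          # start a chain in every state at time -T
        for t in range(-T, 0):                # apply f_{-T}, ..., f_{-1}
            u = us[-t - 1]
            states = [phi(s, u) for s in states]
        if all(s == states[0] for s in states):
            return states[0]                  # coalesced: exact draw from pi
        T *= 2

rng = random.Random(3)
draws = [cftp(rng) for _ in range(20000)]
```

Replacing `us[-t - 1]` with fresh random numbers on each restart would reintroduce exactly the stopping-time bias described above.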

#### **3.2 Read-once CFTP**

The basic CFTP algorithm needs to restart the Markov chains at some points in the past if they have not coalesced by time 0. We must use the same random sequence $\{U_t\}_{-\infty}^{-1}$ when we restart the Markov chains. In Monte Carlo simulations, we usually use *pseudo*random number generators, which are deterministic algorithms. So if we give the same random seed, we will get the same random sequence. This means that the same sequence $\{U_t\}$ can be regenerated in each coupling procedure.

If we can run the Markov chain forward starting at 0 and collect a perfect sample in the future, we will not have to regenerate $\{U_t\}$. Wilson [20] developed a read-once CFTP method to implement the forward coupling idea. A simple example is provided by [21]. In fact, the multigamma coupler in [15] can be implemented via the more efficient read-once CFTP algorithm.


#### **3.3 Improvement of CFTP algorithms**

Propp and Wilson [4] showed that the computational cost of the algorithm can be reduced if there is a partial order for the state space $\mathcal{X}$ that is preserved by the update function $\phi$, i.e. if $x \le y$ then $\phi(x, U) \le \phi(y, U)$. Their procedure is outlined in Algorithm 3.2, where as before $\hat{0}$ and $\hat{1}$ are the unique extremals. Note that their algorithm needs a monotone update function $\phi$ for the Markov chain, while Algorithm 2.3 requires a monotone update function $\tilde{\phi}$ for the time-reversed chain.

**Algorithm 3.2 (Monotone CFTP)**

    T = 1
    repeat
        upper = 1̂
        lower = 0̂
        for t = -T to -1
            upper = φ(upper, U_t)
            lower = φ(lower, U_t)
        T = 2T
    until upper = lower
    return upper


Algorithm 3.2 is much simpler than Algorithm 3.1, since only two chains have to be run at the same time, but the requirement of monotonicity is very restrictive. Markov chains with transitions given by independent Metropolis-Hastings and perfect slice sampling have been shown to have this property by [22, 23], respectively. However, [23, 24] have also noticed that such independent M-H CFTP is equivalent to a simple rejection sampler.
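The monotone variant can be sketched for a toy lazy random walk on $\{0, \ldots, N\}$ (an illustrative chain whose stationary distribution is uniform and whose update preserves the natural order), tracking only the two extremal chains:

```python
import random

# Monotone CFTP for a lazy random walk on {0, ..., N}. The update is
# monotone: x <= y implies phi(x, u) <= phi(y, u), so it suffices to track
# the two chains started from 0-hat = 0 and 1-hat = N.
N = 4

def phi(i, u):
    if u < 1 / 3:
        return min(i + 1, N)   # step up (reflect at N)
    elif u < 2 / 3:
        return max(i - 1, 0)   # step down (reflect at 0)
    return i                   # stay

def monotone_cftp(rng):
    us = []                            # us[k] = U_{-(k+1)}, reused when restarting further back
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        upper, lower = N, 0
        for t in range(-T, 0):
            u = us[-t - 1]
            upper = phi(upper, u)
            lower = phi(lower, u)
        if upper == lower:
            return upper               # sandwiched chains agree: exact draw
        T *= 2

rng = random.Random(4)
draws = [monotone_cftp(rng) for _ in range(15000)]
```

Because every other chain is sandwiched between `lower` and `upper`, their coalescence forces coalescence of all $N + 1$ chains at once.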

In general it is hard to code perfect slice samplers correctly. For example, Hörmann and Leydold [25] have pointed out that the perfect slice samplers in [26, 27] are incorrect. The challenge of monotone CFTP is usually to construct the detailed updating function with a guarantee of preserving the partial order.

Finding a partial order preserved by the Markov chains is a non-trivial task in many cases. An alternative improvement is to use CFTP with bounding chains, such as that in [28, 29]. If the bounding chains, which bound all the Markov chains, coalesce, then all Markov chains coalesce. Thus if only a few bounding chains are required, the efficiency of the CFTP algorithm can be improved significantly. Sometimes, it may be impossible to define an explicit bounding chain (the maximum of the state space may be infinity, and the upper bound chain cannot start from infinity), but it is possible to use a dominated process to bound all Markov chains [30].

#### **3.4 Applications and challenges**

Although CFTP is extremely challenging to implement for many practical problems, it did find a few applications in certain discrete state space problems, for


example, the Ising model [4]. Also [31] applied CFTP to the ancestral selection graph to simulate samples from population genetic models. Refs. [32, 33] applied CFTP to a class of fork-join type queuing system problems. Connor and Kendall [34] applied CFTP to the perfect simulation of M/G/c queues. CFTP also finds application in signal processing [35].

CFTP for continuous state space Markov chains is very challenging, since a random map from an interval to a finite number of points is required. In recent years, many methods have been developed for unbounded continuous state space Markov chains, such as the perfect slice sampler in [23] and the multigamma coupler and bounded M-H coupler in [15, 24]. Wilson [36] developed a layered multi-shift coupling, which shifts states in an interval to a finite number of points. However, none of these methods can solve any practical problems.

#### **4. Recent advances in perfect sampling**

Recently, a new type of perfect Monte Carlo sampling method, based on the decomposition of the target density $f$ as $f(\cdot) = g_1(\cdot)g_2(\cdot)$, was proposed in [12], where $g_1$ and $g_2$ are (proportional to) proper density functions. Note that here $g_1$ and $g_2$ are continuous density functions which are easy to simulate from. Suppose that $q$-dimensional vector values $x_1$ and $x_2$ are independently drawn from $g_1$ and $g_2$, respectively. If the two independent samples are equal, i.e. $x_1 = x_2 = y$, then $y$ must be a draw from $f(\cdot) \propto g_1(\cdot)g_2(\cdot)$. Such a naive approach may be practical for discrete random variables with a low-dimensional state space, but for continuous random variables it is impossible, since $P(x_1 = x_2) = 0$. Dai [12] proposed a novel approach to deal with this, which is explained in the following subsection.
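The naive matching idea is easy to check on a toy discrete example (the sub-densities below are illustrative):

```python
import random

# Naive "matching" sampler for a discrete product density f ∝ g1 * g2:
# draw x1 ~ g1 and x2 ~ g2 independently; if they coincide, the common
# value is an exact draw from f. (For continuous variables this fails,
# since P(x1 = x2) = 0.)
g1 = {0: 0.5, 1: 0.3, 2: 0.2}     # toy sub-densities on {0, 1, 2}
g2 = {0: 0.2, 1: 0.3, 2: 0.5}

def draw(pmf, rng):
    u, c = rng.random(), 0.0
    for x, p in pmf.items():
        c += p
        if u < c:
            return x
    return x                       # guard against floating-point round-off

def product_sample(rng):
    while True:
        x1, x2 = draw(g1, rng), draw(g2, rng)
        if x1 == x2:
            return x1              # exact draw from f ∝ g1 * g2

rng = random.Random(5)
draws = [product_sample(rng) for _ in range(20000)]
```

Here $f \propto (0.10, 0.09, 0.10)$, i.e. $(10/29, 9/29, 10/29)$ after normalization, and the empirical frequencies match that target.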

#### **4.1 Perfect distributed Monte Carlo without using hat functions**

First we introduce the following notations related to the logarithm of the sub-densities:

$$\alpha(x) = \left(\alpha^{(1)}, \cdots, \alpha^{(q)}\right)^{tr}(x) = \nabla \log g_1(x) \tag{4}$$

where $\nabla$ is the partial derivative operator for each component of $x$. Then we consider a $q$-dimensional diffusion process $X_t(\vec{\omega})$, $t \in [0, T]$ ($T < \infty$), defined on the $q$-dimensional continuous function space $\mathbf{\Omega}$, given by:

$$dX_t = \alpha(X_t)dt + dB_t, \tag{5}$$

where $B_t(\vec{\omega}) = \omega_t$ is a Brownian motion and $\vec{\omega} = \{\omega_t\}$, $t \in [0, T]$, is a typical element of $\mathbf{\Omega}$. Let $\mathbb{W}$ be the probability measure for a Brownian motion with the initial probability distribution $B_0 = w_0 \sim f^1(\cdot) = g_1^2(\cdot)$.

Clearly $X_t$ has the invariant distribution $f^1(x)$ (using the Langevin diffusion results [37]). Let $\mathbb{Q}$ be the probability law induced by $X_t$, $t \in [0, T]$, with $X_0 = \omega_0 \sim f^1(\cdot)$, i.e. under $\mathbb{Q}$ we have $X_t \sim f^1(x)$ for any $t \in [0, T]$.

The idea in [12] is to use a *biased* diffusion process $\mathbf{X} = \{X_t;\ 0 \le t \le T\}$ to simulate from the target density $f$. It is defined as follows.

Definition 4.1 (Biased Langevin diffusions) The joint density for the pair $(X_0, X_T)$ (the starting and ending points of the biased diffusion process), evaluated at the point $(x, y)$, is $f^1(x)\, \mathbf{t}^*(y \mid x)\, f^2(y)$, where $\mathbf{t}^*(y \mid x)$ is the transition density for the diffusion process $X$ defined in Eq. (5) from $X_0 = x$ to $X_T = y$, and $f^2(y) = g_2(y)/g_1(y)$.

uses *g*<sup>2</sup> as the proposal density function, which does not have to bound the target *f*. However, it requires a bound for the derivatives of the logarithm of the subdensities (see Condition 4.1). This is usually easier to get in practice, since the logarithm of the posterior is usually easy to deal with. Also [12] noted that we should choose sub-densities *g*<sup>1</sup> and *g*<sup>2</sup> as similar as possible, in order to achieve high


*A Review on the Exact Monte Carlo Simulation DOI: http://dx.doi.org/10.5772/intechopen.88619*


Definition 4.1 (Biased Langevin diffusions) The joint density for the pair $(X_0, X_T)$ (the starting and ending points of the biased diffusion process), evaluated at the point $(x, y)$, is $f_1(x)\,\mathbf{t}^{*}(y|x)\,f_2(y)$, where $\mathbf{t}^{*}(y|x)$ is the transition density for the diffusion process $X$ defined in Eq. (6) from $X_0 = x$ to $X_T = y$, and $f_2(y) = g_2(y)/g_1(y)$.

Given $(X_0, X_T)$, the process $(X_t, 0 < t < T)$ is given by the diffusion bridge driven by Eq. (6).

The marginal distribution for $X_T$ is $f(y)$. Therefore, to draw a sample from the target distribution $f(x)$, we need to simulate a process $(X_t, t \in [0, T])$ from $\overline{\mathbb{Q}}$ (the law induced by $X$); then $X_T \sim f(x)$.

Simulation from $\overline{\mathbb{Q}}$ can be done via rejection sampling. We can use a *biased Brownian motion* $(B_t; 0 \le t \le T)$ as the proposal diffusion:

Definition 4.2 (Biased Brownian motion) The starting and ending points $(B_0, B_T)$ follow a distribution with a density $h(x, y)$, and $(B_t; 0 < t < T)$ is a Brownian bridge given $(B_0, B_T)$.

Under certain mild conditions, Dai [12] proved the following lemma.

Lemma 4.1 Let $\mathbb{Z}$ be the probability law induced by $(B_t; 0 \le t \le T)$. By letting

$$h(\omega_0, \omega_T) = g_2(\omega_T)\, g_1(\omega_0)\, \frac{1}{\sqrt{2\pi T}}\, e^{-\frac{\|\omega_T - \omega_0\|^2}{2T}} \tag{6}$$

we have

$$\frac{d\overline{\mathbb{Q}}}{d\mathbb{Z}}\left(\overrightarrow{\omega}\right) \propto \exp\left[-\frac{1}{2}\int_{0}^{T} \left(\|\alpha\|^{2} + \operatorname{div}\alpha\right)(\omega_t)\, dt\right] \tag{7}$$

where div is the divergence operator.

Condition 4.1 There exists $l > -\infty$ such that

$$\frac{1}{2}\left(\|\alpha\|^{2} + \operatorname{div}\alpha\right)(x) - l \ge 0. \tag{8}$$

Under Condition 4.1, the ratio (7) can be rewritten as

$$\frac{d\overline{\mathbb{Q}}}{d\mathbb{Z}}\left(\overrightarrow{\omega}\right) \propto \exp\left[-\int_{0}^{T} \left(\frac{1}{2}\left(\|\alpha\|^{2} + \operatorname{div}\alpha\right)(\omega_t) - l\right) dt\right], \tag{9}$$

which has a value no more than 1.

Therefore we can use rejection sampling to simulate from $\overline{\mathbb{Q}}$, with proposal measure $\mathbb{Z}$. The acceptance probability (9) can be dealt with using methods similar to those in [9, 11]. The algorithm is presented below; see [12, 38] for more details.

**Algorithm 4.1 (Simple distributed sampler)**

01 Simulate $(\omega_0, \omega_T)$ from the density $h$.
02 Simulate the biased Brownian bridge $(B_t, t \in (0, T))$.
03 Accept $\omega_T$ as a sample from $f$, with the acceptance probability given by Eq. (9); if rejected, go back to step 01.


Such a method is a rejection sampling algorithm, but it does not require finding a hat function to bound the target density, which is usually the main challenge of traditional rejection sampling for complicated target densities. The above algorithm uses $g_2$ as the proposal density function, which does not have to bound the target $f$. However, it requires a bound for the derivatives of the logarithm of the sub-densities (see Condition 4.1). This is usually easier to obtain in practice, since the logarithm of the posterior is usually easy to deal with. Also, [12] noted that we should choose the sub-densities $g_1$ and $g_2$ to be as similar as possible, in order to achieve a high acceptance probability.
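As an illustration of the above sampler, the following minimal sketch targets $f \propto g_1 g_2$ in one dimension, taking $g_1 = g_2$ to be standard normal densities (a hypothetical choice, not from the text). It approximates the path-integral acceptance probability in Eq. (9) by a Riemann sum over a discretised Brownian bridge; the exact algorithm would instead use Poisson thinning of the path, as in [9, 11].

```python
import numpy as np

# Sketch of Algorithm 4.1 for f ∝ g1*g2 in one dimension. Assumed setup
# (not from the text): g1 = g2 = N(0, 1) densities, so f ∝ exp(-x^2),
# i.e. N(0, 1/2). The Langevin drift associated with g1 is taken as
# alpha(x) = (log g1)'(x) = -x, so (1/2)(alpha^2 + div alpha)(x) = (x^2 - 1)/2
# and Condition 4.1 holds with l = -1/2, giving phi(x) = x^2 / 2.

rng = np.random.default_rng(0)
T, n_steps = 0.5, 50
dt = T / n_steps

def g(x):                        # common sub-density g1 = g2 = N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def phi(x):                      # (1/2)(alpha^2 + div alpha)(x) - l
    return 0.5 * x ** 2

def draw_from_h():
    # h(x, y) ∝ g1(x) N(y; x, T) g2(y), Eq. (6): rejection using g2 <= (2*pi)^{-1/2}
    while True:
        x = rng.standard_normal()                     # x ~ g1
        y = x + np.sqrt(T) * rng.standard_normal()    # y | x ~ N(x, T)
        if rng.uniform() < g(y) * np.sqrt(2.0 * np.pi):
            return x, y

def sample_f():
    t = np.linspace(0.0, T, n_steps + 1)
    while True:
        x, y = draw_from_h()                                      # step 01
        w = np.concatenate(([0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(n_steps))))
        bridge = x + (t / T) * (y - x) + (w - (t / T) * w[-1])    # step 02
        # step 03: accept with (approximate) probability exp(-∫ phi(B_t) dt), Eq. (9)
        if rng.uniform() < np.exp(-np.sum(phi(bridge[:-1])) * dt):
            return y

samples = np.array([sample_f() for _ in range(1500)])
```

The empirical mean and variance of `samples` should be close to $0$ and $1/2$, the moments of the target $N(0, 1/2)$, up to discretisation and Monte Carlo error.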

Dai [12] focused on the simple decomposition $f = g_1 g_2$, although it was mentioned that, for the more general decomposition $f = g_1 g_2 \cdots g_C$, a recursive method can be used. Unfortunately, a naive recursive method is very inefficient. A more sophisticated method is introduced in the following section.

#### **4.2 Monte Carlo fusion for distributed analysis**

A more efficient and sophisticated method, named Monte Carlo fusion, was proposed recently in [13]. Suppose that we consider

$$f(x) \propto g_1(x) \cdots g_C(x), \tag{10}$$

where each $g_c(x)$ ($c \in \{1, \dots, C\}$) is a density (up to a multiplicative constant). Here $C$ denotes the number of parallel computing cores available in big data problems, and each $g_c(x)$ is the sub-posterior density based on a subset of the big data. In group decision problems, $C$ is the number of different decisions to be combined, and $g_c(x)$ stands for the decision from each group member.

Dai et al. [13] considered simulating from the following density on an extended space:

$$g\left(x^{(1)}, \dots, x^{(C)}, y\right) = \prod_{c=1}^{C} \left[ g_c^2\left(x^{(c)}\right) \cdot p_c\left(y \mid x^{(c)}\right) \cdot \frac{1}{g_c(y)} \right], \tag{11}$$

which admits the marginal density $f$ for $y$. Here $p_c(y \mid x^{(c)})$ is the transition density from $x^{(c)}$ to $y$ for the Langevin diffusion defined in Eq. (6) associated with each sub-density $g_c$.

Dai et al. [13] considered a rejection sampling approach with proposal density proportional to the function

$$h\left(x^{(1)}, \dots, x^{(C)}, y\right) = \prod_{c=1}^{C} \left[ g_c\left(x^{(c)}\right) \right] \cdot \exp\left(-\frac{C\, \|y - \bar{x}\|^2}{2T}\right), \tag{12}$$

where $\bar{x} = C^{-1}\sum_{c=1}^{C} x^{(c)}$ and $T$ is an arbitrary positive constant.

Simulation from the proposal $h$ can be achieved directly. In particular, $x^{(1)}, \dots, x^{(C)}$ are first drawn independently from $g_1, \dots, g_C$, respectively, and then $y$ is simply a Gaussian random variable centred on $\bar{x}$. This is a distributed analysis or divide-and-conquer approach. Detailed acceptance probabilities and rejection sampling algorithms can be found in [13].
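For concreteness, the proposal draw can be sketched as follows, assuming (purely for illustration) $C = 3$ one-dimensional Gaussian sub-posteriors $g_c$; as a function of $y$, the exponential factor in Eq. (12) is a $N(\bar{x}, T/C)$ kernel.

```python
import numpy as np

# Direct simulation from the proposal h in Eq. (12). The sub-posterior
# means and standard deviations below are hypothetical illustrations.
rng = np.random.default_rng(1)
mu = np.array([-1.0, 0.0, 1.0])       # assumed sub-posterior means
sd = np.array([1.0, 0.5, 2.0])        # assumed sub-posterior std devs
C, T = len(mu), 1.0

def draw_from_h():
    x = rng.normal(mu, sd)                    # x^(c) ~ g_c independently, c = 1..C
    x_bar = x.mean()                          # x-bar = C^{-1} sum_c x^(c)
    y = rng.normal(x_bar, np.sqrt(T / C))     # y | x ~ N(x-bar, T/C), from Eq. (12)
    return x, y

ys = np.array([draw_from_h()[1] for _ in range(5000)])
```

Each proposed draw would then be subjected to the rejection step of [13]; only the proposal step is sketched here.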

The above fusion approach arises in modern statistical methodologies for 'big data'. A full dataset is artificially split into a large number of smaller data sets, inference is conducted on each smaller data set, and the results are combined (see, for instance, [39–46]). The advantage of such an approach is that inference on each small data set can be conducted in parallel, so the heavy computational cost of algorithms such as MCMC is not a concern. Traditional methods suffer from the weakness that the fusion of the separately conducted inferences is inexact. However, the Monte Carlo fusion in [13] is an exact simulation algorithm and does not have any approximation weakness.


The above fusion approach also arises in a number of other settings where distributed analysis arises naturally. For example, in signal processing, distributed multi-sensor networks may be used in network fusion systems, and a fusion approach arises naturally to combine the results from different sensors [47].

#### **5. Conclusion**

Although perfect simulation usually refers to correcting the statistical errors of samples drawn via MCMC, it actually covers a much wider area beyond CFTP. In fact, for certain applications, it is often possible to construct other types of perfect sampling methods which are much more efficient than CFTP. For example, for the exact simulation of the posterior of simple mixture models, the geometric-arithmetic mean (GAM) method in [19] is much more efficient than the CFTP approach in [48]. Details of the GAM method are provided in the Appendix. Also, the random walk construction for the exact simulation of random spanning trees [7] is much more efficient than the CFTP version.

Bayesian computational algorithms keep evolving, particularly in the current big data era. Although almost all newly developed algorithms are approximate simulation algorithms, perfect sampling is still one of the key driving forces behind new Bayesian computational algorithms, and it can quickly motivate new classes of 'mainstream' algorithms. More focus should be given to methods beyond CFTP, for example, the fusion type of algorithms.

The Monte Carlo fusion method has the potential to be used in many Bayesian big data applications. For example, for large car accident data, the response variable is usually a categorical variable representing the seriousness of the accident, and a generalized linear regression model is often used. Under a Bayesian framework, we may estimate the posterior distribution of the regression parameters via such a fusion approach. Then the posterior mean, the posterior median, or other characteristics of the posterior distribution can be estimated using the Monte Carlo samples. Also, such an algorithm is a perfect sampling algorithm, and no convergence justification is needed, since it always provides realizations exactly from the target distribution.

#### **Appendix**

#### **A. Geometric-arithmetic mean method for simple mixture model**

Observations from a simple mixture model are assumed to be either discrete or continuous. The density function of an individual observation *y* has the form

$$f(y; \mathbf{p}) = \sum_{k=1}^{K} p_k f_k(y), \quad \text{where} \quad \sum_{k=1}^{K} p_k = 1 \quad \text{and} \quad p_k > 0,\ k = 1, \dots, K. \tag{13}$$

We assume that the component weights $\mathbf{p} = (p_1, \dots, p_K)$ are unknown parameters and that the number of components, $K$, and the component densities, $f_k$, are all known. We focus on perfect sampling from the posterior distribution of $\mathbf{p}$.
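The density (13) is straightforward to evaluate once the component densities are fixed; the snippet below uses a hypothetical two-component normal mixture purely for illustration.

```python
import numpy as np

# Mixture density of Eq. (13): f(y; p) = sum_k p_k f_k(y), with normal
# component densities. The weights and component parameters are hypothetical.
p = np.array([0.3, 0.7])                       # component weights, sum to 1
mu, sd = np.array([0.0, 2.0]), np.array([1.0, 0.5])

def f(y):
    comps = np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return float(np.dot(p, comps))             # sum_k p_k f_k(y)
```

Since each $f_k$ integrates to one and the weights sum to one, $f$ is itself a probability density.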


Suppose that we have $N$ observations, $y_1, \dots, y_N$. Let $L_{nk} = f_k(y_n)$ and assume that the prior distribution of $\mathbf{p}$ is Dirichlet:

$$\pi_0(\mathbf{p}) \propto \prod_{k=1}^{K} p_k^{\alpha_k - 1}, \quad \alpha_k > 0,\ k = 1, \dots, K. \tag{14}$$

Then the posterior distribution is given by

$$f(\mathbf{p}|\mathbf{y}) \propto \pi\_0(\mathbf{p}) \prod\_{n=1}^{N} \left(\sum\_{k=1}^{K} p\_k L\_{nk}\right) I\_{\mathcal{X}}(\mathbf{p}),\tag{15}$$

where $\mathcal{X} = \left\{\mathbf{p} : \sum_{k=1}^{K} p_k = 1,\ p_k > 0,\ k = 1, \dots, K\right\}$.

There are several ways to carry out perfect sampling from Eq. (15). The first method is based on CFTP [48]. An alternative perfect sampling method for simple mixture models is introduced in [49]. A third approach is to use adaptive rejection sampling [3, 18], since the posterior is log-concave. We may also use the ratio-of-uniforms method. However, none of these methods is more efficient than the geometric-arithmetic mean method in [19].

#### **A.1 Geometric-arithmetic mean method**

Suppose that $\mathbf{p}^{*}$, the MLE of $\mathbf{p}$, is unique and, for simplicity, assume the prior $\pi_0(\mathbf{p})$ is uniform. Define $a_{nk} = L_{nk} / \sum_{k} p_k^{*} L_{nk}$. The posterior density of $\mathbf{p}$ is then given by

$$f(\mathbf{p}|\mathbf{y}) \propto h(\mathbf{p}|\mathbf{y}) = \prod\_{n=1}^{N} \left(\sum\_{k=1}^{K} p\_k a\_{nk}\right) I\_{\mathcal{X}}(\mathbf{p}),\tag{16}$$

where $\mathcal{X}$ is as defined following Eq. (15).

Let $I_n$ be a random element of $\arg\max_k L_{nk}$. Define $A_j = \{n : I_n = j\}$ and let $\mathbf{n} = (n_1, \dots, n_K)$, where $n_j$ is the number of elements in $A_j$.

Define $\mathbf{M} = (m_{jk})$ with $m_{jk} = \left(\sum_{n \in A_j} a_{nk}\right)/n_j$. If $n_j = 0$, then set $m_{jj} = 1$ and $m_{jk} = 0$ for $j \ne k$. We now make two assumptions, which we will return to later on:

**A:** $\mathbf{M}$ is invertible.

**B:** The elements of $\mathbf{v} = \left(\mathbf{M}^T\right)^{-1}\mathbf{1}$ are all positive.

Under these assumptions, we will show that the following rejection sampler generates simulated values from the posterior distribution of $\mathbf{p}$. First we define $\mathbf{V}$ to be the diagonal matrix with diagonal elements $\mathbf{v}^T = (v_1, \dots, v_K)$.

**Algorithm 6.1 (GAM sampler)**

01 Sample $\mathbf{q}$ from the Dirichlet distribution with parameter $\mathbf{n} + \mathbf{1}$.
02 Sample $U$ from $\mathrm{Unif}[0, 1]$.
03 Calculate $\mathbf{p}$ with $\mathbf{p} = \mathbf{M}^{-1}\mathbf{V}^{-1}\mathbf{q}$.
04 If $U \le h(\mathbf{p}|y) \big/ \prod_{j=1}^{K} \left(q_j/v_j\right)^{n_j}$,
05 accept $\mathbf{p}$ and stop;
06 else
07 reject $\mathbf{p}$ and go to 01.
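Algorithm 6.1 is simple to implement. The sketch below runs it on synthetic data from a hypothetical two-component normal mixture, plugging the true weights in for the MLE $\mathbf{p}^{*}$ for brevity and evaluating the acceptance ratio on the log scale to avoid underflow.

```python
import numpy as np

# Runnable sketch of the GAM sampler (Algorithm 6.1). Assumptions for
# illustration: two well-separated normal components with known means and
# unit variances, uniform prior on p, and the true weights used as p*.
rng = np.random.default_rng(2)
K, N = 2, 100
p_star = np.array([0.4, 0.6])                 # stand-in for the MLE of p
mu = np.array([0.0, 3.0])                     # known component means, sd = 1

z = rng.choice(K, size=N, p=p_star)           # simulate mixture data
y = rng.normal(mu[z], 1.0)
L = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) / np.sqrt(2.0 * np.pi)

a = L / (L @ p_star)[:, None]                 # a_nk = L_nk / sum_k p*_k L_nk
I = np.argmax(L, axis=1)                      # I_n: index of the largest L_nk
n_vec = np.bincount(I, minlength=K)           # n_j = |A_j|
M = np.vstack([a[I == j].mean(axis=0) for j in range(K)])   # m_jk (all n_j > 0 here)
v = np.linalg.solve(M.T, np.ones(K))          # v = (M^T)^{-1} 1
assert np.all(v > 0), "assumption B fails; rescale M as in A.1"

def gam_sample():
    while True:
        q = rng.dirichlet(n_vec + 1)                    # 01: q ~ Dirichlet(n + 1)
        u = rng.uniform()                               # 02: U ~ Unif[0, 1]
        p = np.linalg.solve(M, q / v)                   # 03: p = M^{-1} V^{-1} q
        if p.min() <= 0:                                # h(p|y) I_X(p) = 0: reject
            continue
        log_h = np.sum(np.log(a @ p))                   # log h(p|y), Eq. (16)
        log_bound = np.sum(n_vec * np.log(q / v))       # log prod_j (q_j/v_j)^{n_j}
        if np.log(u) <= log_h - log_bound:              # 04-05: accept
            return p

samples = np.array([gam_sample() for _ in range(300)])
```

Each accepted draw is an exact realization from the posterior (16); no burn-in or convergence check is needed.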


Proposition 6.1 Under assumptions **A** and **B**, Algorithm 6.1 samples $\mathbf{p}$ with probability density (16).

**Proof:** Since the geometric average is no larger than the arithmetic average, for **p**∈ X, we have

$$h(\mathbf{p}|\mathbf{y}) = \prod\_{n=1}^{N} \left( \sum\_{k=1}^{K} p\_k a\_{nk} \right) = \prod\_{j=1}^{K} \prod\_{n \in A\_j} \left( \sum\_{k=1}^{K} p\_k a\_{nk} \right) \tag{17}$$

$$\leq \prod\_{j=1}^{K} \left( \frac{\sum\_{n \in A\_j} \left( \sum\_{k=1}^{K} p\_k a\_{nk} \right)}{n\_j} \right)^{n\_j},\tag{18}$$


where in the case *nj* ¼ 0, the product term is taken as 1. So that, for **p**∈ X, with *mjk* as previously defined, we have

$$h(\mathbf{p}|\mathbf{y}) \le \prod\_{j=1}^{K} \left(\sum\_{k=1}^{K} p\_k m\_{jk}\right)^{n\_j} \tag{19}$$

$$= \left[\prod\_{j=1}^{K} v\_j^{-n\_j}\right] \prod\_{j=1}^{K} \left(v\_j \sum\_{k=1}^{K} p\_k m\_{jk}\right)^{n\_j} \tag{20}$$

$$=\prod\_{j=1}^{K} \left(q\_j/v\_j\right)^{n\_j},\tag{21}$$

where $q_j = v_j \sum_{k=1}^{K} p_k m_{jk}$, $j = 1, \dots, K$, or equivalently $\mathbf{q} = \mathbf{V}\mathbf{M}\mathbf{p}$. Since $v_j > 0$ and $\sum_{k=1}^{K} p_k m_{jk} > 0$, it follows that $q_j > 0$ for $j = 1, \dots, K$. Furthermore,

$$\sum_{j=1}^{K} q_j = \sum_{j=1}^{K} v_j \sum_{k=1}^{K} p_k m_{jk} = \mathbf{p}^T \mathbf{M}^T \mathbf{v} = \mathbf{p}^T \mathbf{1} = 1,$$

since $\mathbf{M}^T\mathbf{v} = \mathbf{1}$, from the definition of $\mathbf{v}$. It follows that $\mathbf{p} \in \mathcal{X}$ implies $\mathbf{q} \in \mathcal{X}$, so that

$$h(\mathbf{p}|\mathbf{y})\, I_{\mathcal{X}}(\mathbf{p}) \leq \prod_{j=1}^{K} \left(q_j/v_j\right)^{n_j} I_{\mathcal{X}}(\mathbf{q}). \tag{22}$$

Note that the right-hand side of Eq. (22) is proportional to a Dirichlet distribution with parameters $(n_1 + 1, \dots, n_K + 1)$.

Rejection sampling then proceeds as usual:

• A sample $\mathbf{q}$ is drawn from $\mathrm{Dirichlet}(\mathbf{n} + \mathbf{1})$.
• The value $\mathbf{p} = \mathbf{M}^{-1}\mathbf{V}^{-1}\mathbf{q}$ is calculated.
• It is accepted with probability $h(\mathbf{p}|y)\, I_{\mathcal{X}}(\mathbf{p}) \big/ \prod_{j=1}^{K} \left(q_j/v_j\right)^{n_j}$.


We now return to assumptions **A** and **B**. Suppose that $\mathbf{M}$ is invertible but the elements of $\mathbf{v} = \left(\mathbf{M}^T\right)^{-1}\mathbf{1}$ are not all positive. In this case, let


$$\alpha_k = \frac{1}{N} \sum_{n=1}^{N} a_{nk}, \quad \alpha = \max_k\{\alpha_k\}, \quad \mathbf{v} = (\alpha N)^{-1}\mathbf{n} \quad \text{and} \quad \tilde{\mathbf{M}} = \alpha\mathbf{M}\Delta,$$

where $\Delta$ is a diagonal matrix with diagonal elements $(1/\alpha_1, \dots, 1/\alpha_K)$. Note that $\mathbf{v} = \left(\tilde{\mathbf{M}}^T\right)^{-1}\mathbf{1} > 0$. Algorithm 6.1 and its proof can then be modified by replacing $\mathbf{M}$ by $\tilde{\mathbf{M}}$.

Suppose now that assumption **A** does not hold, i.e. **M** is not invertible. This can be remedied by adding positive quantities to the diagonal elements of **M**. This also provides an alternative way of ensuring that the elements of **v** are positive.
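The claim that the rescaled $\tilde{\mathbf{M}}$ always yields positive $\mathbf{v}$ follows from the identity $\mathbf{M}^T\mathbf{n} = N(\alpha_1, \dots, \alpha_K)^T$, which holds for any $\mathbf{M}$ built from group averages. A quick numerical check on synthetic $a_{nk}$ (hypothetical values, purely for illustration):

```python
import numpy as np

# Check of the rescaling at the end of A.1: M^T n = N * (alpha_1,...,alpha_K)^T,
# so with M~ = alpha * M * Delta one gets v = (M~^T)^{-1} 1 = (alpha N)^{-1} n,
# which is positive whenever every A_j is non-empty.
rng = np.random.default_rng(3)
N, K = 60, 3
a = rng.gamma(2.0, size=(N, K))                 # arbitrary positive a_nk
I = np.arange(N) % K                            # group labels: every A_j non-empty
n_vec = np.bincount(I, minlength=K)             # n_j = |A_j|
M = np.vstack([a[I == j].mean(axis=0) for j in range(K)])
alpha_k = a.mean(axis=0)                        # alpha_k = (1/N) sum_n a_nk
alpha = alpha_k.max()                           # alpha = max_k alpha_k
M_tilde = alpha * M @ np.diag(1.0 / alpha_k)    # M~ = alpha * M * Delta
v = np.linalg.solve(M_tilde.T, np.ones(K))      # v = (M~^T)^{-1} 1
print(np.allclose(v, n_vec / (alpha * N)))      # prints True
```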

#### **A.2 Dirichlet priors and** *pseudo* **data**

Suppose that the prior $\pi_0(\mathbf{p})$ is $\mathrm{Dirichlet}(\alpha_1 + 1, \dots, \alpha_K + 1)$, where the $\alpha_i$, $i = 1, \dots, K$, are positive integers, and let $A = \sum_{j=1}^{K} \alpha_j$. The prior can be synthesized by introducing *pseudo* data $\tilde{a}_{mk}$, $m = 1, \dots, A$; $k = 1, \dots, K$, defined as follows:

$$
\tilde{a}_{mk} = \begin{cases}
1 & \text{if } \sum_{j=1}^{k-1} \alpha_j + 1 \le m \le \sum_{j=1}^{k} \alpha_j \\
0 & \text{otherwise,}
\end{cases}
\tag{23}
$$

since


$$\pi\_0(\mathbf{p}) \propto \prod\_{k=1}^{K} p\_k^{a\_k} = \prod\_{m=1}^{A} \left( \sum\_{k=1}^{K} \tilde{a}\_{mk} p\_k \right). \tag{24}$$

With the Dirichlet prior, the posterior distribution given by Eq. (16) can be written as

$$f(\mathbf{p}|\mathbf{y}) \propto \prod\_{l=1}^{N+A} \left(\sum\_{k=1}^{K} p\_k \overline{a}\_{lk}\right) I\_{\mathcal{X}}(\mathbf{p}),\tag{25}$$

where $\{\overline{a}\_{lk}, l = 1, \ldots, N + A\}$ contains the real data $\{a\_{nk}, n = 1, \ldots, N\}$ and the pseudo data $\{\tilde{a}\_{mk}, m = 1, \ldots, A\}$.

The posterior distribution (25) has the same form as Eq. (17). Therefore GAM can be used to sample realizations from the posterior distribution (25).
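The rejection step described above can be sketched in code. The following numpy-based sketch is ours, not the chapter's implementation: it assumes the component densities are known, builds the blocks *A<sub>j</sub>* by assigning each observation to its largest (normalized) component density, and uses a uniform guess in place of the MLE when normalizing the *a<sub>nk</sub>*; all function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gam_sampler(L, n_draws=500):
    """Sketch of a GAM-style rejection sampler for the mixture-weight
    posterior h(p|y) ∝ prod_n (sum_k p_k a_nk) I_X(p).

    L[n, k] = f_k(y_n): known component densities evaluated at the data.
    The block partition and the uniform normalizing guess are assumptions
    of this sketch, not prescriptions of the chapter.
    """
    N, K = L.shape
    a = L / L.mean(axis=1, keepdims=True)        # a_nk under a uniform guess for p*
    groups = a.argmax(axis=1)                    # A_j: observations where component j dominates
    n = np.bincount(groups, minlength=K)         # n_j = #A_j
    assert np.all(n > 0), "each block A_j must be non-empty for this sketch"
    M = np.vstack([a[groups == j].mean(axis=0) for j in range(K)])  # m_jk
    v = np.linalg.solve(M.T, np.ones(K))         # v solves M^T v = 1
    assert np.all(v > 0), "assumption: v = (M^T)^{-1} 1 > 0"
    samples = []
    while len(samples) < n_draws:
        q = rng.dirichlet(n + 1)                 # Dirichlet(n + 1) proposal
        p = np.linalg.solve(M, q / v)            # p = M^{-1} V^{-1} q
        if np.any(p < 0):                        # indicator I_X(p): reject off-simplex draws
            continue
        log_h = np.log(a @ p).sum()              # log h(p|y), up to a constant
        log_hat = (n * (np.log(q) - np.log(v))).sum()  # log of the hat bound
        if np.log(rng.uniform()) < log_h - log_hat:
            samples.append(p)
    return np.asarray(samples)

# toy check: two well-separated known components, true weights (0.3, 0.7)
y = np.concatenate([rng.normal(0, 1, 90), rng.normal(4, 1, 210)])
L = np.column_stack([np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
                     for m in (0.0, 4.0)])
samples = gam_sampler(L)
```

Because the hat function dominates the posterior for every **p** on the simplex, the accepted draws target the posterior exactly; note also that **p** automatically sums to one, since **1**<sup>T</sup>**M**<sup>−1</sup>**V**<sup>−1</sup>**q** = **v**<sup>T</sup>**V**<sup>−1</sup>**q** = **1**<sup>T</sup>**q** = 1.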

#### **A.3 Simulation results and discussion**

We compare the running times of mixture models with different sample sizes (*N*) and numbers of components (*K*) in **Table 1**. The components have specified normal distributions with means $\mu = (\mu\_1, \ldots, \mu\_K)$ and variances $\sigma^2 = (\sigma\_1^2, \ldots, \sigma\_K^2)$. The prior on **p** is uniform. We sample 10,000 realizations from the posterior of the models.

When $K = 3, 4$, we simulate *N* observations from a three-component normal mixture with $\mu = (0, 0, 2)$, $\sigma^2 = (1, 4, 1)$, and mixture weights $\mathbf{p}\_0 = (1/2, 1/3, 1/6)$. We then either sample from the posterior distribution of **p** using the same distributional components, in the case $K = 3$, or sample from the posterior distribution of **p** with an additional component having mean $\mu\_4 = 4$ and variance $\sigma\_4^2 = 4$, in the case $K = 4$.
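As a concrete illustration, the data-generating step for the $K = 3$ setting can be written in a few lines of numpy; the function name and seed are ours, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_mixture(N, mu, sigma2, p0):
    """Draw N observations from the normal mixture sum_k p0[k] * N(mu[k], sigma2[k])."""
    z = rng.choice(len(p0), size=N, p=p0)          # latent component labels
    return rng.normal(np.asarray(mu)[z], np.sqrt(np.asarray(sigma2))[z])

# the three-component setting of the text:
# mu = (0, 0, 2), sigma^2 = (1, 4, 1), p0 = (1/2, 1/3, 1/6)
y = simulate_mixture(1000, [0.0, 0.0, 2.0], [1.0, 4.0, 1.0], [1/2, 1/3, 1/6])
```

The same helper covers the $K = 5$ and $K = 6$ settings below by passing the corresponding mean, variance, and weight vectors.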

When $K = 5$, observations are simulated from the normal mixture distribution with components having means $\mu = (-2, 0, 4, 2, 3)$, variances $\sigma^2 = (1, 1, 4, 1, 4)$, and $\mathbf{p}\_0 = (0.35, 0.3, 0.1, 0.2, 0.05)$. Samples from the posterior distribution of **p** are drawn assuming the same components. Similar calculations are carried out for $K = 6$, where $\mu = (0, 3, 2, -2, -4, 5)$, $\sigma^2 = (1, 1, 1, 1, 1, 4)$, and $\mathbf{p}\_0 = (0.05, 0.3, 0.3, 0.1, 0.08, 0.17)$, again assuming that the component distributions are known.

From **Table 1**, we can see that the GAM algorithm, while using very little memory, is highly efficient in running time. The last row of the table is the estimated acceptance probability of the GAM algorithm. The algorithm is very efficient when the component densities are known. We can see this not only by simulation but also from theoretical considerations, as follows.

#### **A.3.1 Explanation of efficiency**

When $\mathbf{v} = (\mathbf{M}^T)^{-1}\mathbf{1} > 0$, we are able to use **M** directly without modification to construct the hat function, thereby speeding up the calculations. In the simulations of the previous section, this was always found to be the case. Now we explain why this should be so.

If the maximum likelihood estimator of **p** is consistent, then when *N* ! ∞,

$$\frac{1}{n\_j} \sum\_{n \in A\_j} \left| \frac{L\_{nk}}{\sum\_{k=1}^{K} p\_k^{\*} L\_{nk}} - \frac{L\_{nk}}{\sum\_{k=1}^{K} p\_k L\_{nk}} \right| \stackrel{p}{\to} 0.\tag{26}$$

Assuming sufficient regularity, we also have

$$m\_{jk} = \frac{\sum\_{n \in A\_j} a\_{nk}}{n\_j} \tag{27}$$

$$= \frac{1}{n\_j} \sum\_{n \in A\_j} \frac{L\_{nk}}{\sum\_{k=1}^{K} p\_k^{\*} L\_{nk}} \tag{28}$$

$$\stackrel{p}{\longrightarrow} E\left(\frac{f\_k(Y)}{f(Y)} \Big| \mathcal{L}\_j\right) \tag{29}$$

$$= \frac{\int\_{\mathcal{L}\_j} f\_k(y)\, dy}{\gamma\_j}, \quad \text{as } N \to \infty, \tag{30}$$

where $Y$ has density $f(y) = \sum\_{k=1}^{K} p\_k f\_k(y)$, $\mathcal{L}\_j = \{ y \mid f\_j(y) \geq f\_k(y), k = 1, \ldots, K \}$, and $\gamma\_j = \int\_{\mathcal{L}\_j} f(y)\, dy$.

| *K* | 3 | 3 | 4 | 4 | 6 | 6 |
|---|---|---|---|---|---|---|
| *N* | 400 | 1000 | 400 | 1000 | 400 | 1000 |
| Fearnhead's | 242 s | 3610 s | \* | \* | \* | \* |
| Leydold's | ≤1 s | 3.6 s | \* | \* | \* | \* |
| RoU | 16.11 s | 28.16 s | 31.18 s | 68.33 s | 88.60 s | 152.76 s |
| GAM | 4 s | 9 s | 11 s | 16 s | 6 s | 11 s |
| GAM AP | 0.7472 | 0.7509 | 0.2433 | 0.3088 | 0.5325 | 0.5505 |

**Table 1.**
*Running times (in s). Fearnhead's algorithm, Leydold's algorithm and ratio-of-uniforms (RoU); GAM method; GAM acceptance probability (AP). The \* indicates that Fearnhead's method and Leydold's method will not run on a standard desktop when K* ¼ 4 *and K* ¼ 5*.*

Therefore

$$\mathbf{M}^T \xrightarrow{p} \mathbf{W}^T,\tag{31}$$

where

$$\mathbf{W} = \left\{ w\_{jk} \right\}, w\_{jk} = \frac{\int\_{\mathcal{L}\_j} f\_k(\boldsymbol{y}) d\boldsymbol{y}}{\boldsymbol{\gamma}\_j}. \tag{32}$$

Let $\overline{\mathbf{v}} = (\gamma\_1, \ldots, \gamma\_K)$, then

$$\mathbf{W}^T \overline{\mathbf{v}} = \begin{bmatrix} \sum\_{j=1}^K \int\_{\mathcal{L}\_j} f\_1(\ \mathbf{y}) d\mathbf{y} \\ \vdots \\ \sum\_{j=1}^K \int\_{\mathcal{L}\_j} f\_K(\mathbf{y}) d\mathbf{y} \end{bmatrix} = \begin{bmatrix} \int f\_1(\mathbf{y}) d\mathbf{y} \\ \vdots \\ \int f\_K(\mathbf{y}) d\mathbf{y} \end{bmatrix} = \mathbf{1},\tag{33}$$

where the second equality holds because $\cup\_{j=1}^{K} \mathcal{L}\_j = (-\infty, \infty)$. So there exists $\overline{\mathbf{v}} > 0$ satisfying $\mathbf{W}^T \overline{\mathbf{v}} = \mathbf{1}$. Using Eq. (32), we can conclude that when $N$ is large enough, there also exists $\mathbf{v} \approx \overline{\mathbf{v}} > 0$ satisfying $\mathbf{M}^T \mathbf{v} = \mathbf{1}$.

Since $\gamma\_j = \int\_{\mathcal{L}\_j} f(y)\, dy$ and $n\_j = \#A\_j$, we have $n\_j / N \stackrel{p}{\to} \gamma\_j$. When each $n\_j$, $j = 1, \ldots, K$, is large, if a random sample **q** is drawn from a Dirichlet distribution with parameter $\mathbf{n} + \mathbf{1}$, then each $q\_j$ is approximately equal to $n\_j / N \approx \gamma\_j$. Furthermore, $v\_j \approx \gamma\_j$, so **q** satisfies

$$\mathbf{V}^{-1}\mathbf{q} \approx \mathbf{1},\tag{34}$$

and then,

$$\mathbf{p} = \mathbf{M}^{-1} \mathbf{V}^{-1} \mathbf{q} \approx \mathbf{M}^{-1} \mathbf{1} = \mathbf{p}^\*.\tag{35}$$

If **p** is approximately equal to the mode **p**<sup>∗</sup> , the two sides of the inequality,

$$h(\mathbf{p}|\boldsymbol{y}) = \prod\_{n=1}^{N} \left( \sum\_{k=1}^{K} p\_k a\_{nk} \right) \le \left[ \prod\_{j=1}^{K} v\_j^{-n\_j} \right] \prod\_{j=1}^{K} q\_j^{n\_j},\tag{36}$$

are approximately equal as well. Thus, the closer the sampled realization **p** is to **p**<sup>∗</sup> , the larger the acceptance probability is. So the algorithm runs very rapidly, since the sampled values of **p** are always around the mode **p**<sup>∗</sup> .

This algorithm requires calculating the MLE, which can be performed very quickly since the likelihood function is log-concave. In fact an approximate *guess* for **p**<sup>∗</sup> will suffice. The more accurate the guess is, the more efficient the algorithm will be.

The method performs well when the component densities are correctly specified, as explained in the previous section. For these same reasons, we would expect the algorithm to perform poorly under misspecification. Details of robustness to misspecification can be found in [19].
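The MLE (or the approximate guess) mentioned above can be computed by a short EM-type fixed-point iteration; because the weight log-likelihood is concave in **p**, any reasonable ascent scheme reaches **p**<sup>∗</sup>. This is a sketch under our own naming, not code from the chapter:

```python
import numpy as np

rng = np.random.default_rng(2)

def weight_mle(L, iters=200):
    """EM fixed-point for the mixture weights p with known component
    densities L[n, k] = f_k(y_n). Since the log-likelihood
    sum_n log(sum_k p_k L_nk) is concave in p, the iteration converges
    to the global maximiser p*."""
    N, K = L.shape
    p = np.full(K, 1.0 / K)                  # uniform starting guess
    for _ in range(iters):
        w = L * p                            # E-step: unnormalised responsibilities
        w /= w.sum(axis=1, keepdims=True)
        p = w.mean(axis=0)                   # M-step: average responsibility
    return p

# recover the weights of a two-component mixture with known N(0,1) and N(4,1) parts
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 700)])
L = np.column_stack([np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)
                     for m in (0.0, 4.0)])
p_star = weight_mle(L)
```

A rough starting value (even the uniform guess after a handful of iterations) is typically close enough to **p**<sup>∗</sup> to keep the acceptance probability high.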


#### **Author details**

Hongsheng Dai

University of Essex, Colchester, UK

\*Address all correspondence to: hdaia@essex.ac.uk

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*A Review on the Exact Monte Carlo Simulation DOI: http://dx.doi.org/10.5772/intechopen.88619*

#### **References**

[1] Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association. 1996; **91**:883-904

[2] Fill JA. An interruptible algorithm for perfect sampling via Markov chains. The Annals of Applied Probability. 1998;**8**:131-162

[3] Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Applied Statistics. 1992;**41**:337-348

[4] Propp JG, Wilson DB. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms. 1996;**9**:223-252

[5] Fill JA, Machida M, Murdoch DJ, Rosenthal JS. Extension of Fill's perfect rejection sampling algorithm to general chains. Random Structures and Algorithms. 2000;**17**:290-316

[6] Propp JG, Wilson DB. How to get an exact sample from a generic Markov chain and sample a random spanning tree from a directed graph, both within the cover time. Journal of Algorithms. 1998;**27**:170-217

[7] Aldous DJ. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics. 1990;**3**:450-465

[8] Wilson DB. Generating random spanning trees more quickly than the cover time. In: Annual ACM Symposium on Theory of Computing Archive, Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. 1996. pp. 296-303

[9] Beskos A, Papaspiliopoulos O, Roberts GO, Fearnhead P. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. Journal of the Royal Statistical Society B. 2006;**68**:333-382

[10] Beskos A, Roberts GO. Exact simulation of diffusions. The Annals of Applied Probability. 2005;**15**:2422-2444

[11] Beskos A, Papaspiliopoulos O, Roberts GO. A factorisation of diffusion measure and finite sample path constructions. Methodology and Computing in Applied Probability. 2008;**10**:85-104

[12] Dai H. A new rejection sampling method without using hat function. Bernoulli. 2017;**23**:2434-2465

[13] Dai H, Pollock M, Roberts G. Monte Carlo fusion. Journal of Applied Probability. 2019;**56**:174-191

[14] Tan A, Doss H, Hobert JP. Honest importance sampling with multiple Markov chains. Journal of Computational and Graphical Statistics. 2015;**24**(3):792-826

[15] Green PJ, Murdoch DJ. Exact sampling for Bayesian inference: Towards general purpose algorithms. Bayesian Statistics. 1998;**6**:301-321

[16] Wakefield JC, Gelfand AE, Smith AFM. Efficient generation of random variates via the ratio-ofuniforms method. Statistics and Computing. 1991;**1**:129-133

[17] Robert C, Casella G. Monte Carlo Statistical Methods. Springer Science & Business Media; 2013

[18] Leydold J. A rejection technique for sampling from log-concave multivariate distributions. Modeling and Computer Simulation. 1998;**8**(3):254-280. Available from: citeseer.nj.nec.com/ leydold98rejection.html


[19] Dai H. Perfect simulation methods for Bayesian applications [PhD thesis], University of Oxford. 2007

[20] Wilson DB. How to couple from the past using a read-once source of randomness. Random Structures and Algorithms. 2000;**16**:85-113

[21] Cai H. Exact sampling using auxiliary variables. In: Statistical Computing Section of ASA Proceedings. 1999

[22] Corcoran JN, Tweedie RL. Perfect sampling from independent Metropolis-Hastings chains. Journal of Statistical Planning and Inference. 2000;**104**(2): 297-314

[23] Mira A, Møller J, Roberts GO. Perfect slice samplers. Journal of the Royal Statistical Society B. 2001;**63**:593-606

[24] Murdoch DJ, Green PJ. Exact sampling from a continuous state space. Scandinavian Journal of Statistics. 1998; **25**:483-502

[25] Hörmann W, Leydold J. Improved perfect slice sampling. Technical Report. 2003

[26] Casella G, Mengersen KL, Robert CP, Titterington DM. Perfect samplers for mixtures of distributions. Journal of the Royal Statistical Society B. 2002;**64**:777-790

[27] Phillipe A, Robert CP. Perfect simulation of positive Gaussian distributions. Statistics and Computing. 2003;**13**:179-186

[28] Huber M. Perfect sampling using bounding chains. The Annals of Applied Probability. 2004;**14**:734-753

[29] Møller J. Perfect simulation of conditionally specified models. Journal of the Royal Statistical Society, B. 1999; **61**:251-264

[30] Kendall WS, Møller J. Perfect simulation using dominating processes on ordered spaces, with application to locally stable point processes. Advances in Applied Probability. 2000;**32**(3):844-865


[31] Fearnhead P. Perfect simulation from population genetic models with selection. Theoretical Population Biology. 2001;**59**(4):263-279

[32] Dai H. Exact Monte Carlo simulation for fork-join networks. Advances in Applied Probability. 2011;**43**(2):484-503

[33] Dai H. Exact simulation for fork-join networks with heterogeneous service. International Journal of Statistics and Probability. 2015;**4**(1):19-32

[34] Connor SB, Kendall WS. Perfect simulation of M/G/c queues. Advances in Applied Probability. 2015;**47**(4):1039-1063

[35] Djuric PM, Huang Y, Ghirmai T. Perfect sampling: A review and applications to signal processing. IEEE Transactions on Signal Processing. 2002; **50**(2):345-356

[36] Wilson DB. Layered multishift coupling for use in perfect sampling algorithms (with a primer on CFTP). In: Madras N, editor. Monte Carlo Methods. Vol. 26. Fields Institute Communications. American Mathematical Society; 2000. pp. 141-176

[37] Hansen NR. Geometric ergodicity of discrete-time approximations to multivariate diffusions. Bernoulli. 2003; **9**(4):725-743

[38] Dai H. Exact simulation for diffusion bridges: An adaptive approach. Journal of Applied Probability. 2014;**51**(2):346-358

[39] Agarwal A, Duchi JC. Distributed delayed stochastic optimization. In: 51st IEEE Conference on Decision and Control; Maui Hawaii, USA. 2012


[40] Li C, Srivastava S, Dunson DB. Simple, scalable and accurate posterior interval estimation. Biometrika. 2017;**104**(3):665-680

[41] Minsker S, Srivastava S, Lin L, Dunson DB. Scalable and robust Bayesian inference via the median posterior. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014. pp. 1656-1664

[42] Neiswanger W, Wang C, Xing E. Asymptotically exact, embarrassingly parallel MCMC. In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (2014). 2014. pp. 623-632

[43] Scott SL, Blocker AW, Bonassi FV, Chipman HA, George EI, McCulloch RE. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management. 2016;**11**(2):78-88

[44] Srivastava S, Cevher V, Tan-Dinh Q, Dunson DB. WASP: Scalable Bayes via barycenters of subset posteriors. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2016). 2016. pp. 912-920

[45] Stamatakis A, Aberer AJ. Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 2013. DOI: 10.1109/IPDPS.2013.70

[46] Wang X, Dunson DB. Parallelizing MCMC via Weierstrass Sampler. arXiv preprint arXiv:1312.4605. 2013

[47] Uney M, Clark DE, Julier SJ. Distributed fusion of PHD filters via exponential mixture densities. IEEE Journal of Selected Topics in Signal Processing. 2013;**7**(3):521-531

[48] Hobert J, Robert C, Titterington D. On perfect simulation for some mixtures of distributions. Statistics and Computing. 1999;**9**:287-298

[49] Fearnhead P. Direct simulation for discrete mixture distributions. Statistics and Computing. 2005;**15**(2): 125-133


Section 3

## Bayesian Inference for Complicated Data


By extending the traditional random effects models, recent research focus has shifted to study heterogeneous random effects or nonparametric distributions of random effects, which arise because of skewness of data, missing covariates, or unmeasurable subject-specific covariates [6]. The extended random effects models, termed semiparametric random effects models, improve statistical performance with added interpretability. Bayesian techniques, which provide a convenient means to model non-Gaussian distributions, have recently been proposed for semiparametric random effects model in a variety of settings ([7, 8], among others).

#### **Chapter 4**

### Bayesian Analysis for Random Effects Models

*Junshan Shen and Catherine C. Liu*

#### **Abstract**

Random effects models have been widely used to analyze correlated data sets, and Bayesian techniques have emerged as a powerful tool to fit these models. However, there is scarce literature that systematically reviews and summarizes the recent advances in Bayesian analyses of random effects models. This chapter reviews the use of the Dirichlet process mixture (DPM) prior to approximate the distribution of random errors within general semiparametric random effects models with parametric random effects, in the longitudinal data setting and the failure time setting separately. In a survival setting with clusters, we propose a new class of nonparametric random effects models motivated by accelerated failure time models. We employ a beta process prior to tackle clustering and estimation simultaneously. We analyze a new data set integrated from an Alzheimer's disease (AD) study to illustrate the presented model and methods.

**Keywords:** beta process, Dirichlet process mixture, clustered data, longitudinal data, random effects, survival outcome, nonparametric transformation model

#### **1. Introduction**

Random effects models have been widely used as a powerful tool for analyzing correlated data [1, 2]. The model features a finite number of random terms acting as latent variables to model unobserved factors; see [3] for a comprehensive review. Some authors have further proposed semiparametric mixed effect models by allowing for infinite dimensional random effects [4, 5]. Most of the aforementioned works draw inferences using frequentist approaches, while Bayesian approaches have been largely ignored because of the lack of computational feasibility and expediency. With the advent of the "supercomputer" era, Bayesian analyses have recently sparked much interest in the setting of random effects models for clustered data or longitudinal settings. However, there is scarce literature that has systematically reviewed the Bayesian works in the area.

By extending the traditional random effects models, recent research focus has shifted to heterogeneous random effects or nonparametric distributions of random effects, which arise because of skewness of data, missing covariates, or unmeasurable subject-specific covariates [6]. The extended random effects models, termed semiparametric random effects models, improve statistical performance with added interpretability. Bayesian techniques, which provide a convenient means to model non-Gaussian distributions, have recently been proposed for semiparametric random effects models in a variety of settings ([7, 8], among others). The discreteness of the Dirichlet process makes it unsuitable as a prior for estimating a density; as a remedy, the Dirichlet process mixture, obtained by convolving the Dirichlet process with a kernel, plays an important role [9].


*Bayesian Analysis for Random Effects Models DOI: http://dx.doi.org/10.5772/intechopen.88822*


For censored outcome data, transformation models, which transform the timeto-event responses using a monotone function and link them to the covariates of interest, have surged as a strong competitor of the Cox model [10]. Moreover, the transformation model framework is fairly general. The Cox model and the proportional odd model [11] can be viewed as nonparametric transformation linear models with some specific error terms; see [12–14]. For correlated data, the transformation model naturally extends the semiparametric random effects model by directly incorporating random effects to the transformation functions, treating them as realizations of an underlying random function. Bayesian analyses have found much use in this new area. For example, the beta process has been found to be a reasonable candidate for the prior of the monotone transformation function [15–17].

This chapter focuses on the Bayesian analysis of the transformed linear model with censored data and in a clustered setting. In many biomedical studies, the observations are naturally clustered. For example, patients in observational studies can be grouped in analysis according to a variety of factors, such as age, race, gender, and hospital, in order to reduce the confounding effects. Following Mallick and Walker [18], we explore using a mixture of beta distributions and the beta process as the candidates for the prior distribution of the random transformation function [17, 19, 20].

The rest of this chapter is structured as follows. Section 2 reviews the use of the Bayesian approach to infer parametric random effects models. In the setting of survival analysis, Section 3 proposes a beta process prior to fit random effects model with nonparametric transformation functions, and Section 4 applies the method to study the progression of Alzheimer's disease (AD). Section 5 concludes the chapter with future research directions.

#### **2. Dirichlet process mixture prior**

In parametric random effects models, we consider the situation where the distributional form of the random error term is unknown. The Dirichlet process mixture (DPM) is used as the prior for the baseline distribution because the error terms are continuous random variables in most situations.

#### **2.1 Linear mixed effects model**

With a longitudinal data set {(**y**<sub>*i*</sub>, **x**<sub>*i*</sub>, **z**<sub>*i*</sub>)}, we posit a mixed effects model with an AR(1) serial correlation structure:

$$\begin{aligned} \mathbf{y}_i &= \mathbf{x}_i \boldsymbol{\beta} + \mathbf{z}_i \mathbf{b}_i + \mathbf{w}_i, \quad i = 1, \dots, m; \\ \mathbf{w}_i &= (w_{i1}, \dots, w_{in_i})^T; \quad w_{ij} = \rho\, w_{i,j-1} + \epsilon_{ij}, \quad j = 2, \dots, n_i, \end{aligned} \tag{1}$$

where **y**<sub>*i*</sub> = (*y*<sub>*i*1</sub>, … , *y*<sub>*in<sub>i</sub>*</sub>)<sup>*T*</sup> with *y*<sub>*ij*</sub> being the *j*th response of the *i*th subject for *i* = 1, … , *m*, *β* is a *p* × 1 vector of fixed effect parameters, **b**<sub>*i*</sub> is a *q* × 1 Gaussian random vector representing the subject-specific random effects, **x**<sub>*i*</sub> and **z**<sub>*i*</sub> are *n*<sub>*i*</sub> × *p* and *n*<sub>*i*</sub> × *q* design matrices linking *β* and **b**<sub>*i*</sub> to **y**<sub>*i*</sub>, respectively, **w**<sub>*i*</sub> = (*w*<sub>*i*1</sub>, … , *w*<sub>*in<sub>i</sub>*</sub>)<sup>*T*</sup> is an *n*<sub>*i*</sub> × 1 vector of model errors, *ρ* is the autoregressive coefficient, and the ϵ<sub>*ij*</sub> are i.i.d. noises. When ϵ<sub>*ij*</sub> is non-normal, we assume a mixture model:


$$f_G(\epsilon \mid \sigma^2) = \int \varphi(\epsilon \mid u, \sigma^2)\, dG(u),\tag{2}$$

where *φ*(· | *u*, *σ*<sup>2</sup>) is the probability density function of a normal random variable with mean *u* and variance *σ*<sup>2</sup>, and *G* is an unspecified probability distribution of *u* satisfying ∫ *u* d*G*(*u*) = 0, which ensures that ϵ comes from a mean-zero mixture distribution.
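As a quick illustration, the error model of Eq. (2) combined with the AR(1) structure of model (1) can be simulated with a truncated stick-breaking construction. This is a sketch with illustrative hyperparameters (the truncation level, concentration, and kernel scale are assumptions, not values from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dpm_errors(n, alpha=1.0, sigma=0.5, L=50):
    """Draw n errors from a mean-zero DPM: eps ~ sum_l w_l N(u_l, sigma^2),
    with the atoms re-centered so that the constraint int u dG(u) = 0 holds."""
    v = rng.beta(1.0, alpha, size=L)                        # stick-breaking fractions
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    w /= w.sum()                                            # renormalize the truncation
    u = rng.normal(0.0, 1.0, size=L)                        # atoms from the base measure
    u -= w @ u                                              # enforce the mean-zero constraint
    comp = rng.choice(L, size=n, p=w)                       # mixture component labels
    return rng.normal(u[comp], sigma)

def ar1_model_errors(n_i, rho=0.6):
    """Serial errors of model (1): w_{ij} = rho * w_{i,j-1} + eps_{ij}."""
    eps = draw_dpm_errors(n_i)
    w = np.empty(n_i)
    w[0] = eps[0]
    for j in range(1, n_i):
        w[j] = rho * w[j - 1] + eps[j]
    return w

w_i = ar1_model_errors(10)                                  # one subject's error vector
```

The re-centering step `u -= w @ u` is what distinguishes this from a generic DPM draw: it imposes the identifiability constraint ∫ *u* d*G*(*u*) = 0 on the truncated measure.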

Replacing the Dirichlet process by an equivalent Pólya urn representation, [8] employed an empirical likelihood approach with the moment constraints and developed a posterior adjusted Gibbs sampler for more precise estimation. The algorithm is computationally feasible.

#### **2.2 Accelerated failure time model**

We shift gears to study survival outcomes with a cluster structure. Denote the data set by (*T*<sub>*ij*</sub>, *X*<sub>*ij*</sub>), *i* = 1, ⋯, *K*, *j* = 1, ⋯, *n*<sub>*i*</sub>, where *T*<sub>*ij*</sub> is the failure time of the *j*th subject in the *i*th cluster and *X*<sub>*ij*</sub> is a vector of associated covariates. To accommodate such data, we utilize a general accelerated failure time model:

$$\log T_{ij} = \mathbf{X}_{ij}^T \boldsymbol{\beta} + \varepsilon_{ij}, \qquad i = 1, \cdots, K \quad \text{and} \quad j = 1, \cdots, n_i,\tag{3}$$

where *β* is a vector of *p*-dimensional regression coefficients of interest and the *ε*<sub>*ij*</sub> are independent random errors following the distribution with density *f*<sub>*i*</sub>. Li et al. [7] posed an exponential tilt on the distributions of the error terms to incorporate the cluster heterogeneity. That is,

$$\frac{f\_i(t)}{f\_1(t)} = \exp\left(\theta\_{0i} + \theta\_i^T q(t)\right), \qquad i = 2, \cdots, K,\tag{4}$$

where *q*(*t*) is a *q*-dimensional vector of prespecified functions containing potential covariate information and *θ*<sub>*i*</sub> is the corresponding parameter vector, with

$$\theta_{0i} = \log\left[\left(\int \exp\left(\theta_i^T q(t)\right) f_1(t)\, dt\right)^{-1}\right].$$

Thus, *θ*<sub>*i*</sub> represents the parametric random effects in the model. Li et al. [7] place the DPM prior on the baseline density *f*<sub>1</sub> to develop a set of procedures that improves estimation efficiency through information pooling.
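For intuition, the normalizing constant *θ*<sub>0*i*</sub> of Eq. (4) can be evaluated numerically. The sketch below assumes an illustrative standard-normal baseline density *f*<sub>1</sub> and *q*(*t*) = *t* (both assumptions for demonstration only) and verifies that the tilted density integrates to one:

```python
import numpy as np

def integrate(y, x):
    """Composite trapezoidal rule, kept explicit for portability."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

t = np.linspace(-10.0, 10.0, 4001)                  # integration grid
f1 = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)     # assumed baseline density f1

def theta0(theta_i):
    """theta_{0i} = log[(int exp(theta_i * q(t)) f1(t) dt)^(-1)] with q(t) = t."""
    return -np.log(integrate(np.exp(theta_i * t) * f1, t))

def tilted_density(theta_i):
    """f_i(t) = exp(theta_{0i} + theta_i * q(t)) f1(t), cf. Eq. (4)."""
    return np.exp(theta0(theta_i) + theta_i * t) * f1

mass = integrate(tilted_density(0.7), t)            # should equal 1 by construction
```

With a Gaussian baseline and linear *q*, the tilt simply shifts the mean, so the total mass stays at one; the numerical check confirms the role of *θ*<sub>0*i*</sub> as a normalizer.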

#### **3. Beta process prior**

We now present a nonparametric random effects model for the clustered survival data with nonparametric monotone link functions. We employ a beta process as the prior for the baseline function.

Let *T*<sub>*ij*</sub> denote the failure time of the *j*th subject in the *i*th cluster, *X*<sub>*ij*</sub> the covariate vector for the subject, and *C*<sub>*ij*</sub> the potential censoring time of the *j*th subject in the *i*th cluster. Assume that *C*<sub>*ij*</sub> is independent of the failure time *T*<sub>*ij*</sub>. Let *Z*<sub>*ij*</sub> = min(*T*<sub>*ij*</sub>, *C*<sub>*ij*</sub>) and let *δ*<sub>*ij*</sub> = *I*(*T*<sub>*ij*</sub> < *C*<sub>*ij*</sub>) be the censoring indicator. Then the observed data can be described as

$$(Z_{ij}, \delta_{ij}, X_{ij}), \quad i = 1, \cdots, n; \quad j = 1, \cdots, n_i.\tag{5}$$

Within each cluster, *Tij* is linked to *Xij* via the following transformation model:

$$\log H_i(T_{ij}) = X_{ij}^T \boldsymbol{\beta} + \log \varepsilon_{ij}, \quad i = 1, 2, \cdots, n,\tag{6}$$


where *ε*<sub>*ij*</sub> are i.i.d. variables with a known density function *f*<sub>*ε*</sub>(·) and the *H*<sub>*i*</sub>(*t*) are unknown cluster-specific monotone functions, which are i.i.d. realizations of a random function and can be viewed as a nonparametric version of random effects for independent clusters. In a parametric setting, if we set *H*<sub>*i*</sub>(*t*) = *t* exp(−*b*<sub>*i*</sub>) with *b*<sub>*i*</sub> being a cluster-specific random effect, Eq. (6) reduces to a classical random effects model, which has been discussed in Section 2.2. The challenge, however, lies in how to draw inferences in such a nonparametric setting.

To proceed, let the coefficient vector *β* be a *p*-dimensional unknown vector of interest. We further assume the *H*<sub>*i*</sub>'s are differentiable with derivative *h*<sub>*i*</sub>(*t*) = *H*′<sub>*i*</sub>(*t*), and then the likelihood based on the observed data is

$$L(\boldsymbol{\beta}, H_1, \dots, H_n \mid \text{data}) = \prod_{i=1}^n \prod_{j=1}^{n_i} p\left(T_{ij}, X_{ij}, \delta_{ij} \mid H_i, \boldsymbol{\beta}\right),\tag{7}$$

where

$$p(t, \boldsymbol{x}, \delta \mid H, \boldsymbol{\beta}) = \left( f_{\varepsilon}\left( H(t) e^{-\boldsymbol{x}^T \boldsymbol{\beta}} \right) h(t)\, e^{-\boldsymbol{x}^T \boldsymbol{\beta}} \right)^{\delta} S_{\varepsilon}\left( H(t) e^{-\boldsymbol{x}^T \boldsymbol{\beta}} \right)^{1-\delta}.$$

Here *S*<sub>*ε*</sub> is the survival function of *ε*, defined by *S*<sub>*ε*</sub>(*s*) = *P*(*ε* ≥ *s*).
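To make the likelihood contribution concrete, the sketch below codes log *p*(*t*, **x**, *δ* | *H*, *β*) for an assumed unit-exponential error (so *f*<sub>*ε*</sub>(*s*) = *S*<sub>*ε*</sub>(*s*) = e<sup>−*s*</sup>) and an illustrative monotone transformation *H*(*t*) = *t*²; both choices are demonstration assumptions, not the chapter's fitted model:

```python
import numpy as np

def loglik_contrib(t, x, delta, beta, H, h):
    """log p(t, x, delta | H, beta) for model (6) with unit-exponential error:
    f_eps(s) = exp(-s) and S_eps(s) = exp(-s)."""
    s = H(t) * np.exp(-x @ beta)            # H(t) * exp(-x^T beta)
    if delta == 1:                          # observed failure: density term + Jacobian
        return -s + np.log(h(t)) - x @ beta
    return -s                               # censored: log survival S_eps(s)

beta = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
H = lambda t: t**2                          # illustrative monotone transformation
h = lambda t: 2.0 * t                       # its derivative
ll_event = loglik_contrib(3.0, x, 1, beta, H, h)
ll_cens = loglik_contrib(3.0, x, 0, beta, H, h)
```

Note how the event case carries the Jacobian term *h*(*t*) e<sup>−**x**ᵀ*β*</sup> from transforming *T* to the error scale, while a censored observation contributes only the survival probability.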

We develop a Bayesian inference procedure based on model (6). We assume that the regression coefficient *β* follows a normal prior:

$$
\beta \sim \mathcal{N}\_p \left( \mathbf{0}, \sigma\_\beta^2 I\_p \right), \tag{8}
$$

where *I*<sub>*p*</sub> is the *p* × *p* identity matrix. Since *H*<sub>*i*</sub> is assumed differentiable, we model it with a kernel convolution:

$$H\_i = \int \Phi\_{\sigma}(\cdot - s) dB\_i(s),$$

where *B*<sub>*i*</sub> is an increasing function and Φ<sub>*σ*</sub> is the zero-mean normal distribution function with variance *σ*<sup>2</sup>. Hence, the derivative of *H*<sub>*i*</sub> is

$$h\_i = \int \phi\_{\sigma}(\cdot - s) dB\_i(s)$$

with *ϕ*<sub>*σ*</sub>(*t*) = *σ*<sup>−1</sup>*ϕ*(*t*/*σ*). This mimics the idea of DPM: smoothing the beta process by convolution.

We are now in a position to select an appropriate stochastic process as the prior of *B*<sub>*i*</sub>. The beta process, as studied by [16, 17], is an ideal candidate for the prior of a monotone function. Specifically, a beta process BP(*γ*, *B*<sub>0</sub>) with concentration parameter *γ* and base measure *B*<sub>0</sub> is an increasing Lévy process with independent increments of the form

$$dB(t) \sim \text{Beta}\left(\gamma dB\_0(t), \gamma(1 - dB\_0(t))\right).$$

Teh et al. [20] showed that a sample from BP(*γ*, *B*<sub>0</sub>) can be represented as

$$B_i(y) = \sum_{l=1}^{\infty} p_{il}\, I(\theta_{il} \le y),\tag{9}$$

where *p*<sub>*il*</sub> = ∏<sup>*l*</sup><sub>*j*=1</sub> *ν*<sub>*ij*</sub> and (*θ*<sub>*il*</sub>, *ν*<sub>*il*</sub>) follows

$$
\theta_{il} \sim B_0, \quad \nu_{il} \sim \text{Beta}(\gamma, 1), \quad l = 1, 2, \cdots.
$$

In practice, we need to approximate samples from BP(*γ*, *B*<sub>0</sub>) with a finite-dimensional form. Since the beta process BP(*γ*, *B*<sub>0</sub>) can be represented by the stick-breaking process defined in Eq. (9), a natural approximation is obtained by retaining its first *L* components. That is,

$$B\_i^\* := \sum\_{l=1}^L p\_{il} \delta\_{\theta\_{il}},$$

with *p*<sub>*il*</sub> = ∏<sup>*l*</sup><sub>*j*=1</sub> *ν*<sub>*ij*</sub>, *l* = 1, ⋯, *L*. Denote *ξ*<sub>*i*</sub> = (*ν*<sub>*i*1</sub>, ⋯, *ν*<sub>*iL*</sub>, *θ*<sub>*i*1</sub>, ⋯, *θ*<sub>*iL*</sub>)<sup>*T*</sup> and define

$$H_{\sigma}^{*}(z, \xi_i) = \sum_{l=1}^{L} p_{il}\, \Phi_{\sigma}(z - \theta_{il}), \qquad h_{\sigma}^{*}(z, \xi_i) = \sum_{l=1}^{L} p_{il}\, \phi_{\sigma}(z - \theta_{il}).$$
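The truncated construction can be sketched directly. Here the base measure *B*<sub>0</sub> is taken to be an Exp(1) distribution on the positive line and *σ* = 0.3, both illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(1)

def draw_xi(L=20, gamma=1.0):
    """Draw xi_i = (nu_i1..nu_iL, theta_i1..theta_iL):
    nu_il ~ Beta(gamma, 1) and theta_il ~ B0 (here an assumed Exp(1) base)."""
    nu = rng.beta(gamma, 1.0, size=L)
    theta = rng.exponential(1.0, size=L)
    return nu, theta

def H_star(z, nu, theta, sigma=0.3):
    """H*_sigma(z, xi) = sum_l p_il Phi_sigma(z - theta_il), p_il = prod_{j<=l} nu_ij."""
    p = np.cumprod(nu)
    Phi = np.array([0.5 * (1.0 + erf(v / (sigma * sqrt(2.0)))) for v in z - theta])
    return float(p @ Phi)

def h_star(z, nu, theta, sigma=0.3):
    """h*_sigma(z, xi) = sum_l p_il phi_sigma(z - theta_il): the smoothed derivative."""
    p = np.cumprod(nu)
    phi = np.exp(-0.5 * ((z - theta) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))
    return float(p @ phi)

nu, theta = draw_xi()
H_vals = [H_star(z, nu, theta) for z in (0.5, 1.0, 2.0)]   # nondecreasing in z
```

Because every weight *p*<sub>*il*</sub> is positive and Φ<sub>*σ*</sub> is nondecreasing, the resulting *H*<sup>*</sup><sub>*σ*</sub> is automatically a monotone transformation, which is exactly the property the prior is meant to enforce.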

The approximated posterior based on the truncated beta process is

$$\pi(\boldsymbol{\beta}) \prod\_{i=1}^{n} \left[ \pi^{\boldsymbol{\xi}}(\boldsymbol{\xi}\_{i}) \prod\_{j=1}^{n\_{i}} f\left(\boldsymbol{Z}\_{ij}, \boldsymbol{X}\_{ij}, \delta\_{ij} | \boldsymbol{\beta}, \boldsymbol{\xi}\_{i}\right) \right],\tag{10}$$

where

$$f(z, \boldsymbol{x}, \delta \mid \boldsymbol{\beta}, \boldsymbol{\xi}) = \left( f_{\varepsilon}\left( H_{\sigma}^{*}(z, \boldsymbol{\xi}) \exp\left(-\boldsymbol{x}^T \boldsymbol{\beta}\right) \right) h_{\sigma}^{*}(z, \boldsymbol{\xi}) \exp\left(-\boldsymbol{x}^T \boldsymbol{\beta}\right) \right)^{\delta} \times \left( S_{\varepsilon}\left( H_{\sigma}^{*}(z, \boldsymbol{\xi}) \exp\left(-\boldsymbol{x}^T \boldsymbol{\beta}\right) \right) \right)^{1-\delta}.$$

The samples for *β* and (*ξ*<sub>1</sub>, … , *ξ*<sub>*n*</sub>) based on the posterior can be obtained with Markov chain Monte Carlo (MCMC) [21]. In our simulation, we use the R package mcmc (https://cran.r-project.org/web/packages/mcmc/index.html) to draw samples for *ξ*<sub>1</sub>, … , *ξ*<sub>*n*</sub> and *β* and use the Metropolis algorithm with a normal working distribution.
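The Metropolis step can be sketched generically. The toy sampler below targets an arbitrary log-posterior with a normal random-walk proposal; it is a minimal stand-in for the full blocked sampler over (*β*, *ξ*<sub>1</sub>, …, *ξ*<sub>*n*</sub>), shown here on an assumed standard-normal target:

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis(logpost, init, n_iter=2000, step=0.5):
    """Random-walk Metropolis with a normal working (proposal) distribution."""
    cur = np.asarray(init, dtype=float)
    lp = logpost(cur)
    draws = np.empty((n_iter, cur.size))
    for it in range(n_iter):
        prop = cur + rng.normal(0.0, step, size=cur.size)  # symmetric proposal
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:           # Metropolis acceptance
            cur, lp = prop, lp_prop
        draws[it] = cur
    return draws

# Toy target: a standard normal "posterior" for a 2-dim beta.
draws = metropolis(lambda b: -0.5 * float(b @ b), np.zeros(2))
```

Because the proposal is symmetric, the acceptance ratio needs only the difference of log-posteriors, so the (intractable) normalizing constant of the posterior in Eq. (10) never has to be evaluated.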

#### **4. An application to Alzheimer's disease neuroimaging initiative**

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite cooperative study for the purpose of improving the prevention and treatment of Alzheimer's disease. The subjects in the study fall into three groups, cognitively normal (CN) individuals, mild cognitive impairment (MCI) patients, and early AD patients. ADNI provides a rich array of patients' information, including functional magnetic resonance imaging (fMRI), positron emission tomography (PET), longitudinal functional cognitive tests scores, blood samples, genetics data, and censored failure time outcomes. Details of the study can be found at http://adni.loni.usc.edu.

We focus on the MCI group. MCI is recognized as a transitional stage between normal cognition and Alzheimer's disease. The failure time is defined to be the time at which an MCI patient is diagnosed with AD, which is censored if the patient remains at the MCI stage at the end of the follow-up. Wide heterogeneities are exhibited among the failure times, which may be due to demographics and a variety of functional clinical biomarkers, such as the brain areas of the hippocampus, ventricles, and entorhinal cortex. The goal of the analysis is to study the impact of risk factors on progression to AD.


Using the same data as analyzed by [14], we demonstrate our methodology by modeling the failure time (the observed time of AD diagnosis from MCI stage in year) of 281 MCI patients on gender (0 = female, 1 = male), years of education, the number of apolipoprotein E alleles (0, 1, or 2), and the baseline hippocampal volume.

As age is a strong confounder whose functional form of impact has not reached consensus, we elect to model its impact nonparametrically. Specifically, we use age to form two strata (below and above the median age) and use model (6) to estimate the stratum-specific transformation functions and the effects of the other covariates. For comparison, we also fit model (6) with age as a continuous variable and with a common transformation function; that is, we do not assume the data are clustered. For both models, the regression errors *ε* are assumed to follow an exponential distribution with mean 10. In our calculation, we approximate the BPs by a finite truncation with *L* = 20. We assume the precision parameter *α* = 1 and assign the scale parameter *σ*<sup>2</sup> the prior *π*(*σ*<sup>2</sup>) ∝ 1/*σ*<sup>2</sup>.

**Figure 1** illustrates the estimated transformation function *H* of the failure time without clustering. The posterior means (PM) and standard errors (SE) of the regression coefficients in the model are reported in **Table 1**. We run the MCMC for 20,000 iterations with the first 4,000 draws discarded as burn-in samples and use Geweke's statistic to ensure the convergence of the chains.

**Figure 1.** *Smoothed transformation function without clustering.*
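Geweke's statistic compares the means of an early and a late segment of each chain. A minimal sketch (using naive variance estimates rather than the spectral-density estimates a full implementation would use):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Z-score comparing the mean of the first 10% of a chain with the mean
    of the last 50%; |z| well above 2 suggests non-convergence. Naive
    variances are used here instead of spectral density estimates."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    a = chain[: int(first * n)]
    b = chain[-int(last * n):]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float((a.mean() - b.mean()) / se)

rng = np.random.default_rng(3)
z_stat = geweke_z(rng.normal(size=20000))         # stationary chain: small |z|
z_trend = geweke_z(np.linspace(0.0, 5.0, 20000))  # trending chain: huge |z|
```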

**Figure 2** illustrates the estimated transformation functions with age-stratified data, and **Table 2** summarizes the posterior means and standard errors of the other regression coefficients.

The left curve is relatively flat, while the right curve has a sharper slope. This is consistent with the recognition that AD is a disease of aging: older people above a certain age threshold tend to progress faster from MCI to AD.

Both **Tables 1** and **2** show that none of the biomarkers are significant, whereas they are statistically significant in the analysis of [14]. One possible conjecture is that our nonparametric transformation functions may have well captured the effects of unobserved confounders, which may leave little to be explained by the observed covariates. More thorough investigation is warranted.


| | RID | AGE | PTGENDER | PTEDUCAT | APOE4 | Hipp. |
|---|---|---|---|---|---|---|
| PM | 0.9635 | 0.0069 | 0.1453 | 0.0231 | 0.1817 | 0.2710 |
| SE | 1.3288 | 0.0841 | 1.2331 | 0.1835 | 0.8616 | 0.5333 |

**Table 1.**

*Posterior estimates of regression coefficients with standard errors.*

#### **Figure 2.**

*Smoothed transformation functions with two age strata: The left curve is the smoothed transformation function for the group aged below the average age; the right curve is the smoothed transformation function for the group aged over the average age.*


| | RID | PTGENDER | PTEDUCAT | APOE4 | Hipp. |
|---|---|---|---|---|---|
| PM | 0.6399 | 0.0706 | 0.0072 | 0.1349 | 0.1919 |
| SE | 0.9273 | 0.8491 | 0.1267 | 0.6098 | 0.3716 |

#### **Table 2.**

*Posterior estimates of regression coefficients with standard errors.*

#### **5. Future directions**

Following [12], we can extend the transformation model (6) by allowing the error density *f*<sub>*ε*</sub> to be unspecified. In this case, we need to constrain the regression coefficient *β*, for example *β*<sub>1</sub> = 1 or ∥*β*∥ = 1, for identifiability. We propose to model the error density using a Dirichlet process mixture model:


$$f_{\varepsilon}(t) = \int \varphi(t \mid \mu, \sigma^2)\, dG(\mu, \sigma^2), \quad G \sim \text{DP}\left(\alpha_1,\; G_0 = \mathrm{N}\left(\mu \mid \mu_0, \sigma_0^2\right) \times \mathrm{IG}(a, b)\right),$$

where *φ*(*t* | *μ*, *σ*<sup>2</sup>) is a normal kernel with mean *μ* and variance *σ*<sup>2</sup>, *G* is a sample from the Dirichlet process DP(*α*<sub>1</sub>, *G*<sub>0</sub> = N(*μ* | *μ*<sub>0</sub>, *σ*<sup>2</sup><sub>0</sub>) × IG(*a*, *b*)), *α*<sub>1</sub> is the mass parameter, and IG(· | *a*, *b*) is the inverse gamma distribution with shape parameter *a* and scale parameter *b*.

In a slightly different context, we may also consider clustering observations by developing a new nested beta-Dirichlet process prior with companion MCMC algorithms. As there are limited works on functional random effects models that accommodate clustering structures observed, for example, in neural studies, we may propose a nested Dirichlet process [19] as the prior to cluster cumulative distribution functions. We envision that such a nested Bayesian procedure will provide substantial computational expedience for practitioners and can be applied to studies beyond neurodegenerative and aging diseases.

#### **Acknowledgements**

Shen's research is partially supported by Beijing Natural Science Foundation 1192006 and National Natural Science Foundation of China; Liu's research is partially supported by General Research Fund, Research Grants Council, Hong Kong, 15327216, and the Hong Kong Polytechnic University grant YBTR. Data collection and sharing for this project were funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research Development, LLC.; Johnson & Johnson Pharmaceutical Research Development LLC.; Lumosity; Lundbeck; Merck Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

*Bayesian Analysis for Random Effects Models DOI: http://dx.doi.org/10.5772/intechopen.88822*

Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

#### **Author details**

Junshan Shen<sup>1</sup>† and Catherine C. Liu<sup>2</sup>\*†

1 Capital University of Economics and Trade, Beijing, China

2 The Hong Kong Polytechnic University, Hong Kong SAR

\*Address all correspondence to: macliu@polyu.edu.hk

† These authors contributed equally.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Harville DA. Extension of the Gauss–Markov theorem to include the estimation of random effects. The Annals of Statistics. 1976;**4**:384-395

[2] Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;**38**(2):963-974

[3] Li Y. Random effect models. In: Pham H, editor. Springer Handbook of Engineering Statistics. London: Springer-Verlag; 2006. pp. 687-704

[4] Zeger SL, Diggle PJ. Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics. 1994;**50**(3): 689-699

[5] Li Y, Lin X, Müller P. Bayesian inference in semiparametric mixed models for longitudinal data. Biometrics. 2010;**66**(1):70-78

[6] Li Y, Müller P, Lin X. Center-adjusted inference for a nonparametric Bayesian random effect distribution. Statistica Sinica. 2011;**21**:1201-1223

[7] Li Z, Xu X, Shen J. Semiparametric Bayesian analysis of accelerated failure time models with cluster structures. Statistics in Medicine. 2017;**36**(25): 3976-3989

[8] Shen J, Yu H, Yang J, Liu C. Semiparametric Bayesian analysis for longitudinal mixed effects models with non-normal AR(1) errors. Statistics and Computing. 2019;**29**(3):571-583

[9] Ghosal S, van der Vaart A. Fundamentals of Nonparametric Bayesian Inference. Cambridge: Cambridge University Press; 2017

[10] Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society: Series B. 1972; **34**(2):187-220

[11] Hanson T, Yang M. Bayesian semiparametric proportional odds models. Biometrics. 2007;**63**(1):88

[12] Horowitz JL. Semiparametric estimation of a regression model with an unknown transformation of the dependent variable. Econometrica. 1996;**64**(1):103-137


[13] Linton O, Sperlich S, Keilegom IV. Estimation of a semiparametric transformation model. Annals of Statistics. 2008;**36**(2):686-718

[14] Li K, Luo S. Functional joint model for longitudinal and time-to-event data: An application to Alzheimer's disease. Statistics in Medicine. 2017;**36**(25): 3560-3572

[15] Müller P, Mitra R. Bayesian nonparametric inference why and how. Bayesian Analysis. 2013;**8**(2):269-302

[16] Kalbfleisch JD. Non-parametric Bayesian analysis of survival time data. Journal of the Royal Statistical Society: Series B. 1978;**40**(2):214-221

[17] Hjort N. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics. 1990;**18**(3):1259-1294

[18] Mallick B, Walker S. A Bayesian semiparametric transformation model incorporating frailties. Journal of Statistical Planning and Inference. 2007; **112**(1):159-174

[19] Rodriguez A, Dunson DB, Gelfand AE. The nested Dirichlet process. Journal of the American Statistical Association. 2008;**103**(483): 1131-1154

[20] Teh YW, Görür D, Ghahramani Z. Stick-breaking construction for the Indian buffet process. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics; Vol. 2; 2007. pp. 556-563

[21] Geyer CJ. Introduction to Markov chain Monte Carlo. In: Brooks S, Gelman A, Jones G, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Boca Raton: Chapman and Hall/ CRC; 2011. pp. 3-48


#### **Chapter 5**

## Bayesian Inference of Gene Regulatory Network

*Xi Chen and Jianhua Xuan*

#### **Abstract**

Gene regulatory networks (GRNs) have been studied by computational scientists and biologists for over 20 years to gain a fine map of gene functions. With large-scale genomic and epigenetic data generated under diverse cells, tissues, and diseases, the integrative analysis of multi-omics data plays a key role in identifying causal genes in human disease development. Bayesian inference (or integration) has been successfully applied to inferring GRNs. Learning a posterior distribution, rather than making a single-value prediction, for each model parameter makes Bayesian inference a more robust approach for identifying GRNs from noisy biomedical observations. Moreover, given multi-omics data as input and a large number of model parameters to estimate, the automatic preference of Bayesian inference for simple models that sufficiently explain the data without unnecessary complexity ensures fast convergence to reliable results. In this chapter, we introduce GRN modeling using a hierarchical Bayesian network and then use Gibbs sampling to identify network variables. We apply this model to breast cancer data and identify genes relevant to breast cancer recurrence. In the end, we discuss the potential of Bayesian inference as well as Bayesian deep learning for large-scale and complex GRN inference.

**Keywords:** gene regulatory network, data integration, Bayesian inference, Gibbs sampling, breast cancer

#### **1. Introduction**

The era of "big data" has arrived in the field of computational biology [1]. Biological systems are so complex that, in many situations, it is not feasible to directly measure the target signals. In fact, most biological measurements are noisy and related to, but not exactly, what we aim to find. This is where probability theory comes to our aid: estimating the true signals from noisy measurements in the presence of uncertainty. Bayesian inference has been widely applied in the field of computational biology. In certain systems for which we have a good understanding, e.g., gene regulation, there exist multiple hidden factors behind the observed signals that control how genes behave under a specific condition. As we lack observations of those hidden factors, we model them as parameters in a Bayesian framework, with or without an informative prior. Then, for each parameter, Bayesian inference learns a "posterior" distribution, through which we make a final estimation with a confidence interval.

Bayesian inference can update the shape of the learned posterior distributions for model parameters whenever new data observations arrive, providing enough flexibility for integrative analysis and model extension [2]. Although using more data types means defining more model parameters, Bayesian inference automatically prefers simple models that sufficiently explain the data without unnecessary complexity. This is a very important property for biological data analysis because a simple model is much easier to validate using lab-controlled experiments.
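This sequential updating of a posterior as new observations arrive can be illustrated with a minimal conjugate sketch. The quantities below are hypothetical (a Beta prior on the probability that a TF-gene interaction is active, which is not a distribution used by the chapter's model); the point is only that the posterior's shape updates in closed form as each batch of evidence is absorbed.

```python
# Hypothetical illustration: a Beta prior on an interaction probability,
# updated as batches of binary observations arrive. Beta is conjugate to
# the Bernoulli likelihood, so updating is closed form:
# alpha += successes, beta += failures.

def update_beta(alpha, beta, observations):
    """Return the posterior (alpha, beta) after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

# Start from a flat (non-informative) Beta(1, 1) prior.
alpha, beta = 1.0, 1.0

# A first batch of evidence arrives ...
alpha, beta = update_beta(alpha, beta, [1, 1, 0, 1])
# ... and a later batch refines the same posterior further.
alpha, beta = update_beta(alpha, beta, [1, 0, 1, 1, 1])

posterior_mean = alpha / (alpha + beta)
print(alpha, beta, round(posterior_mean, 3))  # 8.0 3.0 0.727
```

The same logic carries over to the hierarchical model later in the chapter, except that most updates there are not available in closed form and require sampling.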

In this chapter, we introduce how to apply Bayesian inference to inferring gene regulatory networks (GRNs). A GRN is a hierarchical network with regulatory proteins, target genes, and the interactions between them [3], playing a key role in mediating cellular functions and signaling pathways in cells [4]. Accurate inference of a GRN using data specific to a disease returns disease-associated regulatory proteins and genes, which serve as potential targets for drug treatment [5]. In recent years, noncoding DNA analysis has revealed more and more noncoding regions with strong regulatory effects on gene transcription [6], which greatly expands the scope of GRN research.

GRN analysis requires an integration of multiple types of measurements, including but not limited to gene expression, chromatin accessibility, transcription factor binding, methylation, and histone modification [7]. The challenge of GRN inference is that there exist hundreds of proteins and tens of thousands of genes. One protein can regulate hundreds of target genes, and their regulatory relationship (an interaction in the GRN) may vary across different cell types, tissues, or diseases. Experiments for high-throughput target gene measurement of one protein in one specific condition are costly and noisy [8], let alone for hundreds of proteins under diverse conditions. For many tissues or diseases, we need to integrate multiple relevant data types and computationally infer GRNs specific to those conditions.

Bayesian inference is particularly suitable for GRN inference as it is very flexible for large-scale data integration. Moreover, when we have multiple datasets generated under very similar conditions, estimating variables by learning distributions rather than making single-value predictions makes the final estimation more robust and easier to compare across datasets. We demonstrated this using two breast cancer datasets generated under very similar conditions, in which we also compared a hierarchical Bayesian model with several competing methods. Moreover, using patient data as model input, although noisy, we successfully identified a GRN associated with breast cancer recurrence. Finally, we discussed the potential of Bayesian deep learning for large-scale and complex GRN inference.

#### **2. Gene regulatory networks**

The human genome can be simply divided into coding (exome) and noncoding regions. The process of producing an RNA copy from the exome is called transcription, which can be quantitatively measured using microarray or RNA-seq techniques [9, 10], producing gene expression data for about 30,000 genes simultaneously. The transcription process is mediated by regulatory regions located in the noncoding genome, including promoters and enhancers [11]. Promoters are proximal to gene transcription starting sites (TSS), usually within 3 kbp (**Figure 1A**), while enhancers are usually located distantly, e.g., 200 kbp away (**Figure 1B**), and can be up to 1 Mbp away. In general, each gene can be associated with one promoter and multiple enhancers.

Transcription factors (TFs), a special category of proteins, often coordinate with each other as cis-regulatory modules (CRMs) [12] and co-bind at regulatory regions [13]. For example, in **Figure 1A** or **B**, there are three TFs binding at promoter or enhancer regions and functioning together as one CRM to mediate the transcription process of their target genes.


*Bayesian Inference on Complicated Data*

**Figure 1.**
*Illustration of gene regulation: (A) transcription factor (TF)-gene regulation through proximal promoter regions; (B) TF-gene regulation through distal enhancer regions; (C) a gene regulatory network (GRN) including TFs, genes, and their interactions; (D) regulatory effects of TFs on individual genes with "red" as activation, "blue" as depression, and "white" as no regulatory effects; (E) a heatmap of TF protein activities across biological samples of multiple conditions with "red" as enhanced activity, "green" as reduced activity, and "black" as no activity; and (F) a heatmap of gene expression across multiple samples, with "red" as up-regulated, "green" as down-regulated, and "black" as no change.*

It has been known that the association relationships among TFs are not random [14, 15]. Some TFs tend to co-bind at the same regions more often than with others, e.g., MYC and MAX. One TF can regulate multiple genes, and a target gene can also be regulated by multiple TFs, considering the existence of CRMs (**Figure 1C**). For each specific TF-gene interaction in **Figure 1C**, its regulatory effect can be either positive (activating gene expression) or negative (depressing gene expression), as shown in **Figure 1D**. The protein activities of TFs are therefore connected to the dynamic changes of gene expression across multiple samples [13]. To accurately identify GRNs, we need quantitative measures of all types of signals in **Figure 1D**–**F**. However, due to technical limitations, we can obtain good-quality measurements of gene expression and binary measurements (existence or not) of individual TF-gene interactions, albeit with a high false-positive rate, but no measurements of TF activities. To infer GRNs, we must jointly estimate TF activities, TF-gene regulation strengths, and CRMs (TF associations) given gene expression observations.

#### **3. Bayesian inference**

Bayesian inference is particularly suitable for inferring GRNs, as it learns a posterior distribution for each variable, with a high tolerance of the noise existing in the gene expression data or caused by imperfect prior assumptions.

#### **3.1 A hierarchical Bayesian model**

Given gene expression data under multiple biological samples (conditions), we focus on the variation of each gene's expression from its baseline expression, because such variation reflects the effects of condition changes. For a specific disease, only genes showing significant expression changes between disease cells and normal cells are interesting candidates. Thus, for gene *n*, we calculate the log fold change of gene expression under each sample (1, 2, 3, …, *M*) relative to that of the baseline condition (0). To model the gene expression data of hundreds of genes in the same framework, for gene *n*, we normalize its *M* log fold change values (indexed by *m*) to values with 0 mean and 1 standard deviation, denoted by *yn,m*. Then, a linear model is applied to modeling *yn,m* as follows [16, 17]:

$$y_{n,m} = \sum_{t} a_{n,t} b_{n,t} x_{t,m} + \varepsilon_n, \tag{1}$$

where variable *an,t* denotes the regulation strength of TF *t* on gene *n*; *bn,t* is a binary variable denoting the regulation occurrence of TF *t* on gene *n*; the TF protein activity variable *xt,m* under condition (sample) *m* directly connects to gene expression *yn,m* under the same condition [16]; and the noise variable *εn* denotes the inaccuracy of gene expression measurements.

Regarding the hyperparameters of the prior mean and variance for *a* or *x*, a benefit of assuming a 0-mean prior is to control model overfitting: only when the posterior distribution has a significantly non-zero mean will we accept that estimation. It is hard to determine the scale of variable values without direct measurements. A conservative way is to assume non-informative priors on them and let the algorithm determine the final posterior distribution, although a non-informative prior will lead to a stickier chain and a posterior with potentially multiple modes. Exploring such a posterior is certainly more challenging than exploring a well-behaved unimodal posterior. However, there is really no need to trouble with this multimodal posterior on *a* or *x*, as the inferential targets of the whole framework are the discrete posterior distributions of CRMs. For each gene, the posterior distribution of CRMs learned from the data reveals which CRM(s) are regulating this gene. If there is more than one mode in the CRM posterior distribution, this gene will be associated with two or three CRMs. This is quite common in gene regulatory networks, as one gene can be regulated by CRMs at multiple regulatory regions.
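The generative side of Eq. (1) can be sketched under toy dimensions. Everything below (the dimensions, the sparsity level, and the simulated values of *a*, *b*, and *x*) is an illustrative assumption; in practice these quantities are the unknowns to be inferred, and only *y* is observed.

```python
import random

# Toy sketch of Eq. (1): y[n][m] = sum_t a[n][t] * b[n][t] * x[t][m] + noise,
# with T TFs, N genes, and M samples. All values are simulated for illustration.
random.seed(0)
T, N, M = 3, 5, 4

# Binding occurrence b is binary; regulation strength a is 0 unless b = 1.
b = [[random.random() < 0.5 for t in range(T)] for n in range(N)]
a = [[random.gauss(0, 1) if b[n][t] else 0.0 for t in range(T)]
     for n in range(N)]
# TF activities x vary across the M samples.
x = [[random.gauss(0, 1) for m in range(M)] for t in range(T)]

# Observed (normalized) expression: regulatory signal plus unit-variance noise.
y = [[sum(a[n][t] * b[n][t] * x[t][m] for t in range(T)) + random.gauss(0, 1)
      for m in range(M)] for n in range(N)]

print(len(y), len(y[0]))  # 5 4
```

Running the model "forward" like this is also a standard way to debug an inference algorithm: simulate *y* from known *a*, *b*, *x*, then check that the sampler recovers them.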

Given protein-DNA binding measurements of *T* TFs and *N* genes (e.g., from the ENCODE database), we are able to identify TF binding sites at promoter or enhancer regions within 1 Mbp around individual target genes [18]. Each gene can be associated with several regulatory regions, and at each region there exists a subset of TFs, forming a candidate CRM. Then, we may observe multiple candidate CRMs (in total *Kn*) for gene *n*, indexed by *cn* = 1, 2, 3, …, *k*, …, *Kn*. Each *cn* is associated with a unique set of TF-gene binding events (*bcn,t* = 1 or *bcn,t* = 0). We treat *cn* as a hidden variable controlling how the binding variables are associated with each other, with its candidate space defined from existing databases.

To estimate the abovementioned variables, we develop a hierarchical Bayesian network to model their internal dependency and their associations with gene expression, as shown in **Figure 2**. The CRM variable *c* controls the state of each binding variable *b*. For *b* = 1, the regulation strength *a* can be either positive or negative, denoting gene activation or depression by the binding TF. Meanwhile, through TF-gene regulation, the protein activities of TFs are directly connected to target gene expression, with *ε* denoting the measurement noise in the gene expression data. With Eq. (1) and **Figure 2**, we aim to estimate all these variables using Bayesian inference, which requires a prior assumption (not necessarily informative) on the distribution of each variable.

Based on prior binding observations from public databases, the candidate space of CRMs is known, denoted by **C**. Given a gene expression dataset generated from a specific condition, for gene *n*, we need to estimate which CRM *cn* is regulating its expression. As the prior data do not tell which CRM is more likely to be true under a specific condition, we assume a discrete uniform prior on *c*.
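Because *c* is discrete with a uniform prior, its posterior given the other variables is obtained by normalizing the likelihood of each candidate CRM. A minimal sketch, with made-up candidate predictions and unit-variance Gaussian noise as assumed later in the chapter:

```python
import math

# Hedged sketch: posterior over a discrete CRM variable c under a uniform
# prior. Each candidate CRM implies a prediction of gene n's expression;
# with Gaussian noise of variance 1, p(c | y) is proportional to p(y | c).
# The observations and candidate predictions below are purely illustrative.

def crm_posterior(y_n, predictions):
    """Normalize the Gaussian likelihood of each candidate CRM into a posterior."""
    logliks = []
    for pred in predictions:
        ll = sum(-0.5 * (y - p) ** 2 - 0.5 * math.log(2 * math.pi)
                 for y, p in zip(y_n, pred))
        logliks.append(ll)
    mx = max(logliks)                       # log-sum-exp for numerical stability
    weights = [math.exp(ll - mx) for ll in logliks]
    z = sum(weights)
    return [w / z for w in weights]

y_n = [1.0, -1.0, 0.5]                      # observed normalized expression
candidates = [[0.9, -1.1, 0.4],             # CRM 1: fits the data well
              [0.0, 0.0, 0.0]]              # CRM 2: explains nothing
post = crm_posterior(y_n, candidates)
print([round(p, 3) for p in post])          # mass concentrates on CRM 1
```

Inside a Gibbs sampler, one would draw a new value of *cn* from this discrete posterior at each iteration rather than just reporting it.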

Based on data observation, *y* has a Gaussian-like distribution with 0-mean and 1-standard deviation. The gene expression noise component *ε* can be assumed to

*Bayesian Inference of Gene Regulatory Network DOI: http://dx.doi.org/10.5772/intechopen.88799*

#### **Figure 2.**


*A hierarchical Bayesian framework for GRN modeling. The number of variables in this framework depends on the numbers of biological samples,TFs, genes, and candidate CRMs. Given gene expression data under different conditions, for the same TF and same gene, their regulatory relationship (variable b) may have very different regulatory strength (variable a). And the TF activity (variable x) can be significantly different as well. Therefore, GRNs are highly context-specific.*

follow a 0-mean Gaussian distribution as well, denoted by *N* 0*; σ*<sup>2</sup> *ε* . Although the variance of noise is hard to determine, it should fall in the same scale as gene expression measurements. Therefore, we set *σ*<sup>2</sup> *<sup>ε</sup>* ¼ 1.
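The generative model in Eq. (1) can be sketched as a short forward simulation. This is an illustrative sketch only: the toy dimensions are made up, and the prior variances (10 for *a*, 100 for *x*, adopted later in this section) are used here simply to generate plausible values; none of this is code from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T TFs, N genes, M conditions (samples).
T, N, M = 5, 20, 10

a = rng.normal(0.0, np.sqrt(10.0), size=(N, T))    # regulation strengths a_{n,t}
b = (rng.random(size=(N, T)) < 0.2).astype(float)  # sparse binary occurrences b_{n,t}
x = rng.normal(0.0, np.sqrt(100.0), size=(T, M))   # TF activities x_{t,m}
eps = rng.normal(0.0, 1.0, size=(N, M))            # expression noise, sigma_eps^2 = 1

# Eq. (1): y_{n,m} = sum_t a_{n,t} * b_{n,t} * x_{t,m} + eps_{n,m}
y = (a * b) @ x + eps
```

The sparsity of `b` reflects the assumption below that most *a* values are 0 in a sparse GRN.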

The regulation strength variable *a* is conditional on the state of *b* (as shown in **Figure 2**): for *b* = 0, we set *a* = 0, denoting the nonexistence of TF-gene regulation; for *b* = 1, *a* can be either positive or negative, so we assume a 0-mean Gaussian prior on *a*, as *N*(0, *σ*<sup>2</sup>*a,prior*) (the variance *σ*<sup>2</sup>*a,prior* is a hyperparameter). As a GRN is a sparse network, most *a* values would be 0.

We model TF activity *x* under multiple biological samples using Gaussian random processes. As baseline expression is largely removed from gene expression data during the data normalization process, ideally the baseline activity of each TF is 0. In each sample, *x* can be either enhanced or reduced with respect to its baseline activity. Thus, we assume a 0-mean Gaussian prior for *x*, as *N*(0, *σ*<sup>2</sup>*x,prior*) (the variance *σ*<sup>2</sup>*x,prior* is also a hyperparameter).

Regarding the hyperparameters of the prior mean and variance for *a* or *x*, a benefit of assuming a 0-mean prior is to control model overfitting: only when the posterior distribution has a significantly non-zero mean do we accept that estimation. It is hard to determine the scale of variable values without direct measurements. A conservative way is to assume a non-informative prior on them and let the algorithm determine the final posterior distribution, although the non-informative prior will lead to a stickier chain and a posterior with potentially multiple modes. Exploring such a posterior is certainly more challenging than exploring a well-behaved unimodal posterior. However, there is really no need to worry about a multimodal posterior on *a* or *x*, as the inferential targets of the whole framework are the discrete posterior distributions of CRMs. For each gene, the posterior distribution of CRMs learned from the data reveals which CRM(s) are regulating this gene. If there is more than one mode in the CRM posterior distribution, this gene will be associated with two or three CRMs. This is quite common in gene regulatory networks, as one gene can be regulated by CRMs at multiple regulatory regions simultaneously. *σ*<sup>2</sup>*a,prior* and *σ*<sup>2</sup>*x,prior* should be significantly larger than the variance of the gene expression data to allow a "large" space for the algorithm to generate posterior distributions. As *y* is already normalized with variance 1, we set *σ*<sup>2</sup>*a,prior* = 10 and *σ*<sup>2</sup>*x,prior* = 100.

The problem of GRN inference can then be formulated in Bayesian terms as estimating the posterior distributions of **A** = {*acn,t*}, **B** = {*bcn,t* | *bcn,t* = 0 or 1}, and **X** = {*xt,m*}, given **Y** = {*yn,m*}. Considering the dependence relationships of all variables in **Figure 2**, we define the joint posterior probability as follows:

$$\begin{aligned}
P(\mathbf{A}, \mathbf{B}, \mathbf{X} \mid \mathbf{Y}) \propto{}& P(\mathbf{Y} \mid \mathbf{A}, \mathbf{B}, \mathbf{X}) \times P(\mathbf{A}) \times P(\mathbf{C}) \times P(\mathbf{X}) \\
\propto{}& \prod_{n} \prod_{m} \sigma_{\varepsilon}^{-1} \exp\left(-\frac{1}{2\sigma_{\varepsilon}^{2}} \Big(y_{n,m} - \sum_{t} a_{c_n,t}\, b_{c_n,t}\, x_{t,m}\Big)^{2}\right) \\
& \times \prod_{n} \prod_{t} \sigma_{a,prior}^{-1} \exp\left(-\frac{a_{c_n,t}^{2}}{2\sigma_{a,prior}^{2}}\right) \\
& \times \prod_{n} \frac{1}{K_{n}} \\
& \times \prod_{t} \prod_{m} \sigma_{x,prior}^{-1} \exp\left(-\frac{x_{t,m}^{2}}{2\sigma_{x,prior}^{2}}\right).
\end{aligned} \tag{2}$$

Estimating the joint distribution of the above-mentioned variables is difficult. Alternatively, we can approximate the joint posterior distribution by estimating the marginal distribution of each variable. To do so, we iteratively calculate each variable's conditional probability and perform Bayesian estimation using Gibbs sampling. The advantage of using Gibbs sampling is that it is theoretically guaranteed to converge to the posterior distribution [2, 19–21].


*Bayesian Inference of Gene Regulatory Network DOI: http://dx.doi.org/10.5772/intechopen.88799*


#### **3.2 Gibbs sampling**

We first sample TF activity variable *xt,m* for the TF *t* and sample *m*, according to its conditional probability (based on Eq. (2)) as follows (**Figure 3**):

$$P(x_{t,m} \mid \mathbf{Y}, \mathbf{A}, \mathbf{B}) \propto \prod_{n} \exp\left(-\frac{1}{2\sigma_{\varepsilon}^{2}} \Big(y_{n,m} - \sum_{t} a_{c_n,t}\, b_{c_n,t}\, x_{t,m}\Big)^{2} - \frac{x_{t,m}^{2}}{2\sigma_{x,prior}^{2}}\right). \tag{3}$$

*P*(*xt,m* | **Y**, **A**, **B**) is a Gaussian distribution with mean and variance as follows:

$$\begin{cases} \mu_{x} = \dfrac{\sigma_{x,prior}^{2} \sum_{n} \Big(y_{n,m} - \sum_{j \neq t} a_{c_n,j}\, b_{c_n,j}\, x_{j,m}\Big)\, a_{c_n,t}\, b_{c_n,t}}{\sigma_{x,prior}^{2} \sum_{n} a_{c_n,t}^{2}\, b_{c_n,t}^{2} + \sigma_{\varepsilon}^{2} N} \\[2ex] \sigma_{x}^{2} = \dfrac{\sigma_{\varepsilon}^{2} N \sigma_{x,prior}^{2}}{\sigma_{x,prior}^{2} \sum_{n} a_{c_n,t}^{2}\, b_{c_n,t}^{2} + \sigma_{\varepsilon}^{2} N} \end{cases} \tag{4}$$

As shown in Eq. (4), the estimated distribution of *xt,m* is conditional on the other TF activities *xj,m* (*j* ≠ *t*). Therefore, we iteratively sample *xt,m* as *xt,m* | *xj,m* (*j* ≠ *t*), one by one for *t* = 1, …, *T*, according to each individual posterior Gaussian distribution *N*(*μx*, *σ*<sup>2</sup>*x*).
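As a concrete illustration, the draw from Eq. (4) can be written as a small NumPy routine. This is a hedged sketch: the function name, argument layout, and default hyperparameters are ours, not the authors' implementation.

```python
import numpy as np

def sample_xtm(y_m, x_m, ab, t, sigma2_eps=1.0, sigma2_x_prior=100.0, rng=None):
    """Draw x_{t,m} from its Gaussian full conditional, Eq. (4).

    y_m : (N,) expression of all genes under sample m
    x_m : (T,) current TF activities under sample m
    ab  : (N, T) elementwise products a_{c_n,t} * b_{c_n,t}
    """
    if rng is None:
        rng = np.random.default_rng()
    N = y_m.shape[0]
    # Residual with TF t left out: y_{n,m} - sum_{j != t} a_{c_n,j} b_{c_n,j} x_{j,m}
    resid = y_m - ab @ x_m + ab[:, t] * x_m[t]
    denom = sigma2_x_prior * np.sum(ab[:, t] ** 2) + sigma2_eps * N
    mu = sigma2_x_prior * np.sum(resid * ab[:, t]) / denom
    var = sigma2_eps * N * sigma2_x_prior / denom
    return rng.normal(mu, np.sqrt(var))
```

A handy sanity check: with all binding products zero, the conditional collapses to the prior *N*(0, *σ*<sup>2</sup>*x,prior*).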


#### **Figure 3.**


*Bayesian Inference on Complicated Data*


*Gibbs sampling of CRMs, TF activities, and regulation strengths with prior TF-gene regulation and gene expression observations as input.*


Secondly, for gene *n*, for each *bcn,t* = 1, we estimate the associated regulation strength *acn,t* according to the following conditional probability:

$$P(a_{c_n,t} \mid \mathbf{Y}, \mathbf{X}, \mathbf{B}) \propto \prod_{m} \exp\left(-\frac{1}{2\sigma_{\varepsilon}^{2}} \Big(y_{n,m} - \sum_{t} a_{c_n,t}\, x_{t,m}\Big)^{2} - \frac{a_{c_n,t}^{2}}{2\sigma_{a,prior}^{2}}\right). \tag{5}$$

*P*(*acn,t* | **Y**, **X**, **B**) is a Gaussian distribution, too, with mean and variance calculated as follows:

$$\begin{cases} \mu_{a} = \dfrac{\sigma_{a,prior}^{2} \sum_{m} \Big(y_{n,m} - \sum_{j \neq t} a_{c_n,j}\, x_{j,m}\Big)\, x_{t,m}}{\sigma_{a,prior}^{2} \sum_{m} x_{t,m}^{2} + M \sigma_{\varepsilon}^{2}} \\[2ex] \sigma_{a}^{2} = \dfrac{\sigma_{a,prior}^{2} M \sigma_{\varepsilon}^{2}}{\sigma_{a,prior}^{2} \sum_{m} x_{t,m}^{2} + M \sigma_{\varepsilon}^{2}} \end{cases} \tag{6}$$

Similar to the estimation process for the TF activity variables, the posterior distribution of each *acn,t* also depends on the values of the other *acn,j* (*j* ≠ *t*). Thus, we iteratively sample *acn,t* for the TFs in module *cn* one by one according to each individual posterior Gaussian distribution *N*(*μa*, *σ*<sup>2</sup>*a*).
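The analogous draw for *acn,t* from Eq. (6) can be sketched the same way (again, the name and defaults are illustrative assumptions, not the authors' code):

```python
import numpy as np

def sample_acnt(y_n, x, a_n, t, sigma2_eps=1.0, sigma2_a_prior=10.0, rng=None):
    """Draw a_{c_n,t} from its Gaussian full conditional, Eq. (6).

    y_n : (M,) expression of gene n across samples
    x   : (T, M) current TF activities
    a_n : (T,) current regulation strengths of gene n (zero where b = 0)
    """
    if rng is None:
        rng = np.random.default_rng()
    M = y_n.shape[0]
    # Residual with TF t left out: y_{n,m} - sum_{j != t} a_{c_n,j} x_{j,m}
    resid = y_n - a_n @ x + a_n[t] * x[t]
    denom = sigma2_a_prior * np.sum(x[t] ** 2) + sigma2_eps * M
    mu = sigma2_a_prior * np.sum(resid * x[t]) / denom
    var = sigma2_a_prior * M * sigma2_eps / denom
    return rng.normal(mu, np.sqrt(var))
```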

Finally, with sampled TF activity and regulation strength variables, we sample the CRM variable *cn* for gene *n*. It is hard to assume a prior distributional shape for the joint distribution of the multiple binding variables in *cn*. In practice, *cn* has a finite number of states, *Kn*. Therefore, we can directly calculate a discrete conditional probability for each *cn* = *k* as follows:

$$\begin{aligned} P(c_{n} \mid \mathbf{y}_{n}, \mathbf{A}, \mathbf{X}) &\propto \prod_{t} P(\mathbf{y}_{n} \mid a_{c_n,t}, \mathbf{x}_{t})\, P(a_{c_n,t} \mid c_{n})\, P(\mathbf{x}_{t}) \\ &\propto \prod_{t} \exp\left(-\frac{1}{2\sigma_{\varepsilon}^{2}} \sum_{m} \Big(y_{n,m} - \sum_{t} a_{c_n,t}\, b_{c_n,t}\, x_{t,m}\Big)^{2} - \frac{a_{c_n,t}^{2}}{2\sigma_{a,prior}^{2}} - \frac{\sum_{m} x_{t,m}^{2}}{2\sigma_{x,prior}^{2}}\right) \end{aligned} \tag{7}$$

After calculating Eq. (7) for all possible values of *cn*, we sample one value according to the following discrete probability density function:

$$p(c_n = k) = \frac{P(c_n = k \mid \mathbf{y}_n, \mathbf{X}, \mathbf{A})}{\sum_{p} P(c_n = p \mid \mathbf{y}_n, \mathbf{X}, \mathbf{A})} \tag{8}$$
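In code, Eqs. (7) and (8) amount to normalizing the *Kn* candidate weights and drawing from the resulting categorical distribution. A sketch follows; working in log space for numerical stability is our implementation choice, not something the chapter prescribes.

```python
import numpy as np

def sample_crm(log_weights, rng=None):
    """Draw c_n from its discrete conditional, Eq. (8).

    log_weights : (K_n,) log of the unnormalized posterior weight of each
                  candidate CRM, i.e. the log right-hand side of Eq. (7).
    Returns the sampled index and the normalized probabilities.
    """
    if rng is None:
        rng = np.random.default_rng()
    w = np.exp(log_weights - np.max(log_weights))  # subtract max to avoid overflow
    p = w / w.sum()                                # normalization of Eq. (8)
    return rng.choice(len(p), p=p), p
```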


After sampling the TF activity (TFA), TF-gene regulation strength, and cis-regulatory module variables for all *N* genes, we update the binding states in matrix **B** according to the sampled CRMs of the individual genes and start the next round of sampling.

Convergence of the Gibbs sampling can be monitored based on the ratio (R) of within-variance to between-variance computed from multiple sequences with different initial states [22]. In each application, we ran five sequences of sampling in parallel. In the *i*-th round of sampling, for each variable we calculated the within-variance using samples 1 to *i* in each sequence and then took the mean of the variances from the five sequences. Meanwhile, we calculated the between-variance of the same variable using its sampled values in the *i*-th round across the five sequences. For each category of variables, the distribution of the ratio (R) between within-variance and between-variance is used to monitor the overall sampling convergence; when the sampler converges, values of R would be around 1. We monitor the sampling convergence for regulation strengths and TF activities, respectively. Once both of them converge, we start to accumulate samples of the TF-gene binding variables. As each TF-gene binding variable is binary, its sampling frequency represents the posterior probability of binding occurrence. Meanwhile, for each gene, a discrete posterior probability distribution over all associated candidate CRMs is inferred, the mode of which reveals the most likely regulatory region associated with the current gene.
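The monitoring statistic described above can be sketched as follows. This is a simplified illustration of the within/between variance ratio; the cited diagnostic [22] is the more elaborate Gelman-Rubin formulation, and the function name and toy chains here are our own.

```python
import numpy as np

def convergence_ratio(chains, i):
    """Ratio R of within- to between-sequence variance at sampling round i.

    chains : (n_chains, n_rounds) array of sampled values of one scalar variable
    """
    # Within-variance: variance of rounds 0..i in each chain, averaged over chains.
    within = np.mean(np.var(chains[:, : i + 1], axis=1))
    # Between-variance: variance of the round-i draws across the chains.
    between = np.var(chains[:, i])
    return within / between

# Five chains already sampling the same stationary distribution: R stays near 1.
rng = np.random.default_rng(42)
chains = rng.normal(0.0, 1.0, size=(5, 1000))
R = convergence_ratio(chains, 999)
```

With only five chains, the between-variance of a single round is itself noisy, which is why the text monitors the distribution of R over many rounds and variables rather than a single value.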

#### **4. Inferring GRNs for breast cancer**

#### **4.1 Application to in vitro breast cancer cell line data**

We first applied the hierarchical Bayesian model to gene expression data measured from in vitro breast cancer cell lines. We chose cell line data mainly because such data is usually clean and well suited for validating computational models. Here, we carefully selected two publicly available breast cancer cell line datasets measured independently but under the same condition (downloadable from the GEO database https://www.ncbi.nlm.nih.gov/geo/, with accession number GSE62789 for Data #1 and accession number GSE51403 for Data #2; in both, cells were treated with 17β-estradiol (E2) for 24 hours to stimulate breast cancer cell proliferation). The similarity between the two inferred GRNs can be used to evaluate the robustness of GRN inference methods.

For the prior TF-gene collection, we checked the ENCODE database (https://www.encodeproject.org/) and selected genome-wide binding profiles of 39 TFs, measured from the same breast cancer cell line. We collected candidate binding events by


examining TF binding signals at promoters and distantly associated enhancers for each gene. In total we collected 2,319 candidate TF-gene interactions (**Figure 4A**) between 39 TFs and 275 genes whose gene expression is consistently upregulated in both datasets when breast cancer cells are stimulated to proliferate rapidly (**Figure 4B** and **C**). We applied the hierarchical Bayesian model to the two gene expression datasets with the same prior settings, respectively. To monitor the convergence of the sampling process, we ran five sequences with different initial states and sampled 1000 times in each. As shown in **Figure 4D** and **E** (for Data #1), after 100 rounds of sampling, the model started to converge. The sampling frequency of each TF-gene interaction was taken as its posterior probabilistic weight. We extracted the top 500 most confident TF-gene interactions as the final GRN estimation for each dataset and then focused on the common interactions

Here, we specifically compared our approach with three competing methods (COGRIM [20], LASSO [23], and NARROMI [24]). COGRIM was a Bayesian inference approach without modeling of CRMs: it treated individual TF-gene binding events independently. Although this assumption lowered the model complexity, it made the model less robust against inaccuracy in the TF-gene binding prior. Moreover, COGRIM simply treated TF activity as an observed value by directly using TF mRNA expression. Although ideally the variation of mRNA transcription is proportional to the activity change of the mRNA-translated protein, in practice this correlation is very low in most studies using gene expression. These inaccurate assumptions brought considerable uncertainty into the modeling of gene expression data. LASSO used a linear regression model to integrate prior TF-gene interactions and gene expression data and predicted one value for each TF-gene interaction. The NARROMI approach inferred GRNs using gene expression data only, without any prior on TF-gene interactions; it also made a single-value prediction for each interaction based on the mutual information between gene and TF expression values. Theoretically, the Bayesian approach described in this chapter should be

#### **Figure 4.**

*Input breast cancer cell line data for GRN inference: (A) prior TF-gene interactions ("black" denotes binding occurrence); (B) heatmap of time-course gene expression data; (C) heatmap of steady-state gene expression, all data are from the same breast cancer cell line; (D) convergence of regulatory strength estimation using time-course gene expression data; and (E) convergence of TF activity estimation using time-course gene expression data.*

more robust in identifying GRNs. We applied the four competing methods to the above two datasets. Indeed, the GRNs identified using our Bayesian model were more consistent between the two related datasets (**Table 1**).

By analyzing the common 306 TF-gene interactions in **Table 1**, we identified two functional CRMs. The first CRM had five TFs including POL2A, TDRD3, MYC, MAX, and E2F1 (**Figure 5A**). The activities of these TFs, as inferred from both datasets, were shown in **Figure 5B** and **C**, respectively. In total there were 100 genes regulated by this module, and 60 of them were associated with breast cancer through literature survey (selected genes shown in **Figure 5D**). The second CRM had six TFs including ELF1, JUND, JUN, FOXA1, CTCF, and HDAC1. In total, there were 89 genes regulated by this module, and 51 of them were associated with breast cancer (selected genes shown in **Figure 5E**). COGRIM identified fewer genes for the first CRM and failed to identify the second CRM. For the other non-Bayesian approaches, as the number of common TF-gene interactions inferred from two


datasets was small, size reduced by over 75%. We did not identify the two key

*Breast cancer recurrence-associated GRN: (A) heatmap of gene expression in breast cancer cell lines including MCF7, MIII, LCC1, and LCC9, where "red" represents overexpression and "green" represents lower expression; (B) heatmap of gene expression of breast cancer patients in "Early recurrence" and "Late recurrence" groups, divided by 5-year survival; (C) binding sites of 11 TFs on 32 target genes; and (D) association of 5 CRMs and*

We finally applied the Bayesian approach to breast cancer patient data downloaded from the TCGA database (https://portal.gdc.cancer.gov/). Survival time distribution of 93 breast cancer patients treated by *tamoxifen* revealed two modes with 5-year survival as division. Accordingly, we defined an "Early recurrence" group including patients with survival time <5 years and a "Late recur-

Differentially expressed genes between two groups (t-test p-value <0.05) were selected for further GRN analysis. It can be seen from **Figure 6B** that the gene expression data of breast cancer patient is quite noisy. To increase the robustness of GRN results, we used another cell line dataset. Specifically, gene expression data was generated from four cell lines including MCF7, MIII, LCC1, and LCC9, with three replicates for each. MCF7 cells were sensitive to *tamoxifen* treatment, while LCC9 cells were drug-resistant. One hypothesis is that breast cancer recurrence is associated with drug resistance. Thus, we expected that the overexpressed genes in the "Early recurrence" group were also overexpressed in LCC9 cells. For 431 genes with such expression pattern in both patient and cell line data, we collected prior TF-gene interactions from 39 TF binding profiles used in previous sections. We, respectively, inferred GRNs using both datasets and identified a common GRN including interactions between 25 proteins and 161 genes. Analysis of this common CRN revealed 5 key CRMs with 11 proteins and 32 target genes highly relevant to

Recent technology advance in single-cell gene transcription makes it feasible to study TF-gene regulation during the cell differentiation process [25]. In sections

rence" group including patients with survival time longer than 5 years.

CRMs using either approach.

**Figure 6.**

*32 target genes.*

breast cancer recurrence (**Figure 6**).

**5.1 Gene regulatory networks in different cell states**

**5. Discussion**

**73**

**4.2 Application to breast cancer patient data**

*Bayesian Inference of Gene Regulatory Network DOI: http://dx.doi.org/10.5772/intechopen.88799*

#### **Table 1.**

more robust to identify GRNs. We applied the four competing methods to the above two datasets. Indeed, GRNs identified using our Bayesian model were more consistent between the two related datasets (**Table 1**).

| Methods | GRN edges in Data #1 | Similarity with other methods | GRN edges in Data #2 | Similarity with other methods | Common GRN for Data #1 and #2 |
|---|---|---|---|---|---|
| **Bayesian** | 500 | **0.878\*\*\*** | 413 | **0.822\*\*\*** | **306\*\*\*** |
| COGRIM | 516 | 0.798 | 457 | 0.696 | 239 |
| LASSO | 565 | 0.486 | 510 | 0.533 | 74 |
| NARROMI | 514 | 0.519 | 591 | 0.516 | 44 |

**Table 1.**
*Comparison of methods for robust GRN inference. \*\*\*denotes hypergeometric p-value < 0.001.*

*Bayesian Inference of Gene Regulatory Network DOI: http://dx.doi.org/10.5772/intechopen.88799*

**Figure 5.**

*Key CRMs inferred from breast cancer cell line data: (A) CRM #1 and its TF components; (B) estimated TF activities from Data #1 (time course); (C) estimated TF activities from Data #2 (steady state); (D) target genes regulated by the CRM with MAX, MYC, E2F1, POL2A, and TDRD3; (E) target genes regulated by the CRM with ELF1, JUND, JUN, FOXA1, CTCF, and HDAC2. Target genes in D and E are associated with breast cancer, as supported by a literature survey. A "blue" block represents genes appearing in at least two publications, while a "green" block represents genes with one supporting publication.*

By analyzing the 306 common TF-gene interactions in **Table 1**, we identified two functional CRMs. The first CRM had five TFs: POL2A, TDRD3, MYC, MAX, and E2F1 (**Figure 5A**). The activities of these TFs, as inferred from the two datasets, are shown in **Figure 5B** and **C**, respectively. In total, 100 genes were regulated by this module, and 60 of them were associated with breast cancer through a literature survey (selected genes shown in **Figure 5D**). The second CRM had six TFs: ELF1, JUND, JUN, FOXA1, CTCF, and HDAC2. In total, 89 genes were regulated by this module, and 51 of them were associated with breast cancer (selected genes shown in **Figure 5E**). COGRIM identified fewer genes for the first CRM and failed to identify the second CRM. For the other, non-Bayesian approaches, the number of common TF-gene interactions inferred from the two datasets was small (reduced by over 75%), and we did not identify the two key CRMs using either approach.

*Bayesian Inference on Complicated Data*

**Figure 6.**

*Breast cancer recurrence-associated GRN: (A) heatmap of gene expression in breast cancer cell lines including MCF7, MIII, LCC1, and LCC9, where "red" represents overexpression and "green" represents lower expression; (B) heatmap of gene expression of breast cancer patients in the "Early recurrence" and "Late recurrence" groups, divided by 5-year survival; (C) binding sites of 11 TFs on 32 target genes; and (D) association of 5 CRMs and 32 target genes.*
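The hypergeometric significance marked by \*\*\* in **Table 1** can be computed with the standard library alone. The sketch below is illustrative rather than the chapter's code; in particular, the universe of 5,000 candidate TF-gene pairs is a hypothetical stand-in for the actual prior network size:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) for a hypergeometric draw: N edges sampled from a
    universe of M candidate edges that contains n 'reference' edges."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, upper + 1)) / total

# Illustrative (hypothetical) numbers: 5,000 candidate TF-gene pairs,
# 500 edges inferred from Data #1, 413 from Data #2, and 306 edges in
# common (cf. the Bayesian row of Table 1).
p = hypergeom_sf(306, 5000, 500, 413)
print(p < 0.001)  # a large overlap is highly significant
```

The same tail probability applies to each starred cell; an overlap this large is essentially impossible under random edge selection.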

#### **4.2 Application to breast cancer patient data**

We finally applied the Bayesian approach to breast cancer patient data downloaded from the TCGA database (https://portal.gdc.cancer.gov/). The survival time distribution of 93 breast cancer patients treated with *tamoxifen* revealed two modes, divided at 5-year survival. Accordingly, we defined an "Early recurrence" group of patients with survival time <5 years and a "Late recurrence" group of patients with survival time longer than 5 years. Differentially expressed genes between the two groups (t-test p-value <0.05) were selected for further GRN analysis. It can be seen from **Figure 6B** that the gene expression data of breast cancer patients are quite noisy. To increase the robustness of the GRN results, we used another cell line dataset. Specifically, gene expression data were generated from four cell lines, MCF7, MIII, LCC1, and LCC9, with three replicates each. MCF7 cells were sensitive to *tamoxifen* treatment, while LCC9 cells were drug-resistant. One hypothesis is that breast cancer recurrence is associated with drug resistance; thus, we expected genes overexpressed in the "Early recurrence" group to be overexpressed in LCC9 cells as well. For the 431 genes with this expression pattern in both the patient and cell line data, we collected prior TF-gene interactions from the 39 TF binding profiles used in previous sections. We inferred GRNs from the two datasets separately and identified a common GRN including interactions between 25 proteins and 161 genes. Analysis of this common GRN revealed 5 key CRMs with 11 proteins and 32 target genes highly relevant to breast cancer recurrence (**Figure 6**).
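The gene selection step above can be sketched as follows. This is an illustration, not the chapter's code: the toy expression values are invented, and a plain t-statistic cutoff stands in for the t-test p-value <0.05 criterion used in the chapter:

```python
from statistics import mean, stdev

def welch_t(x, y):
    """Welch's two-sample t-statistic between two expression vectors."""
    vx, vy = stdev(x) ** 2 / len(x), stdev(y) ** 2 / len(y)
    return (mean(x) - mean(y)) / (vx + vy) ** 0.5

# Hypothetical toy data: per-gene expression for the "Early recurrence"
# (<5-year survival) and "Late recurrence" patient groups.
early = {"GENE_A": [5.1, 5.4, 5.0, 5.3], "GENE_B": [2.0, 2.2, 1.9, 2.1]}
late  = {"GENE_A": [3.0, 3.2, 2.9, 3.1], "GENE_B": [2.1, 1.9, 2.0, 2.2]}

# Keep genes whose |t| exceeds a cutoff (a stand-in for p < 0.05).
selected = [g for g in early if abs(welch_t(early[g], late[g])) > 3.0]
print(selected)  # → ['GENE_A']
```

In practice the retained genes would additionally be intersected with those overexpressed in the drug-resistant LCC9 cell line, as described above.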

#### **5. Discussion**

#### **5.1 Gene regulatory networks in different cell states**

Recent technological advances in single-cell transcription profiling make it feasible to study TF-gene regulation during the cell differentiation process [25]. In the sections above, TF-gene interactions are assumed to hold across multiple samples, and gene expression change is connected to the dynamic variation of TF activities across samples. Yet, at the single-cell level, gene expression measurements are very noisy, and their variation across cells may be partially disconnected from the dynamic changes of TF activities [26]. In that situation, the linear model in Eq. (1) will not work with such gene expression input. Moreover, during the cell differentiation process, we in fact have no prior knowledge of whether GRNs will hold or change between individual cell states, which means that TF-gene interaction change can be another causal factor in gene expression variation across cell states. To model GRNs individually for each cell state, we need to define more binding variables, which will make the estimation process considerably more complex.
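For concreteness, the following is a minimal sketch of the kind of linear forward model referred to as Eq. (1), in which gene expression is a binding-gated, weighted sum of TF activities. All variable names and numbers are hypothetical illustrations, not the chapter's code:

```python
# Hypothetical forward model in the spirit of Eq. (1): the expression of
# gene i in sample t is a weighted sum of the activities of the TFs that
# bind it.  b[i][j] is a 0/1 binding indicator, w[i][j] a regulatory
# strength, and a[j][t] the activity of TF j in sample t.
def predict_expression(b, w, a):
    genes, tfs = len(b), len(b[0])
    samples = len(a[0])
    return [[sum(b[i][j] * w[i][j] * a[j][t] for j in range(tfs))
             for t in range(samples)]
            for i in range(genes)]

b = [[1, 0], [1, 1]]           # gene 0 bound by TF0; gene 1 by both TFs
w = [[2.0, 0.0], [1.0, -1.5]]  # activation / repression strengths
a = [[0.5, 1.0], [1.0, 0.2]]   # TF activities over two samples
print(predict_expression(b, w, a))  # → [[1.0, 2.0], [-1.0, 0.7]]
```

Bayesian inference runs this model in reverse: given noisy expression measurements and prior binding evidence, it estimates posterior distributions over the binding indicators, strengths, and activities.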


Such cell state-specific GRNs will uncover the regulatory mechanisms that drive cell differentiation, which would be particularly useful for cancer treatment. If regulation changes at a very early cell state eventually lead to fast proliferation of cancer cells, we could specifically target those TFs, binding regions, or genes for cancer prevention. Currently, inference of cell-state-specific GRNs proceeds either through enrichment analysis of TF binding signals in each cell state [27] or through regression modeling of gene expression using matched measurements of regulatory region activities [28]. When single-cell expression measurements become more accurate, we expect the connection between gene expression and TF activities to still hold. Then, the model in Eq. (1), with proper improvement, can be used to infer cell-state-specific GRNs.

#### **5.2 Bayesian neural network**

Although theoretically there is no upper limit on the number of model parameters in the Bayesian framework (**Figure 2**), the more variables we have, the slower the convergence will be. Moreover, given a complex network with many states, the dependence among variables becomes hard to model, and the estimation process more easily gets stuck in a local state. In recent years, neural networks have been widely applied to variable estimation in complex systems. A neural network is an end-to-end system that mimics the human brain and tries to learn complex representations within the dataset to produce an output. Like conventional machine learning, deep neural networks make a single-value prediction for each model parameter, without measuring uncertainty. That means the model performance relies heavily on prediction accuracy, and even one overconfident decision can cause a serious problem. A Bayesian approach to neural networks naturally solves this problem by learning a distribution that accounts for the uncertainty in parameter estimates [29].

Unlike the Bayesian inference discussed in previous sections, inferring the model posterior in a Bayesian neural network is much more difficult because there are many parameters to estimate. Direct inference of the posterior distribution is hard, so approximations are often used, i.e., variational inference. The posterior can be modeled using a simple variational distribution, such as a Gaussian, whose parameters are fitted to approximate the true posterior as closely as possible by minimizing the Kullback-Leibler divergence between the variational distribution and the true posterior. In earlier sections, we demonstrated that modeling GRN variables with Gaussian distributions provided robust performance. To infer a large-scale GRN with thousands of genes and hundreds of TFs, a Bayesian neural network can be a solution in which the posterior distributions of all variables are approximated by Gaussians.
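As a toy illustration of the variational idea (not the chapter's implementation), the closed-form KL divergence between two univariate Gaussians can be minimized over the variational parameters. Here the "true posterior" is itself a Gaussian with made-up parameters, so a crude search recovers it exactly:

```python
from math import log

def kl_gaussian(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between two univariate Gaussians, the quantity that
    variational inference minimizes."""
    return log(sd_p / sd_q) + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5

# Hypothetical "true posterior": N(1.0, 0.5).
mu_p, sd_p = 1.0, 0.5

# Crude grid search over the variational parameters (mu, sd).
best = min(((kl_gaussian(m / 10, s / 10, mu_p, sd_p), m / 10, s / 10)
            for m in range(-20, 31) for s in range(1, 21)),
           key=lambda t: t[0])
print(best)  # → (0.0, 1.0, 0.5): the search recovers the true posterior
```

In a real Bayesian neural network the true posterior is intractable, the variational family is a Gaussian per weight, and the KL term is minimized by stochastic gradient methods rather than grid search.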

### **6. Conclusion**


In this chapter, we mathematically illustrated how Bayesian inference can be used to infer gene regulatory networks. Using several breast cancer-specific datasets, we demonstrated the effectiveness of Bayesian network modeling in biologically meaningful signal discovery, in comparison with linear regression methods. Potentially, Bayesian inference can be used to infer dynamic GRNs during cell differentiation using new types of gene expression data. For very large-scale GRN inference in complex systems, the large number of variables may degrade conventional Bayesian inference performance; Bayesian neural networks using variational inference can be a good solution.

### **Acknowledgements**

Funding for open access charge: Virginia Tech's Open Access Subvention Fund (VT OASF).

### **Author details**

Xi Chen1,2\* and Jianhua Xuan1

1 Bradley Department of Electrical and Computer Engineering, Virginia Tech, Arlington, VA, USA

2 Center for Computational Biology, Flatiron Institute, New York, NY, USA

\*Address all correspondence to: xichen86@vt.edu

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Schuster SC. Next-generation sequencing transforms today's biology. Nature Methods. 2008;**5**(1):16-18

[2] Chen X et al. CRNET: An efficient sampling approach to infer functional regulatory networks by integrating large-scale ChIP-seq and time-course RNA-seq data. Bioinformatics. 2018; **34**(10):1733-1740

[3] Barabasi AL, Oltvai ZN. Network biology: Understanding the cell's functional organization. Nature Reviews. Genetics. 2004;**5**(2):101-113

[4] Blais A, Dynlacht BD. Constructing transcriptional regulatory networks. Genes & Development. 2005;**19**(13): 1499-1511

[5] van 't Veer LJ et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;**415**(6871): 530-536

[6] Li W, Notani D, Rosenfeld MG. Enhancers as non-coding RNA transcription units: Recent insights and future perspectives. Nature Reviews. Genetics. 2016;**17**(4):207-223

[7] Bock C, Lengauer T. Computational epigenetics. Bioinformatics. 2008;**24**(1): 1-10

[8] Landt SG et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research. 2012;**22**(9):1813-1831

[9] Quackenbush J. Microarray data normalization and transformation. Nature Genetics. 2002;**32**(Suppl): 496-501

[10] Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews. Genetics. 2009;**10**(1):57-63

[11] Riethoven JJ. Regulatory regions in DNA: Promoters, enhancers, silencers, and insulators. Methods in Molecular Biology. 2010;**674**:33-42


[12] Chen X, Xuan J, Shi X, Shajahan-Haq AN, Hilakivi-Clarke L, Clarke R. A novel statistical approach to identify coregulatory gene modules. In: 2013 IEEE International Conference on Bioinformatics and Biomedicine; 2013. pp. 16-18

[13] Spitz F, Furlong EE. Transcription factors: From enhancer binding to developmental control. Nature Reviews. Genetics. 2012;**13**(9):613-626

[14] Wang J et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research. 2012;**22**(9):1798-1812

[15] Chen X, Shi X, Shajahan-Haq AN, Hilakivi-Clarke L, Clarke R, Xuan J. Statistical identification of co-regulatory gene modules using multiple ChIP-seq experiments. In: Presented at the International Conference on Bioinformatics Models, Methods and Algorithms (Bioinformatics); 2014

[16] Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP. Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences of the United States of America. 2003;**100**(26): 15522-15527

[17] Chen X, Xuan J, Wang C, Shajahan AN, Riggins RB, Clarke R. Reconstruction of transcriptional regulatory networks by stability-based network component analysis. IEEE/ ACM Transactions on Computational Biology and Bioinformatics. 2013;**10**(6): 1347-1358


[18] Chen X et al. ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles. Nucleic Acids Research. 2016; **44**(7):e65


[19] Sabatti C, James GM. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics. 2006;**22**(6):739-746

[20] Chen G, Jensen ST, Stoeckert CJ Jr. Clustering of genes into regulons using integrated modeling-COGRIM. Genome Biology. 2007;**8**(1):R4

[21] Shi X et al. mAPC-GibbsOS: An integrated approach for robust identification of gene regulatory networks. BMC Systems Biology. 2013;**7** (Suppl 5):S4

[22] Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;**7**(4): 457-472

[23] Qin J, Hu Y, Xu F, Yalamanchili HK, Wang J. Inferring gene regulatory networks by integrating ChIP-seq/chip and transcriptome data via LASSO-type regularization methods. Methods. 2014; **67**(3):294-303

[24] Zhang X et al. NARROMI: A noise and redundancy reduction technique improves accuracy of gene regulatory network inference. Bioinformatics. 2013;**29**(1):106-113

[25] Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology. 2016;**34**(11):1145-1160

[26] Raj A, van Oudenaarden A. Nature, nurture, or chance: Stochastic gene expression and its consequences. Cell. 2008;**135**(2):216-226

[27] Aibar S et al. SCENIC: Single-cell regulatory network inference and clustering. Nature Methods. 2017;**14**(11):1083-1086

[28] Cao J et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science. 2018;**361**(6409):1380-1385

[29] Crucianu M, Bone R, de Beauville JPA. Bayesian learning for recurrent neural networks. Neurocomputing. 2001;**36**:235-242


#### **Chapter 6**

## Patient Bayesian Inference: Cloud-Based Healthcare Data Analysis Using Constraint-Based Adaptive Boost Algorithm

*Shahid Naseem*

### **Abstract**

Cloud-based healthcare data are a form of distributed data over the internet. The internet has become the most vulnerable part of critical healthcare infrastructures. Healthcare data are considered sensitive information, which can reveal a lot about a patient. For healthcare data, beyond confidentiality, privacy and protection of data are very sensitive issues. Proactive measures such as early warning are required to reduce the risk of patient data violation. This chapter investigates the ability of Patient Bayesian Inference (PBI) to analyze network scenarios involving violation of patient data and to produce early warnings. Bayesian inference allows modeling the uncertainties that come with missing data, allows integrating data from remote nodes, and explicitly indicates dependence and independence. A constraint-based adaptive boost algorithm is used to demonstrate the performance of Patient Bayesian Inference on real-world healthcare datasets.

**Keywords:** Bayesian inference, healthcare, constraint-based learning, explicitly, adaptive

#### **1. Introduction**

Healthcare data have always been considered sensitive information, which can reveal a lot about a patient. This is why medical confidentiality prohibits a medical professional from disclosing information about a patient's case. If a physician does not have accurate information on a patient's health, it may lead to an inaccurate diagnosis and improper treatment. Data concerning health means personal data related to the physical or mental health of patients, including the provision of healthcare, which constitute real information about a patient's health. Sensitive health data require additional protection, as they can go to the core of a human being: healthcare data fall within a person's most intimate sphere, and unauthorized disclosure may lead to various forms of discrimination and violation of fundamental rights. The risk of data processing generally depends not on the contents of the data but on the context in which they are used [1].

The processing of healthcare data is likely to lead to violations of individual rights and interests. Patients' data are, by their nature, particularly sensitive in relation to fundamental rights and freedoms, and their processing could create significant risk to those rights and freedoms. In principle, processing of sensitive data is prohibited unless a suitable safeguard method is used to protect the data [2]. Derogating from the prohibition to process special categories of patient data, including health data, is allowed in the following cases [3]:

medical care altogether or will withhold information during a consultation. If a physician does not have accurate information about a patient's health, this may lead to an inaccurate diagnosis and improper treatment, which may lead to great harm to

*Patient Bayesian Inference: Cloud-Based Healthcare Data Analysis Using Constraint-Based…*

**Figure 1** shows a typical information flows in the healthcare network. Patient information serves as a range of purposes apart from diagnosis and treatment provision. Patient information could be used to improve efficiency within the healthcare system. Patient information could be shared with finance facilitators to justify payment of service rendered. Health service providers may share health information through improved service quality. Furthermore, these providers may share health information through Regional Services to facilitate care services in the

Credentialing is a vital process for all healthcare systems that must be performed to ensure that those healthcare workers who will be providing the clinical services are qualified to do so. The cloud-based healthcare system is capable to ensure patient safety and deliver an acceptable standard of care. While employing excellent medical staff is vital for success, the healthcare system must have to define the required minimum credentialing and privileging requirements to validate the competency of healthcare providers. In the classical systems, only hospitals used to perform credentialing, but our proposed system has capability to provide all

In this framework, we classify different modules based on the probability (i.e., trust level) of each provider in violating the patient's data in detail. Honestly, I cannot understand exactly what this statement means. Remote nodes (healthcare physicians, nurses, family members, and other authorized individuals) are different from main modules (patients, health service providers, finance facilitators, regional services, and evaluative decisions), and so it is necessary to make clear remote nodes and modules because the patient Bayesian model only evaluates the trusty of

healthcare facilities and also to perform credentialing [8].

the patient's health [6].

*DOI: http://dx.doi.org/10.5772/intechopen.91171*

regional areas [7].

**Figure 1.**

**81**

*Cloud-based healthcare system.*


#### **2. Risks in cloud-based healthcare data**

Cloud computing has many risks related to data confidentiality and data security. The data stored in the cloud are highly confidential, such as patient records. Most of time, data being stored or processed in cloud are in large numbers, and the cloud servers sometimes become lazy because of the computation that affects correctness of final result. Therefore, the computation has to be made transparent. Healthcare data mainly contain of large media files such as X-ray, CT scans, radiology, and other type of images and videos. Such files are called as the Electronic Health Records that are stored in distributed storage. Possibly, this patient perception is fueled by the fact that healthcare data may be disclosure to unauthorized person [4].

In order to secure the patient's sensitive data from unauthorized access, an appropriate encryption standard must be applied to the data stored in the cloud. This sensitive information is highly confidential and needs to be protected; putting everything in the cloud unencrypted is a big risk. Over the past four decades, considerable effort has gone into developing healthcare information security systems, and a great variety of commercially available programs assist clinicians with diagnosis, decision making, pattern recognition, medical reasoning, filtering, and so on, for both general and very specialized domains. If a healthcare system is not secured, an adversary could read, modify, and inject messages into the network. Such incorrect information, even when not introduced for nefarious reasons, can lead to serious consequences for patients, particularly for services such as remote healthcare monitoring, which rely on heterogeneous devices using a variety of communication rules. Most of the rules designed for cloud-based communication cannot be directly applied in a cloud-based healthcare network, where remote nodes have limited computation, processing, and communication rights [5].

*Patient Bayesian Inference: Cloud-Based Healthcare Data Analysis Using Constraint-Based… DOI: http://dx.doi.org/10.5772/intechopen.91171*

The existing techniques for healthcare data include pseudonymization (replacing the most identifying fields in a record) and encryption (encoding the data in such a way that only authorized remote institutions can access it). The existing safeguards are referred to as medical confidentiality or doctor-patient privilege, which prohibit a medical professional from disclosing information about a patient's case. This is an important obligation of the medical profession, as it creates trust between doctor and patient and an environment in which the patient feels comfortable. If a patient cannot trust a physician's discretion, he will not seek medical care at all, or will withhold information during a consultation. If a physician does not have accurate information about a patient's health, this may lead to an inaccurate diagnosis and improper treatment, which may cause great harm to the patient's health [6].

**Figure 1** shows the typical information flows in a healthcare network. Patient information serves a range of purposes apart from diagnosis and treatment provision. It can be used to improve efficiency within the healthcare system and can be shared with finance facilitators to justify payment for services rendered. Health service providers may share health information to improve service quality. Furthermore, these providers may share health information through Regional Services to facilitate care services in the regional areas [7].

Credentialing is a vital process for all healthcare systems; it must be performed to ensure that the healthcare workers who will provide clinical services are qualified to do so. A cloud-based healthcare system is capable of ensuring patient safety and delivering an acceptable standard of care. While employing excellent medical staff is vital for success, the healthcare system must define the minimum credentialing and privileging requirements needed to validate the competency of healthcare providers. In classical systems, only hospitals performed credentialing, but our proposed system is able to perform credentialing for all healthcare facilities [8].

In this framework, we classify the different modules based on the probability (i.e., trust level) of each provider violating the patient's data. Remote nodes (healthcare physicians, nurses, family members, and other authorized individuals) are distinct from the main modules (patients, health service providers, finance facilitators, regional services, and evaluative decisions), and it is necessary to keep remote nodes and modules clearly separated, because the patient Bayesian model only evaluates the trustworthiness of the remote nodes and of the whole network, using service level expectation (SLE) as evidence: "we take advantage of the nature of the Bayesian inference to calculate the probability of wireless communication between the healthcare system and its remote institutions" [9].

**Figure 1.** *Cloud-based healthcare system.*

Data processing could create significant risk to a patient's rights and freedoms. In principle, processing of sensitive data is prohibited unless a suitable safeguard method is used to protect the data [2]. Derogating from the prohibition, processing of special categories of patient data, including health data, is allowed in the following cases [3]:

• Explicit consent is given by the data subject.

• Processing is necessary to protect the vital interest of a patient who is physically or legally incapable of giving consent, for example, in emergency situations or with minors.

• Processing is necessary in order to provide healthcare, if the data are processed by or under the responsibility of a professional subject to the obligation of professional secrecy.



#### **3. Problem statement**

The healthcare system has become the foundation for managing patients' data, supporting wireless communication for decision making and the logical functionality of remote institutions such as health physicians, nurses, family members, and authorized individuals. Conventional healthcare systems use various encryption methods to secure patients' data. Observing the limitations of the existing encryption methods, we take advantage of the nature of the Bayesian inference to calculate the probability of wireless communication between the healthcare system and its remote institutions. The dynamics of the cloud environment require the healthcare system to self-adapt, to be aware of its surrounding environment's changing parameters, and to create new rules based on past experience. To eliminate the problem of repetition in the cloud environment, the security algorithm must respect the remote institutions' limitations while providing a high level of data protection. The constraint-based adaptive boost algorithm has progressed to advanced-level data analysis for cloud-based healthcare systems. Implementing patient Bayesian inference for a cloud-based healthcare system makes it possible to demonstrate its performance on real-world patients' datasets. Protection of patients' sensitive data is one of the main obstacles to the growth of cloud computing in the health field, because of the need for a high level of data integration, interoperability, and sharing among healthcare institutions. It is necessary to create standard guidelines and identify security challenges for improving information security in the healthcare system. There are multiple remote institutions (nodes) that have to deal with healthcare data, such as healthcare physicians, nurses, family members, and other authorized individuals. Similarly, within the healthcare system, there are multiple entities that have to deal with healthcare data, such as healthcare providers, hospital administration staff, finance providers, and patients themselves. Cloud services suffer from certain vulnerabilities [10]. By contrast, a Bayesian model, as an uncertain-reasoning tool, is more efficient for dynamic trust evaluation. In this research, Bayesian inference combined with a cloud model and a Bayesian network is proposed.

#### **4. Patient Bayesian inference**

In cloud-based healthcare systems, patients' electronic data have been widely adopted to improve the quality of patient care and increase the productivity and efficiency of healthcare delivery. In cloud-based systems, patients' data can help resolve many of the existing problems associated with disease diagnosis while maintaining the privacy and sensitivity of the patient's medical information. PBI can be beneficial in the healthcare system for tracking fatigue using multi-armed bandits, which assist doctors during treatment by dynamically taking more samples from those treatments that are most likely to be the best. PBI may help doctors better understand the patient's data and make decisions based on it. Because of the security of cloud computing, outcomes can be measured in real time rather than waiting for enough data to accumulate. Recently, health data privacy has become an important issue in cloud-based healthcare systems.
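The multi-armed-bandit idea mentioned above can be sketched with Thompson sampling over Bernoulli treatment outcomes. This is an illustrative sketch, not the chapter's implementation: the treatment success rates and the `thompson_pick` helper below are invented for the example.

```python
import random

def thompson_pick(successes, failures):
    """Draw one sample from each treatment's Beta posterior and pick the
    treatment with the largest draw, so apparently better treatments are
    sampled more often (hypothetical helper for illustration)."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

random.seed(0)                      # fixed seed for a reproducible sketch
true_rates = [0.3, 0.5, 0.7]        # invented treatment success rates
wins, losses, counts = [0, 0, 0], [0, 0, 0], [0, 0, 0]
for _ in range(2000):
    arm = thompson_pick(wins, losses)
    counts[arm] += 1
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
print(counts)  # the best treatment (index 2) receives the most samples
```

After enough rounds, the posterior for the best treatment concentrates and it is selected almost exclusively, which is the "dynamically taking more samples from the best treatments" behavior described above.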


As a result, data mining techniques such as swapping attribute values, principal component analysis-based techniques, and adding random components have gained much more attention in healthcare data analysis [11].

In a healthcare system, PBI is an extremely powerful set of tools that uses prior knowledge or beliefs to calculate the probability of biomedical and healthcare events, statistics, and the Service Level Expectation (SLE). PBI maps our understanding of a problem and the observed data into a quantitative measure of how certain we are of a particular fact, in probabilistic terms, where the probability of a proposition simply represents a degree of belief in the truth of that proposition. PBI can also be used as a data mining technique for analyzing healthcare network variables, virtual assistants, and other analytics [12]. PBI uses data as evidence that certain facts are more likely than others: the prior distribution reflects our belief before seeing any data, whereas the posterior distribution reflects our belief after we have considered all the evidence.

Cloud-based PBI consists of five main modules: (i) patients; (ii) health service providers; (iii) finance facilitators; (iv) regional services; and (v) traditional and evaluative decisions, plus four submodules: (a) employers; (b) pharmacists; (c) regional health organizations; and (d) business associates. In this framework, we classify the different modules based on the probability (i.e., trust level) of each provider violating the patient's data. Bayes' rule allows calculating the posterior probability of any information-violation event as a hypothesis (*H*) based on a set of historical data (*D*):

$$P(H|D) = \frac{P(D|H)P(H)}{P(D)}\tag{1}$$

where *P*(*H*|*D*) is the posterior probability of *H* given the data *D*; *P*(*H*) is the prior probability of *H*; *P*(*D*|*H*) is the likelihood of the data *D* given *H*; and *P*(*D*) is the marginal probability of the data, regardless of whether *H* is true. In the cloud-based healthcare system, we use Bayes' rule to find the probability function as in Eq. (2):

$$P(\text{SLE}, \text{healthcare system}, \text{remote nodes}) = P(\text{SLE} \mid \text{healthcare system}, \text{remote nodes}) \, P(\text{remote nodes} \mid \text{healthcare system}) \, P(\text{healthcare system}) \tag{2}$$

SLE stands for service level expectation. In a cloud-based healthcare system, the SLE describes the quality of service provided to the remote nodes. It can be any variable that is sufficiently relevant to the service and can be quantitatively and objectively measured, and it strengthens the processes that improve outcomes. In Bayesian inference, our initial beliefs are represented by the prior distribution *P*(healthcare system), as shown in **Figure 2**.
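As a brief numeric illustration of Eq. (1), the posterior for an information-violation hypothesis can be computed directly. All the numbers and the `posterior` helper below are invented for the example, not taken from any dataset in this chapter:

```python
# Hypothetical numbers for illustration only: prior belief that a provider
# violates patient data, and the likelihood of an observed access-log pattern D
# under each hypothesis. Bayes' rule (Eq. 1) then yields the posterior.
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """P(H|D) = P(D|H)P(H) / P(D), with P(D) expanded by total probability."""
    p_d = likelihood_d_given_h * prior_h + likelihood_d_given_not_h * (1 - prior_h)
    return likelihood_d_given_h * prior_h / p_d

# Prior: 2% of providers violate data; the observed pattern D is ten times more
# likely under a violation than under normal behavior.
p = posterior(prior_h=0.02, likelihood_d_given_h=0.50, likelihood_d_given_not_h=0.05)
print(round(p, 4))  # the posterior rises well above the 2% prior
```

The evidence moves the trust level: a pattern that is much more likely under the violation hypothesis raises its posterior probability far beyond the prior, which is exactly how the framework updates each provider's trust level.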

In **Figure 2**, the remote nodes and the healthcare network are hidden variables, and the only observable variable is the SLE metric. An SLE node forecasts how long it should take to share healthcare information with the remote nodes. The SLE itself has two parts: a period of elapsed time and a probability associated with that period (e.g., 38% of healthcare information is shared in 5 min or less, which can also be stated as "5 min with 38% confidence/probability"). The healthcare network, however, is a complete system of the variables and their dependencies. The healthcare system can also evaluate the services provided to the remote nodes, for example, "what is the probability that the network passes successfully while the given SLE has failed", *P*(healthcare system = true | SLE = false), which indicates that sharing of the healthcare information with the remote nodes was not completed within the threshold level. In general, the ultimate purpose of the proposed patient Bayesian model is to calculate the posterior (conditional) probability of the healthcare system given the SLE, *P*(healthcare system | SLE), which reflects the trustworthiness of the healthcare system. Eq. (3) calculates the posterior probability *P*(healthcare system | SLE) according to Bayes' rule (**Figure 3**).

**Figure 2.** *Communication between healthcare system and remote nodes.*

**Figure 3.** *Bayesian rules.*

It is necessary to choose a probabilistic model, represented by Eq. (2), that relates the random variables and the model parameters associated with them. Finally, Bayes' rule is applied to combine the prior knowledge and the newly observed data to find the posterior probability distribution, following Eq. (3):

$$P(\text{healthcare system} \mid \text{SLE}) = \frac{\displaystyle\sum_{\text{remote nodes}} P(\text{SLE}, \text{healthcare system}, \text{remote nodes})}{\displaystyle\sum_{\text{remote nodes},\, \text{healthcare system}} P(\text{SLE}, \text{healthcare system}, \text{remote nodes})} \tag{3}$$
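To make the marginalization in Eq. (3) concrete, here is a minimal sketch with binary variables. All the probability tables below are invented for illustration; they factor the joint distribution as in Eq. (2):

```python
from itertools import product

# Hypothetical CPTs (all values invented): hs = healthcare system trustworthy,
# rn = remote node reachable, sle = SLE met. The joint factors as in Eq. (2):
# P(sle, hs, rn) = P(sle | hs, rn) * P(rn | hs) * P(hs)
p_hs = {True: 0.9, False: 0.1}
p_rn_given_hs = {True: {True: 0.95, False: 0.05}, False: {True: 0.4, False: 0.6}}
p_sle_given = {  # keyed by (hs, rn)
    (True, True): {True: 0.9, False: 0.1},
    (True, False): {True: 0.3, False: 0.7},
    (False, True): {True: 0.2, False: 0.8},
    (False, False): {True: 0.05, False: 0.95},
}

def joint(sle, hs, rn):
    return p_sle_given[(hs, rn)][sle] * p_rn_given_hs[hs][rn] * p_hs[hs]

def posterior_hs_given_sle(hs, sle):
    # Eq. (3): sum the joint over the remote nodes, then normalize over all
    # values of (remote nodes, healthcare system) for the observed SLE.
    num = sum(joint(sle, hs, rn) for rn in (True, False))
    den = sum(joint(sle, h, rn) for h, rn in product((True, False), repeat=2))
    return num / den

p = posterior_hs_given_sle(hs=True, sle=False)
print(round(p, 3))  # P(healthcare system = true | SLE = false)
```

Observing a failed SLE lowers the posterior trust in the healthcare system relative to the 0.9 prior, which is the trust-update behavior the model is designed to capture.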


#### **5. Constraint-based adaptive boost algorithm**

The constraint-based adaptive boost (CBAB) algorithm is a simple, flexible, and effective classifier [13]. In the cloud-based healthcare system, CBAB is used for patient data analysis. In a healthcare system, each patient has a different set of records with some common features and unique attributes such as name, age, disease, etc.


Let $D^1_{n(1)}, D^2_{n(2)}, \dots, D^M_{n(M)}$ be the datasets of the $M$ patients, where the dataset of the $P$th node contains a total of $n(p)$ samples and can be represented as:

$$D\_{n(p)}^p = \left\{ \left(X\_{1(p)}^p, Y\_{1(p)}^p\right), \left(X\_{2(p)}^p, Y\_{2(p)}^p\right), \dots \left(X\_{n(p)}^p, Y\_{n(p)}^p\right) \right\} \tag{4}$$

where $X^p_i$ is the patient's data at the $P$th node and $Y^p_i$ is the decision being considered here. The CBAB algorithm is applied to analyze the health information of each patient over $t$ boosting iterations. In the decision making, each unidentified datum is represented by $(f_n, \theta, \delta)$, where $f_n$ is the selected health parameter, $\theta$ is the decision threshold, and $\delta$ is the sign of the decision, i.e., +1 or −1. CBAB calls a given learning algorithm in a series of rounds $t = 1, 2, \dots, T$. For any health information $X_i$, the hypothesis $h(X_i)$ outputs a decision of either +1 or −1. For the $P$th patient, $H^P(\cdot)$ is the set of $T$ unidentified data:

$$\left\{ h^{P(1)}(\cdot)\,\alpha^{P(1)},\; h^{P(2)}(\cdot)\,\alpha^{P(2)},\; \dots,\; h^{P(T)}(\cdot)\,\alpha^{P(T)} \right\} \tag{5}$$

where $h^{P(t)}$ is the unidentified datum at the $t$th iteration and $\alpha^{P(t)}$ is its corresponding weight. For a particular patient's information $X_i$, the prediction made by the $P$th patient can be defined as:

$$\hat{H}^{P}(X_i) = \operatorname{sign}\left\{H^{P}(X_i)\right\} = \operatorname{sign}\left\{\sum_{t=0}^{T} \alpha^{P(t)}\, h^{P(t)}(X_i)\right\} \tag{6}$$
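The weighted vote in Eq. (6) can be sketched as follows. The decision stumps, weights, and patient values below are invented for illustration; this is not a trained model from the chapter:

```python
def cbab_predict(weak_hyps, weights, x):
    """Eq. (6): sign of the alpha-weighted sum of weak-hypothesis outputs,
    where each weak hypothesis h(x) returns +1 or -1."""
    score = sum(a * h(x) for a, h in zip(weights, weak_hyps))
    return 1 if score >= 0 else -1

# Hypothetical decision stumps (f_n, theta, delta): compare one health
# parameter against a threshold and emit a signed decision.
def stump(feature_index, theta, delta):
    return lambda x: delta if x[feature_index] > theta else -delta

hyps = [stump(0, 37.5, +1), stump(1, 140, +1), stump(0, 39.0, +1)]
alphas = [0.8, 0.5, 1.2]                  # invented per-round weights
patient = (38.2, 120)                     # (temperature, systolic BP) - invented
print(cbab_predict(hyps, alphas, patient))
```

Each stump votes +1 or −1 on one health parameter, and the final decision is the sign of the weight-scaled vote, so stronger (higher-weight) rounds dominate the outcome.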

In a cloud-based healthcare system, all the nodes can share a patient's data with each other, and hence each node will receive M − 1 sets of information from the other nodes. Therefore, each node can integrate specific information, and in this way the healthcare system respects the sensitivity of the patient's information in decision making. Analyzing the original patient data across the different nodes of the healthcare system is infeasible due to patient privacy; therefore, we instead apply the classifiers of all the other nodes to the training set of the $P$th node and compare the error rate of each node with the training error rate of the $P$th node, as shown in Eq. (7):

The information a node receives from any other node might have been changed; hence, before using such data, the $P$th node should select a suitable subset of relevant data based on $f_n$. For the $P$th node, the error rate of the $q$th node is given by:

$$\epsilon_P^{(q)} = \frac{1}{n^{(P)}} \left[ \sum_{i=1}^{n^{(P)}} I\left(\operatorname{sign}\{H^{q}(X_i^P)\} \neq Y_i^P\right)\right] \tag{7}$$

where $H^q(\cdot)$ denotes the information patterns selected by node $q$ from the patient's shared data, and $I(\cdot)$ is the indicator function. The training error rate of the $P$th trained node is given by:

$$\epsilon_P = \frac{1}{n^{(P)}} \left[ \sum_{i=1}^{n^{(P)}} I\left(\operatorname{sign}\{H^{P}(X_i^P)\} \neq Y_i^P\right)\right] \tag{8}$$

For every node, we compute the difference between ∈<sub>*P*</sub><sup>(*q*)</sup> and ∈<sub>*P*</sub>. If (∈<sub>*P*</sub><sup>(*q*)</sup> − ∈<sub>*P*</sub>) is less than a certain threshold level, then we can assume that the patient's data shared between the *P*th and *q*th nodes are similar, and we can use the *q*th node as a trust node for the *P*th node.
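The error-rate comparison behind Eqs. (7) and (8) can be sketched as follows; the ensembles, records, labels, and the threshold tau are all illustrative assumptions:

```python
# Node P trusts node q when the gap between q's error on P's data and P's own
# training error is below a threshold tau. All names/values are illustrative.

def error_rate(H, xs, ys):
    """Fraction of records where the decision H(x) differs from the label y."""
    return sum(1 for x, y in zip(xs, ys) if H(x) != y) / len(xs)

def is_trust_node(err_q, err_p, tau=0.1):
    """Trust node q for node P when (err_q - err_p) is below threshold tau."""
    return (err_q - err_p) < tau

xs = [0.2, 0.6, 0.9]                    # hypothetical health measurements
ys = [-1, 1, 1]                         # hypothetical decisions Y_i^P
H_p = lambda x: 1 if x > 0.5 else -1    # node P's own ensemble decision
H_q = lambda x: 1 if x > 0.8 else -1    # hypothesis received from node q
err_p, err_q = error_rate(H_p, xs, ys), error_rate(H_q, xs, ys)
print(err_p, round(err_q, 2), is_trust_node(err_q, err_p, tau=0.5))  # 0.0 0.33 True
```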

The Bayesian model calculates the posterior (conditional) probability of the healthcare system given SLE, *P*(healthcare system | SLE), which reflects the trustworthiness of the healthcare system. It is necessary to choose a probabilistic model, represented by Eq. (2), that relates the random variables and the model parameters associated with it. At the end, Bayes' rule is applied to combine the prior knowledge and the new observed data to find the posterior probability distribution, following Eq. (3). Eq. (3) calculates the posterior probability *P*(healthcare system | SLE) according to Bayes' rule (**Figure 3**):

$$P(\text{healthcare system} \mid \text{SLE}) = \frac{\sum\_{\text{remote nodes}} P(\text{SLE}, \text{healthcare system}, \text{remote nodes})}{\sum\_{\text{healthcare system},\, \text{remote nodes}} P(\text{SLE}, \text{healthcare system}, \text{remote nodes})} \tag{3}$$

**Figure 2.**
*Communication between healthcare system and remote nodes.*

**Figure 3.**
*Bayesian rules.*

### **5. Constraint-based adaptive boost algorithm**

The constraint-based adaptive boost (CBAB) algorithm is a simple, flexible, and effective classifier [13]. In a cloud-based healthcare system, CBAB is used for patient's data analysis. In a healthcare system, each patient has a different set of records with some common features and unique attributes such as name, age, disease, etc.
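The marginalization behind this posterior can be sketched with a toy joint table; the joint probabilities below are invented purely for illustration:

```python
# Toy illustration of the marginalization in Eq. (3): the posterior
# P(healthcare system | SLE) sums the joint over remote nodes and normalizes
# over all values of healthcare system and remote nodes.

joint = {
    # (sle, healthcare_system, remote_node): probability (made-up values)
    (1, 1, 'A'): 0.20, (1, 1, 'B'): 0.10,
    (1, 0, 'A'): 0.05, (1, 0, 'B'): 0.15,
}

def posterior_hc_given_sle(joint, sle=1, hc=1):
    """Sum the joint over remote nodes, normalized over nodes and hc values."""
    num = sum(p for (s, h, _), p in joint.items() if s == sle and h == hc)
    den = sum(p for (s, _, _), p in joint.items() if s == sle)
    return num / den

print(round(posterior_hc_given_sle(joint), 6))  # 0.6
```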

### **6. Conclusion**

In our research, we have proposed patient Bayesian inference for analyzing the healthcare system. Bayesian inference is used to model the uncertainties that come with the problems, to deal with missing data, and to allow integrating data from remote resources. We have also used the concept of constraint-based adaptive boosting to demonstrate the performance of patient Bayesian inference on real datasets from the healthcare system to remote resources. In the future, we will try to find more accurate ways to protect patients' data without compromising patient privacy.


#### **Author details**

Shahid Naseem Department of Information Sciences, Division of Science and Technology, University of Education, Lahore, Pakistan

\*Address all correspondence to: shahid.naseem@ue.edu.pk

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*Patient Bayesian Inference: Cloud-Based Healthcare Data Analysis Using Constraint-Based… DOI: http://dx.doi.org/10.5772/intechopen.91171*

#### **References**


[1] Karim A, Abderrahim H, Hayat K. Big healthcare data: Preserving security and privacy. Journal of Big Data. 2018;**5** (1):1-8

[2] Jawwad A, Ali K. Understanding privacy violations in healthcare big data systems. IEEE. 2018;**20**(3): 273-281

[3] O'Mareen S, Richie O, Peter D. Patient consent to publication and data sharing in industry and NIH-funded clinical trials. Springer Nature. 2018; **269**(19):10-25

[4] Hina A, Jawad H, Junaid C, Kashif S, Mehmet A, Jalal M, et al. Risk analysis of cloud sourcing in healthcare and public health industry. IEEE Access. 2018;**6**:35-55

[5] Isra's Ahmed S, Ahmed Mousa A. Security and privacy issues in Ehealthcare systems: Towards trusted security. IJACSA. 2016;**7**(9):229-236

[6] Aaron B, William H, Michal M, Abdemour K, Thar B, Casimino A. An investigation into healthcare data patterns. MDPI. 2019;**11**(2):1-23

[7] Faruk A, Aftab A, Haider A, Nur Al Hassan H. A cloud-based healthcare framework for security & patient's data privacy using Wireless Body Area Network. Elsevier. 2014;**34**:511-517

[8] Roshan P, Sandeep S. Credentialing. India: NCBI; 2019

[9] Ella G, Jae S, Amanda D, Wenday S. Adverse health events associated with clinical placement: A systematic review. Nurse Education Today. 2019;**76**(1):178-190

[10] Tatiana E, Geboren M. Security and Acceptance of Cloud Computing in Healthcare. Berlin: Technische Universität; 2015

[11] Alther M, Redday C. Clinical decision support systems. In: Redday C, Aggarwal C, editors. Healthcare Data Analytics. London: Chapman and Hall Press; 2015. pp. 225-260

[12] Rafiqullah S, Sagar R, Anshul S, Nasar Uddin A. Bayesian method for modeling male breast cancer survival data. APJCP. 2014;**15**(2):663-669

[13] Li Y, Bai C, Reddy C. A distributed ensemble approach for mining healthcare data under privacy constraints. Information Sciences. 2016; **330**:245-259


#### **Chapter 7**

## The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted Parameter Spaces

*Ying-Ying Zhang*

### **Abstract**

In this chapter, we have investigated six loss functions. In particular, the squared error loss function and the weighted squared error loss function, which penalize overestimation and underestimation equally, are recommended for the unrestricted parameter space (−∞, ∞); Stein's loss function and the power-power loss function, which penalize gross overestimation and gross underestimation equally, are recommended for the positive restricted parameter space (0, ∞); the power-log loss function and Zhang's loss function, which penalize gross overestimation and gross underestimation equally, are recommended for (0, 1). Among the six Bayesian estimators that minimize the corresponding posterior expected losses (PELs), there exist three strings of inequalities. However, a string of inequalities among the six smallest PELs does not exist. Moreover, we summarize three hierarchical models where the unknown parameter of interest belongs to (0, ∞), that is, the hierarchical normal and inverse gamma model, the hierarchical Poisson and gamma model, and the hierarchical normal and normal-inverse-gamma model. In addition, we summarize two hierarchical models where the unknown parameter of interest belongs to (0, 1), that is, the beta-binomial model and the beta-negative binomial model. For empirical Bayesian analysis of the unknown parameter of interest of the hierarchical models, we use two common methods to obtain the estimators of the hyperparameters, that is, the moment method and the maximum likelihood estimator (MLE) method.

**Keywords:** Bayesian estimators, power-log loss function, power-power loss function, restricted parameter spaces, Stein's loss function, Zhang's loss function

#### **1. Introduction**

In Bayesian analysis, there are four basic elements: the data, the model, the prior, and the loss function. A Bayesian estimator minimizes some posterior expected loss (PEL) function. We confine our interests to six loss functions in this chapter: the squared error loss function (well known), the weighted squared error loss function ([1], p. 78), Stein's loss function [2–10], the power-power loss function [11], the power-log loss function [12], and Zhang's loss function [13]. It is worthy to note that among the six loss functions, the first and second loss functions are defined on Θ = (−∞, ∞), and they penalize overestimation and

underestimation equally. The third and fourth loss functions are defined on Θ = (0, ∞), and they penalize gross overestimation and gross underestimation equally, that is, an action *a* will suffer an infinite loss when it tends to 0 or ∞. The fifth and sixth loss functions are defined on Θ = (0, 1), and they penalize gross overestimation and gross underestimation equally, that is, an action *a* will suffer an infinite loss when it tends to 0 or 1.


The squared error loss function and the weighted squared error loss function have been used by many authors for the problem of estimating the variance, *σ*<sup>2</sup>, based on a random sample from a normal distribution with mean *μ* unknown (see, for instance, [14, 15]). As pointed out by [16], the two loss functions penalize equally for overestimation and underestimation, which is fine for the unrestricted parameter space Θ = (−∞, ∞).

For Θ = (0, ∞), the positive restricted parameter space, where 0 is a natural lower bound and the estimation problem is not symmetric, we should not choose the squared error loss function and the weighted squared error loss function but choose a loss function which can penalize gross overestimation and gross underestimation equally, that is, an action *a* will suffer an infinite loss when it tends to 0 or ∞. Stein's loss function owns this property, and thus it is recommended for Θ = (0, ∞) by many researchers (e.g., see [2–10]). Moreover, [11] proposes the power-power loss function, which not only penalizes gross overestimation and gross underestimation equally but also has balanced convergence rates or penalties for its argument too large and too small. Therefore, Stein's loss function and the power-power loss function are recommended for Θ = (0, ∞).
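This boundary behavior is easy to check numerically. The sketch below uses the standard form of Stein's loss, L(θ, a) = a/θ − 1 − log(a/θ), with illustrative values:

```python
import math

# Stein's loss L(theta, a) = a/theta - 1 - log(a/theta): it is zero at the
# true value a = theta and diverges as the action a tends to 0 or infinity,
# so gross underestimation and gross overestimation are both penalized.

def stein_loss(theta, a):
    r = a / theta
    return r - 1 - math.log(r)

theta = 2.0
print(stein_loss(theta, theta))       # 0.0 at the true parameter value
print(stein_loss(theta, 1e-6) > 10)   # True: near-zero action, huge loss
print(stein_loss(theta, 1e6) > 10)    # True: huge action, huge loss
```

By contrast, the squared error loss (θ − a)² stays finite as a → 0, which is why it is not recommended on (0, ∞).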

Analogously, for a restricted parameter space Θ = (0, 1), where 0 and 1 are two natural bounds and the estimation problem is not symmetric, we should not select the squared error loss function and the weighted squared error loss function but select a loss function which can penalize gross overestimation and gross underestimation equally, that is, an action *a* will suffer an infinite loss when it tends to 0 or 1. It is worthy to note that Stein's loss function and the power-power loss function are also not appropriate in this case. The power-log loss function proposed by [12] has this property. Moreover, they propose six properties for a good loss function on Θ = (0, 1). Specifically, the power-log loss function is convex in its argument, attains its global minimum at the true unknown parameter, and penalizes gross overestimation and gross underestimation equally. Apart from the six properties, [13] proposes the seventh property, that is, balanced convergence rates or penalties for the argument too large and too small, for a good loss function on Θ = (0, 1). Therefore, the power-log loss function and Zhang's loss function are recommended for Θ = (0, 1).

The rest of the chapter is organized as follows. In Section 2, we obtain two Bayesian estimators for *θ* ∈ Θ = (−∞, ∞) under the squared error loss function and the weighted squared error loss function. In Section 3, we obtain two Bayesian estimators for *θ* ∈ Θ = (0, ∞) under Stein's loss function and the power-power loss function. In Section 4, we obtain two Bayesian estimators for *θ* ∈ Θ = (0, 1) under the power-log loss function and Zhang's loss function. In Section 5, we summarize three strings of inequalities in a theorem. Some conclusions and discussions are provided in Section 6.

### **2. Bayesian estimation for** *θ* **∈ (−∞, ∞)**

There are two loss functions which are defined on Θ = (−∞, ∞) and penalize overestimation and underestimation equally, that is, the squared error loss function (well known) and the weighted squared error loss function (see [1], p. 78).

*The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted… DOI: http://dx.doi.org/10.5772/intechopen.88587*

#### **2.1 Squared error loss function**


The Bayesian estimator under the squared error loss function (well known), *δ*<sub>2</sub><sup>*π*</sup>(*x*), minimizes the posterior expected squared error loss (PESEL), E[*L*<sub>2</sub>(*θ*, *a*)|*x*], that is,

$$\delta\_2^{\pi}(\mathbf{x}) = \arg\min\_{a \in \mathcal{A}} \mathbb{E}[L\_2(\theta, a)|\mathbf{x}],\tag{1}$$

where A = {*a*(*x*) : *a*(*x*) ∈ (−∞, ∞)} is the action space, *a* = *a*(*x*) ∈ (−∞, ∞) is an action (estimator),

$$L\_2(\theta, a) = \left(\theta - a\right)^2\tag{2}$$

is the squared error loss function, and *θ* ∈ (−∞, ∞) is the unknown parameter of interest. The PESEL is easy to obtain (see [16]):

$$\text{PESEL}(\pi, a|\mathbf{x}) = \text{E}[L\_2(\theta, a)|\mathbf{x}] = a^2 - 2a\text{E}(\theta|\mathbf{x}) + \text{E}(\theta^2|\mathbf{x}).\tag{3}$$

It is found in [16] that

$$\delta\_2^{\pi}(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x}) \tag{4}$$

by taking partial derivative of the PESEL with respect to *a* and setting it to 0.
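This minimization can be checked numerically. The sketch below uses a toy discrete posterior (made-up support and weights) and confirms that the grid minimizer of the PESEL matches the posterior mean of Eq. (4):

```python
# Numerical check that the posterior mean E(theta|x) minimizes the PESEL:
# for a toy discrete posterior, scan actions and compare with the mean.

thetas = [1.0, 2.0, 4.0]   # hypothetical posterior support
probs = [0.2, 0.5, 0.3]    # hypothetical posterior probabilities

def pesel(a):
    return sum(p * (t - a) ** 2 for t, p in zip(thetas, probs))

post_mean = sum(p * t for t, p in zip(thetas, probs))
best = min((a / 1000 for a in range(5001)), key=pesel)   # grid search on [0, 5]
print(round(post_mean, 3), round(best, 3))  # 2.4 2.4
```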

#### **2.2 Weighted squared error loss function**

The Bayesian estimator under the weighted squared error loss function, *δ*<sub>*w*2</sub><sup>*π*</sup>(*x*), minimizes the posterior expected weighted squared error loss (PEWSEL) (see [1]), E[*L*<sub>*w*2</sub>(*θ*, *a*)|*x*], that is,

$$\delta\_{w2}^{\pi}(\mathbf{x}) = \arg\min\_{a \in \mathcal{A}} \mathrm{E}[L\_{w2}(\theta, a)|\mathbf{x}],\tag{5}$$

where A = {*a*(*x*) : *a*(*x*) ∈ (−∞, ∞)} is the action space, *a* = *a*(*x*) ∈ (−∞, ∞) is an action (estimator),

$$L\_{w2}(\theta, a) = \frac{1}{\theta^2} (\theta - a)^2 \tag{6}$$

is the weighted squared error loss function, and *θ* ∈ (−∞, ∞) is the unknown parameter of interest. The PEWSEL is easy to obtain (see [1]):

$$\text{PEWSEL}(\pi, a|\mathbf{x}) = \mathrm{E}[L\_{w2}(\theta, a)|\mathbf{x}] = a^2\,\mathrm{E}\left(\frac{1}{\theta^2}\,\Big|\,\mathbf{x}\right) - 2a\,\mathrm{E}\left(\frac{1}{\theta}\,\Big|\,\mathbf{x}\right) + 1. \tag{7}$$

It is found in [1] that

$$\delta\_{w2}^{\pi}(\mathbf{x}) = \frac{\mathrm{E}\left(\frac{1}{\theta}\,\big|\,\mathbf{x}\right)}{\mathrm{E}\left(\frac{1}{\theta^{2}}\,\big|\,\mathbf{x}\right)}\tag{8}$$

by taking partial derivative of the PEWSEL with respect to *a* and setting it to 0.
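Eq. (8) can be verified the same way. For a toy discrete posterior on θ > 0 (made-up support and weights), the grid minimizer of the PEWSEL agrees with the closed-form ratio E(1/θ|x) / E(1/θ²|x):

```python
# Numerical check of Eq. (8): for a toy discrete posterior on theta > 0, the
# PEWSEL minimizer equals E(1/theta|x) / E(1/theta^2|x). Values are made up.

thetas = [1.0, 2.0, 4.0]
probs = [0.2, 0.5, 0.3]

def pewsel(a):
    return sum(p * (t - a) ** 2 / t ** 2 for t, p in zip(thetas, probs))

e_inv = sum(p / t for t, p in zip(thetas, probs))        # E(1/theta | x)
e_inv2 = sum(p / t ** 2 for t, p in zip(thetas, probs))  # E(1/theta^2 | x)
closed_form = e_inv / e_inv2
best = min((a / 1000 for a in range(1, 5001)), key=pewsel)
print(round(closed_form, 3), round(best, 3))  # 1.527 1.527
```

Note that the weighting by 1/θ² pulls the estimator below the posterior mean, since large θ values are down-weighted.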

#### **3. Bayesian estimation for** *θ* **∈ (0,∞)**

There are many hierarchical models where the parameter of interest is *θ* ∈ Θ = (0, ∞). As pointed out in the introduction, we should calculate and use the Bayesian estimator of the parameter *θ* under Stein's loss function or the power-power loss function because they penalize gross overestimation and gross underestimation equally. We list several such hierarchical models as follows.

**Model (a) (hierarchical normal and inverse gamma model)**. This hierarchical model has been investigated by [10, 16, 17]. Suppose that we observe *X*1*, X*2*,* …*, Xn* from the hierarchical normal and inverse gamma model:

$$\begin{cases} X\_i | \theta \overset{\text{iid}}{\sim} N(\mu, \theta), & i = 1, 2, \dots, n, \\ \theta \sim IG(\alpha, \beta), \end{cases} \tag{9}$$

estimator of *θ*. Moreover, the normal distribution with a normal-inverse-gamma prior which assumes that *μ* is unknown is more realistic than the normal distribution with an inverse gamma prior investigated by [10] which assumes that *μ* is

*The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted…*

*<sup>s</sup>* ð Þ *x* , minimizes the poste-

*<sup>θ</sup>* (13)

� 1 � log *a* þ E log ð Þ *θ*j*x :* (14)

*<sup>θ</sup>* <sup>j</sup>*<sup>x</sup>* � � (15)

*<sup>α</sup>*<sup>∗</sup> *<sup>β</sup>* <sup>∗</sup> *,* (17)

<sup>2</sup> ð Þ *<sup>x</sup> ,* (16)

*,* (18)

*<sup>s</sup>* ð Þ *<sup>x</sup> ,*

ð Þ *xi* � *μ* 2

" #�<sup>1</sup>

E½ � *Ls*ð Þj *θ; a x ,* (12)

The Bayesian estimator under Stein's loss function, *δ<sup>π</sup>*

*δπ*

*PESL*ð Þ¼ *<sup>π</sup>; <sup>a</sup>*j*<sup>x</sup>* <sup>E</sup>½ �¼ *Ls*ð Þj *<sup>θ</sup>; <sup>a</sup> <sup>x</sup> <sup>a</sup>*<sup>E</sup> <sup>1</sup>

rior expected Stein's loss (PESL) (see [1, 10, 16]), E½ � *Ls*ð Þj *θ; a x* , that is,

*<sup>s</sup>* ð Þ¼ *<sup>x</sup>* arg min *<sup>a</sup>* <sup>∈</sup> <sup>A</sup>

*Ls*ð Þ¼ *θ; a*

*δπ <sup>s</sup>* ð Þ¼ *x*

PESLs evaluated at the Bayesian estimators are (see [10])

*<sup>α</sup>*<sup>∗</sup> <sup>¼</sup> *<sup>α</sup>* <sup>þ</sup>

*n* 2

where Af g *a*ð Þ *x* : *a*ð Þ *x* > 0 is the action space, *a* ¼ *a*ð Þ *x* >0 is an action (estimator),

is Stein's loss function, and *θ* > 0 is the unknown parameter of interest. The PESL

*<sup>θ</sup>* � <sup>1</sup> � log *<sup>a</sup>*

1 E <sup>1</sup>

by taking partial derivative of the PESL with respect to *a* and setting it to 0. The

<sup>2</sup> ð Þ¼ *x* Eð Þ *θ*j*x* is the Bayesian estimator under the squared error loss

1

*β* þ 1 2 X*n i*¼1

with respect to *IG*ð Þ *α; β* prior under Stein's loss function. This estimator mini-

*PESLs*ð Þ¼ *<sup>π</sup>; <sup>x</sup>* <sup>E</sup>½ �j *Ls*ð Þj *<sup>θ</sup>; <sup>a</sup> <sup>x</sup> <sup>a</sup>*¼*δ<sup>π</sup>*

*PESL*2ð Þ¼ *<sup>π</sup>; <sup>x</sup>* <sup>E</sup>½ �j *Ls*ð Þj *<sup>θ</sup>; <sup>a</sup> <sup>x</sup> <sup>a</sup>*¼*δπ*

For the variance parameter *θ* of the hierarchical normal and inverse gamma model (9), [10] recommends and analytically calculates the Bayesian estimator:

> *δπ <sup>s</sup>* ð Þ¼ *x*

and*<sup>β</sup>* <sup>∗</sup> <sup>¼</sup> <sup>1</sup>

mizes the PESL. [10] also analytically calculates the Bayesian estimator,

*a*

*θ* j*x* � �

known.

**3.1 Stein's loss function**

*DOI: http://dx.doi.org/10.5772/intechopen.88587*

*3.1.1 One-dimensional case*

is easy to obtain (see [10]):

It is found in [10] that

where *δ<sup>π</sup>*

function.

where

**93**

where �∞ <*μ*< ∞, *α* >0, and *β* >0 are known constants, *θ* is the unknown parameter of interest, *N*ð Þ *μ; θ* is the normal distribution, and *IG*ð Þ *α; β* is the inverse gamma distribution. It is worthy to note that the problem of finding the Bayesian rule under a conjugate prior is a standard problem and the problem is treated in almost every text on mathematical statistics. The idea of selecting an appropriate prior from the conjugate family was put forward by [18]. Specifically, Bayesian estimation of *θ* under the prior *IG*ð Þ *α; β* is studied in Example 4.2.5 (p. 236) of [17] and in Exercise 7.23 (p. 359) of [16]. However, they only calculate the Bayesian estimator with respect to *IG*ð Þ *<sup>α</sup>; <sup>β</sup>* prior under the squared error loss, *<sup>δ</sup><sup>π</sup>* <sup>2</sup> ð Þ¼ *x* Eð Þ *θ*j*x* .

**Model (b) (hierarchical Poisson and gamma model)**. This hierarchical model has been investigated by [1, 16, 19, 20]. Suppose that *X*1*, X*2*,* …*, Xn* are observed from the hierarchical Poisson and gamma model:

$$\begin{cases} X\_i | \theta \overset{\text{id}}{\sim} P(\theta), & i = 1, 2, \dots, n, \\ \theta \sim G(a, \beta), \end{cases} \tag{10}$$

where *α* >0 and *β* >0 are hyperparameters to be determined, *P*ð Þ*θ* is the Poisson distribution with an unknown mean *θ* >0, and *G*ð Þ *α; β* is the gamma distribution with an unknown shape parameter *α* and an unknown rate parameter *β*. The gamma prior *G*ð Þ *α; β* is a conjugate prior for the Poisson model, so that the posterior distribution of *θ* is also a gamma distribution. The hierarchical Poisson and gamma model (10) has been considered in Exercise 4.32 (p. 196) of [4]. It has been shown that the marginal distribution of *X* is a negative binomial distribution if *α* is a positive integer. The Bayesian estimation of *θ* under the gamma prior is studied in [19] and in Tables 3.3.1 (p. 121) and 4.2.1 (p. 176) of [1]. However, they only calculated the Bayesian posterior estimator of *θ* under the squared error loss function.

**Model (c) (hierarchical normal and normal-inverse-gamma model)**. This hierarchical model has been investigated by [2, 21, 22]. Let the observations *X*1*, X*2*,* …*, Xn* be from the hierarchical normal and normal-inverse-gamma model:

$$\begin{cases} X_i \mid (\mu, \theta) \overset{\text{iid}}{\sim} N(\mu, \theta), & i = 1, 2, \dots, n, \\ \mu \mid \theta \sim N(\mu_0, \theta/\kappa_0), \quad \theta \sim \mathrm{IG}(v_0/2, v_0 \sigma_0^2/2), \end{cases} \tag{11}$$

where $-\infty < \mu_0 < \infty$, $\kappa_0 > 0$, $v_0 > 0$, and $\sigma_0 > 0$ are known hyperparameters, $N(\mu, \theta)$ is a normal distribution with an unknown mean $\mu$ and an unknown variance $\theta$, $\mu \mid \theta \sim N(\mu_0, \theta/\kappa_0)$ is a normal distribution, and $\theta \sim IG(v_0/2, v_0\sigma_0^2/2)$ is an inverse gamma distribution. More specifically, with the joint conjugate prior $\pi(\mu, \theta) \sim N\text{-}IG(\mu_0, \kappa_0, v_0, \sigma_0^2)$, which is the normal-inverse-gamma distribution, the posterior distribution of $\theta$ was studied in Example 1.5.1 (p. 20) of [21] and Part I (pp. 69–70) of [22]. However, they did not provide any Bayesian posterior estimator of $\theta$. Moreover, the normal distribution with a normal-inverse-gamma prior, which assumes that $\mu$ is unknown, is more realistic than the normal distribution with an inverse gamma prior investigated by [10], which assumes that $\mu$ is known.

*The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted… DOI: http://dx.doi.org/10.5772/intechopen.88587*
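The conjugacy of prior (11) yields the standard textbook normal-inverse-gamma update (see, e.g., [21, 22]): the posterior of $\theta$ is again inverse gamma. The sketch below implements that standard update; the data and prior settings are arbitrary illustrations.

```python
import numpy as np

def nig_posterior(x, mu0, kappa0, v0, sigma0_sq):
    """Standard conjugate update for prior (11): returns (mu_n, kappa_n, v_n,
    sigma_n_sq) such that theta | x ~ IG(v_n / 2, v_n * sigma_n_sq / 2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    v_n = v0 + n
    ss = np.sum((x - xbar) ** 2)  # within-sample sum of squares
    v_n_sigma_n_sq = v0 * sigma0_sq + ss + kappa0 * n * (xbar - mu0) ** 2 / kappa_n
    return mu_n, kappa_n, v_n, v_n_sigma_n_sq / v_n

# Illustrative data and prior settings (arbitrary values).
x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
mu_n, kappa_n, v_n, sigma_n_sq = nig_posterior(x, mu0=1.0, kappa0=1.0, v0=2.0, sigma0_sq=1.0)
```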

#### **3.1 Stein's loss function**

*Bayesian Inference on Complicated Data*

#### *3.1.1 One-dimensional case*

The Bayesian estimator under Stein's loss function, $\delta_s^\pi(\mathbf{x})$, minimizes the posterior expected Stein's loss (PESL) (see [1, 10, 16]), $\mathrm{E}[L_s(\theta, a)|\mathbf{x}]$, that is,

$$\delta_s^\pi(\mathbf{x}) = \arg\min_{a \in \mathcal{A}} \mathrm{E}[L_s(\theta, a)|\mathbf{x}], \tag{12}$$

where $\mathcal{A} = \{a(\mathbf{x}) : a(\mathbf{x}) > 0\}$ is the action space, $a = a(\mathbf{x}) > 0$ is an action (estimator),

$$L_s(\theta, a) = \frac{a}{\theta} - 1 - \log\frac{a}{\theta} \tag{13}$$

is Stein's loss function, and *θ* > 0 is the unknown parameter of interest. The PESL is easy to obtain (see [10]):

$$PESL(\pi, a|\mathbf{x}) = \mathrm{E}[L_s(\theta, a)|\mathbf{x}] = a\,\mathrm{E}\!\left(\frac{1}{\theta}\Big|\mathbf{x}\right) - 1 - \log a + \mathrm{E}(\log\theta|\mathbf{x}). \tag{14}$$

It is found in [10] that

$$\delta\_s^{\pi}(\mathbf{x}) = \frac{1}{\mathrm{E}(\frac{1}{\theta}|\mathbf{x})} \tag{15}$$

by taking the partial derivative of the PESL with respect to $a$ and setting it to 0. The PESLs evaluated at the Bayesian estimators are (see [10])

$$\begin{aligned} PESL_s(\pi, \mathbf{x}) &= \mathrm{E}[L_s(\theta, a)|\mathbf{x}]\big|_{a=\delta_s^\pi(\mathbf{x})}, \\ PESL_2(\pi, \mathbf{x}) &= \mathrm{E}[L_s(\theta, a)|\mathbf{x}]\big|_{a=\delta_2^\pi(\mathbf{x})}, \end{aligned} \tag{16}$$

where $\delta_2^\pi(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x})$ is the Bayesian estimator under the squared error loss function.
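The minimization in (12) and the closed form (15) can be checked numerically. The sketch below assumes, purely for illustration, that $1/\theta \mid \mathbf{x}$ follows a gamma distribution (matching the inverse gamma posteriors discussed in this section) and verifies that the numerical minimizer of the PESL agrees with $1/\mathrm{E}(\theta^{-1}|\mathbf{x})$:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import digamma

# Illustrative posterior: assume 1/theta | x ~ Gamma(shape=4, scale=0.5)
# (i.e., theta | x is inverse gamma); the values are arbitrary.
shape, scale = 4.0, 0.5
E_inv_theta = shape * scale                       # E(1/theta | x)
E_log_theta = -(digamma(shape) + np.log(scale))   # E(log theta | x)

def pesl(a):
    # Posterior expected Stein's loss, Eq. (14).
    return a * E_inv_theta - 1.0 - np.log(a) + E_log_theta

# Eq. (15): the minimizer in closed form.
delta_s = 1.0 / E_inv_theta
res = minimize_scalar(pesl, bounds=(1e-6, 10.0), method="bounded")
```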

For the variance parameter *θ* of the hierarchical normal and inverse gamma model (9), [10] recommends and analytically calculates the Bayesian estimator:

$$\delta_s^\pi(\mathbf{x}) = \frac{1}{\alpha^* \beta^*}, \tag{17}$$

where

$$\alpha^* = \alpha + \frac{n}{2} \quad\text{and}\quad \beta^* = \left[\frac{1}{\beta} + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right]^{-1}, \tag{18}$$

with respect to the $IG(\alpha, \beta)$ prior under Stein's loss function. This estimator minimizes the PESL. [10] also analytically calculates the Bayesian estimator,

$$\delta_2^\pi(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x}) = \frac{1}{(\alpha^* - 1)\beta^*}, \tag{19}$$


with respect to the $IG(\alpha, \beta)$ prior under the squared error loss, and the corresponding PESL. [10] notes that

$$\mathrm{E}(\log\theta|\mathbf{x}) = -\log\beta^* - \psi(\alpha^*), \tag{20}$$

which is essential for the calculation of

$$PESL_s(\pi, \mathbf{x}) = \log\alpha^* - \psi(\alpha^*) \tag{21}$$

and

$$PESL_2(\pi, \mathbf{x}) = \frac{1}{\alpha^* - 1} + \log(\alpha^* - 1) - \psi(\alpha^*), \tag{22}$$

depends on the digamma function $\psi(\cdot)$. Finally, the numerical simulations exemplify that $PESL_s(\pi, \mathbf{x})$ and $PESL_2(\pi, \mathbf{x})$ depend only on $\alpha$ and $n$ and do not depend on $\mu$, $\beta$, and $\mathbf{x}$; that the estimators $\delta_s^\pi(\mathbf{x})$ are unanimously smaller than the estimators $\delta_2^\pi(\mathbf{x})$; and that $PESL_s(\pi, \mathbf{x})$ is unanimously smaller than $PESL_2(\pi, \mathbf{x})$.
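The closed forms (17)–(22) can be verified numerically for arbitrary data. The sketch below (with arbitrary illustrative settings) also confirms the two inequalities $\delta_s^\pi(\mathbf{x}) < \delta_2^\pi(\mathbf{x})$ and $PESL_s(\pi, \mathbf{x}) < PESL_2(\pi, \mathbf{x})$:

```python
import numpy as np
from scipy.special import digamma

# Illustrative settings for model (9); all values are arbitrary.
alpha, beta, mu = 2.0, 1.0, 0.0
x = np.array([0.3, -0.5, 1.2, 0.7, -0.1])
n = x.size

# Eq. (18): posterior hyperparameters.
alpha_star = alpha + n / 2.0
beta_star = 1.0 / (1.0 / beta + 0.5 * np.sum((x - mu) ** 2))

# Eqs. (17) and (19): estimators under Stein's and the squared error loss.
delta_s = 1.0 / (alpha_star * beta_star)
delta_2 = 1.0 / ((alpha_star - 1.0) * beta_star)

# Eqs. (21) and (22): the PESLs at the two estimators.
pesl_s = np.log(alpha_star) - digamma(alpha_star)
pesl_2 = 1.0 / (alpha_star - 1.0) + np.log(alpha_star - 1.0) - digamma(alpha_star)

# Consistency check of (21) via (14), using E(1/theta|x) and Eq. (20).
E_inv = alpha_star * beta_star
E_log = -np.log(beta_star) - digamma(alpha_star)
pesl_s_direct = delta_s * E_inv - 1.0 - np.log(delta_s) + E_log
```

Note that, as the text states, `pesl_s` and `pesl_2` depend on the data only through $\alpha^*$, i.e., only on $\alpha$ and $n$.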

For the hierarchical Poisson and gamma model (10), [20] first calculates the posterior distribution of $\theta$, $\pi(\theta|\mathbf{x})$, and the marginal pmf of $\mathbf{x}$, $\pi(\mathbf{x})$, in Theorem 1 of their paper. [20] then calculates the Bayesian posterior estimators $\delta_s^\pi(\mathbf{x})$ and $\delta_2^\pi(\mathbf{x})$ and the PESLs $PESL_s(\pi, \mathbf{x})$ and $PESL_2(\pi, \mathbf{x})$, which satisfy two inequalities. After that, the estimators of the hyperparameters of the model (10) by the moment method, $\alpha_1(n)$ and $\beta_1(n)$, are summarized in Theorem 2 of their paper. Moreover, the estimators of the hyperparameters of the model (10) by the maximum likelihood estimation (MLE) method, $\alpha_2(n)$ and $\beta_2(n)$, are summarized in Theorem 3 of their paper. Finally, the empirical Bayesian estimators of the parameter of the model (10) under Stein's loss function by the moment method and the MLE method are summarized in Theorem 4 of their paper. In their numerical simulations, [20] illustrate the two inequalities of the Bayesian posterior estimators and the PESLs, show that the moment estimators and the MLEs are consistent estimators of the hyperparameters, and assess the goodness of fit of the model to the simulated data. The numerical results indicate that the MLEs are better than the moment estimators when estimating the hyperparameters. Finally, [20] exploits the attendance data on 314 high school juniors from two urban high schools to illustrate their theoretical studies.
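Theorem 2 of [20] gives the exact moment estimators; the sketch below illustrates only the moment-matching idea behind them, equating the sample mean and variance with the marginal negative binomial moments $\mathrm{E}\,X = \alpha/\beta$ and $\mathrm{Var}\,X = (\alpha/\beta)(1 + 1/\beta)$. The particular formulas and simulation settings here are illustrative assumptions, not those of [20]:

```python
import numpy as np

def moment_estimates(x):
    """Moment-matching sketch for the hyperparameters of model (10).
    Marginally, E X = alpha/beta and Var X = (alpha/beta) * (1 + 1/beta);
    equating these to the sample mean and variance gives the estimates
    below. This only illustrates the idea; see Theorem 2 of [20] for the
    exact moment estimators alpha_1(n) and beta_1(n)."""
    x = np.asarray(x, dtype=float)
    m, v = x.mean(), x.var(ddof=1)
    if v <= m:
        raise ValueError("no overdispersion: the moment equations have no solution")
    beta_hat = m / (v - m)
    alpha_hat = m * beta_hat
    return alpha_hat, beta_hat

# Synthetic counts from model (10) with alpha = 4, beta = 2 (illustrative values).
rng = np.random.default_rng(0)
theta = rng.gamma(shape=4.0, scale=1.0 / 2.0, size=20000)  # G(alpha, rate beta)
x = rng.poisson(theta)
alpha_hat, beta_hat = moment_estimates(x)
```

The guard `v <= m` reflects that the marginal is overdispersed relative to the Poisson, so the moment equations are only solvable when the sample shows overdispersion.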

For the variance parameter $\theta$ of the normal distribution with a normal-inverse-gamma prior (11), [23] recommends and analytically calculates the Bayesian posterior estimator, $\delta_s^\pi(\mathbf{x})$, with respect to the conjugate prior $\mu \mid \theta \sim N(\mu_0, \theta/\kappa_0)$ and $\theta \sim IG(v_0/2, v_0\sigma_0^2/2)$ under Stein's loss function, which penalizes gross overestimation and gross underestimation equally. This estimator minimizes the PESL. As comparisons, the Bayesian posterior estimator, $\delta_2^\pi(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x})$, with respect to the same conjugate prior under the squared error loss function, and the PESL at $\delta_2^\pi(\mathbf{x})$, are calculated. The calculations of $\delta_s^\pi(\mathbf{x})$, $\delta_2^\pi(\mathbf{x})$, $PESL_s(\pi, \mathbf{x})$, and $PESL_2(\pi, \mathbf{x})$ depend only on $\mathrm{E}(\theta|\mathbf{x})$, $\mathrm{E}(\theta^{-1}|\mathbf{x})$, and $\mathrm{E}(\log\theta|\mathbf{x})$. The numerical simulations exemplify their theoretical studies that the PESLs depend only on $v_0$ and $n$, but do not depend on $\mu_0$, $\kappa_0$, $\sigma_0$, and especially $\mathbf{x}$. The estimators $\delta_2^\pi(\mathbf{x})$ are unanimously larger than the estimators $\delta_s^\pi(\mathbf{x})$, and $PESL_2(\pi, \mathbf{x})$ are unanimously larger than $PESL_s(\pi, \mathbf{x})$. Finally, [23] calculates the Bayesian posterior estimators and the PESLs of the monthly simple returns of the Shanghai Stock Exchange (SSE) Composite Index, which also exemplify the theoretical studies of the two inequalities of the Bayesian posterior estimators and the PESLs.

#### *3.1.2 Multidimensional case*


For estimating a covariance matrix, which is assumed to be positive definite, many researchers exploit the multidimensional Stein's loss function (e.g., see [2, 8, 24–31]). The multidimensional Stein's loss function (see [2]) is originally defined to estimate the $p \times p$ unknown covariance matrix $\Sigma$ by $\hat{\Sigma}$ with the loss function:

$$L\left(\Sigma,\hat{\Sigma}\right) = \text{tr}\Sigma^{-1}\hat{\Sigma} - \log \det \Sigma^{-1}\hat{\Sigma} - p.\tag{23}$$

When $p = 1$, the multidimensional Stein's loss function reduces to

$$L_s(\sigma^2, a) = \frac{a}{\sigma^2} - \log\frac{a}{\sigma^2} - 1, \tag{24}$$

which is in the form of (13), the one-dimensional Stein's loss function.
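The reduction from (23) to (24) can be confirmed directly. The sketch below evaluates the matrix loss for a $1 \times 1$ covariance "matrix" and compares it with the scalar Stein's loss; the numerical values are arbitrary illustrations:

```python
import numpy as np

def stein_loss(Sigma, Sigma_hat):
    # Multidimensional Stein's loss, Eq. (23).
    p = Sigma.shape[0]
    M = np.linalg.solve(Sigma, Sigma_hat)  # Sigma^{-1} Sigma_hat
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - p)

# p = 1: compare with the scalar Stein's loss, Eqs. (24) and (13).
sigma_sq, a = 2.0, 3.0
matrix_version = stein_loss(np.array([[sigma_sq]]), np.array([[a]]))
scalar_version = a / sigma_sq - np.log(a / sigma_sq) - 1.0
```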

#### **3.2 Power-power loss function**

The Bayesian estimator under the power-power loss function, $\delta_p^\pi(\mathbf{x})$, minimizes the posterior expected power-power loss (PEPL) (see [11]), $\mathrm{E}[L_p(\theta, a)|\mathbf{x}]$, that is,

$$\delta\_p^\pi(\mathbf{x}) = \arg\min\_{a \in \mathcal{A}} \mathbb{E}\left[L\_p(\theta, a)|\mathbf{x}\right],\tag{25}$$

where $\mathcal{A} = \{a(\mathbf{x}) : a(\mathbf{x}) > 0\}$ is the action space, $a = a(\mathbf{x}) > 0$ is an action (estimator),

$$L\_p(\theta, a) = \frac{a}{\theta} + \frac{\theta}{a} - 2 \tag{26}$$

is the power-power loss function, and *θ* >0 is the unknown parameter of interest. The PEPL is easy to obtain (see [11]):

$$PEPL(\pi, a|\mathbf{x}) = \mathrm{E}[L_p(\theta, a)|\mathbf{x}] = a\,\mathrm{E}\!\left(\frac{1}{\theta}\Big|\mathbf{x}\right) + \frac{1}{a}\mathrm{E}(\theta|\mathbf{x}) - 2. \tag{27}$$

It is found in [11] that

$$\delta_p^\pi(\mathbf{x}) = \sqrt{\frac{\mathrm{E}(\theta|\mathbf{x})}{\mathrm{E}\left(\frac{1}{\theta}|\mathbf{x}\right)}} \tag{28}$$

by taking the partial derivative of the PEPL with respect to $a$ and setting it to 0. The PEPLs evaluated at the Bayesian estimators are (see [11])

$$\begin{aligned} PEPL_p(\pi, \mathbf{x}) &= \mathrm{E}[L_p(\theta, a)|\mathbf{x}]\big|_{a=\delta_p^\pi(\mathbf{x})}, \\ PEPL_2(\pi, \mathbf{x}) &= \mathrm{E}[L_p(\theta, a)|\mathbf{x}]\big|_{a=\delta_2^\pi(\mathbf{x})}. \end{aligned} \tag{29}$$

The power-power loss function is proposed in [11], and it has all seven properties proposed in that paper. More specifically, it penalizes gross overestimation and gross underestimation equally, is convex in its argument, and has balanced convergence rates, or penalties, for its argument being too large or too small. Therefore, it is recommended for the positive restricted parameter space $\Theta = (0, \infty)$.
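A consequence of (28) worth noting is that $\delta_p^\pi(\mathbf{x}) = \sqrt{\delta_s^\pi(\mathbf{x})\,\delta_2^\pi(\mathbf{x})}$, the geometric mean of the Stein's-loss and squared-error-loss estimators. The sketch below checks (28) against a numerical minimization of the PEPL, again assuming purely for illustration that $1/\theta \mid \mathbf{x}$ is gamma distributed:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative posterior: assume 1/theta | x ~ Gamma(shape=5, scale=0.4)
# (arbitrary values), so both posterior moments below have closed forms.
shape, scale = 5.0, 0.4
E_inv_theta = shape * scale                  # E(1/theta | x)
E_theta = 1.0 / ((shape - 1.0) * scale)      # E(theta | x); requires shape > 1

def pepl(a):
    # Posterior expected power-power loss, Eq. (27).
    return a * E_inv_theta + E_theta / a - 2.0

# Eq. (28); equivalently the geometric mean of delta_s and delta_2.
delta_p = np.sqrt(E_theta / E_inv_theta)
delta_s = 1.0 / E_inv_theta
delta_2 = E_theta
res = minimize_scalar(pepl, bounds=(1e-6, 10.0), method="bounded")
```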

### **4. Bayesian estimation for** *θ* **∈ (0, 1)**

There are some hierarchical models where the unknown parameter of interest is $\theta \in \Theta = (0, 1)$. As pointed out in the introduction, we should calculate and use the Bayesian estimator of the parameter $\theta$ under the power-log loss function or Zhang's loss function because they penalize gross overestimation and gross underestimation equally. We list two such hierarchical models as follows.

**Model (d) (beta-binomial model)**. This hierarchical model has been investigated by [1, 12, 13, 16, 32, 33]. Suppose that *X*1*, X*2*,* …*, Xn* are from the beta-binomial model:

$$\begin{cases} X_i \mid \theta \overset{\text{iid}}{\sim} \mathrm{Bin}(m, \theta), & i = 1, 2, \dots, n, \\ \theta \sim \mathrm{Be}(\alpha, \beta), \end{cases} \tag{30}$$


where $\alpha > 0$ and $\beta > 0$ are known constants, $m$ is a known positive integer, $\theta \in (0, 1)$ is the unknown parameter of interest, $\mathrm{Be}(\alpha, \beta)$ is the beta distribution, and $\mathrm{Bin}(m, \theta)$ is the binomial distribution. Specifically, Bayesian estimation of $\theta$ under the prior $\mathrm{Be}(\alpha, \beta)$ is studied in Example 7.2.14 (p. 324) of [16] and in Tables 3.3.1 (p. 121) and 4.2.1 (p. 176) of [1]. However, they only calculate the Bayesian estimator with respect to the $\mathrm{Be}(\alpha, \beta)$ prior under the squared error loss, $\delta_2^\pi(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x})$. Moreover, they only consider one observation. The beta-binomial model has been investigated recently. For instance, [32] uses the beta-binomial to draw the random removals in progressive censoring; [12, 13] use the beta-binomial to model some magazine exposure data for the monthly magazine *Signature*; and [33] develops an estimation procedure for the parameters of a zero-inflated overdispersed binomial model in the presence of missing responses.
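By conjugacy, the posterior for model (30) is $\mathrm{Be}(\alpha + \sum_i x_i,\ \beta + nm - \sum_i x_i)$, so the squared-error-loss estimator is the posterior mean. The sketch below (with arbitrary illustrative values) cross-checks this closed form by numerical integration of the beta posterior kernel:

```python
import numpy as np
from scipy import integrate

# Illustrative hyperparameters and data; not values from the chapter.
alpha, beta, m = 2.0, 3.0, 10
x = np.array([4, 6, 5, 7])
n = x.size

# Conjugate update for model (30): theta | x ~ Be(alpha + sum x, beta + n*m - sum x).
a_post = alpha + x.sum()
b_post = beta + n * m - x.sum()

# Squared error loss estimator: the posterior mean delta_2 = E(theta | x).
delta_2 = a_post / (a_post + b_post)

# Cross-check the posterior mean by numerical integration of the beta kernel.
kernel = lambda t: t ** (a_post - 1.0) * (1.0 - t) ** (b_post - 1.0)
norm, _ = integrate.quad(kernel, 0.0, 1.0)
mean_num, _ = integrate.quad(lambda t: t * kernel(t), 0.0, 1.0)
delta_2_num = mean_num / norm
```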

**Model (e) (beta-negative binomial model)**. This hierarchical model has been investigated by [1, 34]. Suppose that *X*1*, X*2*,* …*, Xn* are from the beta-negative binomial model:

$$\begin{cases} X_i \mid \theta \overset{\text{iid}}{\sim} \mathrm{NB}(m, \theta), & i = 1, 2, \dots, n, \\ \theta \sim \mathrm{Be}(\alpha, \beta), \end{cases} \tag{31}$$

where $\alpha > 0$ and $\beta > 0$ are known constants, $m$ is a known positive integer, $\theta \in (0, 1)$ is the unknown parameter of interest, $\mathrm{Be}(\alpha, \beta)$ is the beta distribution, and $\mathrm{NB}(m, \theta)$ is the negative binomial distribution. Specifically, Bayesian estimation of $\theta$ under the prior $\mathrm{Be}(\alpha, \beta)$ is studied in Tables 3.3.1 (p. 121) and 4.2.1 (p. 176) of [1]. However, he only calculates the Bayesian estimator with respect to the $\mathrm{Be}(\alpha, \beta)$ prior under the squared error loss function, $\delta_2^\pi(\mathbf{x}) = \mathrm{E}(\theta|\mathbf{x})$. Moreover, he only considers one observation.

#### **4.1 Power-log loss function**

A good loss function $L(\theta, a) = L(a \mid \theta) = L(x)\vert_{x=a/\theta}$ for $\Theta = (0, 1)$ should have the six properties summarized in **Table 1** (see Table 1 in [12]).

| **Properties** | $L(x)$ | $L(a \mid \theta)$ |
| --- | --- | --- |
| (a) | $L(x) \ge 0$ for all $0 < x < \frac{1}{\theta}$ | $L(a \mid \theta) \ge 0$ for all $0 < a < 1$ |
| (b) | $L(1) = 0$ | $L(\theta \mid \theta) = L(a \mid \theta)\vert_{a=\theta} = 0$ |
| (c) | $L\left(\left(\frac{1}{\theta}\right)^-\right) = \lim_{x \to (1/\theta)^-} L(x) = \infty$ | $L(1^- \mid \theta) = \lim_{a \to 1^-} L(a \mid \theta) = \infty$ |
| (d) | $L(0^+) = \lim_{x \to 0^+} L(x) = \infty$ | $L(0^+ \mid \theta) = \lim_{a \to 0^+} L(a \mid \theta) = \infty$ |
| (e) | Convex in $x$ for all $0 < x < \frac{1}{\theta}$ | Convex in $a$ for all $0 < a < 1$ |
| (f) | $L'(1) = \frac{dL(x)}{dx}\vert_{x=1} = 0$ | $\frac{\partial}{\partial a} L(a \mid \theta)\vert_{a=\theta} = 0$ |

**Table 1.** *(Table 1 in [12]) The six properties of a good loss function for $\Theta = (0, 1)$. $0 < \theta < 1$ is fixed.*

In **Table 1**, property (a) means that any action $a$ of the parameter $\theta$ should incur a nonnegative loss. Property (b) means that when $x = a/\theta = 1$, or $a = \theta$, that is, $a$ correctly estimates $\theta$, the loss is 0. Property (c) means that when $x = a/\theta \to (1/\theta)^-$, that is, $a$ is moving away from $\theta$ and tends to $1^-$, it will incur an infinite loss. Property (d) means that when $x = a/\theta \to 0^+$, that is, $a$ is moving away from $\theta$ and tends to $0^+$, it will also incur an infinite loss. Properties (c) and (d) mean that the loss function penalizes gross overestimation and gross underestimation equally. Property (e) is useful in the proofs of some propositions on the minimaxity and admissibility of the Bayesian estimator (see [1]). Property (f) means that 1 and $\theta$ are the local extrema of $L(x)$ and $L(a \mid \theta)$, respectively. Property (f) also implies that $L(\theta + \Delta a \mid \theta) = o(\Delta a)$, that is, the loss incurred by an action $a = \theta + \Delta a$ near $\theta$ ($\Delta a \approx 0$) is very small compared to $\Delta a$.

Let


$$\mathcal{g}\_{pl}(\mathbf{x}) = \frac{\left(\frac{1}{\theta} - 1\right)^2}{\frac{1}{\theta} - \mathbf{x}} - \log \mathbf{x} \text{ and } \mathcal{g}\_{pl}(\mathbf{1}) = \frac{1}{\theta} - \mathbf{1}. \tag{32}$$

Define

$$L\_{pl}(\mathbf{x}) = \mathbf{g}\_{pl}(\mathbf{x}) - \mathbf{g}\_{pl}(\mathbf{1}) = \frac{\left(\frac{1}{\theta} - \mathbf{1}\right)^2}{\frac{1}{\theta} - \mathbf{x}} - \log \mathbf{x} - \left(\frac{\mathbf{1}}{\theta} - \mathbf{1}\right). \tag{33}$$

Thus

$$\begin{split} \mathcal{L}\_{pl}(\theta, a) = \mathcal{L}\_{pl}(a|\theta) &= \mathcal{L}\_{pl}(\mathbf{x})|\_{\mathbf{x} = a/\theta} = \frac{\left(\frac{1}{\theta} - \mathbf{1}\right)^2}{\frac{1}{\theta} - \frac{a}{\theta}} - \log\frac{a}{\theta} - \left(\frac{\mathbf{1}}{\theta} - \mathbf{1}\right) \\ &= \frac{\theta\left(\frac{1}{\theta} - \mathbf{1}\right)^2}{\mathbf{1} - a} - \log a + \log\theta - \left(\frac{\mathbf{1}}{\theta} - \mathbf{1}\right). \end{split} \tag{34}$$

It is easy to check (see the supplement of [12]) that *Lpl*ð Þ¼ *θ; a Lpl*ð Þ¼ *a*j*θ Lpl*ð Þ *<sup>x</sup> <sup>x</sup>*¼*a=<sup>θ</sup>*, which is called the power-log loss function, satisfies all the six properties listed in **Table 1**. Consequently, the power-log loss function is a good loss function for Θ ¼ ð Þ 0*;* 1 , and thus it is recommended for Θ ¼ ð Þ 0*;* 1 .

We remark that the power-log loss function on Θ ¼ ð Þ 0*;* 1 is an analog of the power-log loss function on Θ ¼ ð Þ 0*;* ∞ , which is the popular Stein's loss function.


**Table 1.** *(Table 1 in [12]) The six properties of a good loss function for Θ* ¼ ð Þ *0; 1 . 0*<*θ* <*1 is fixed.*

The Bayesian estimator under the power-log loss function, *δπ pl*ð Þ *x* , minimizes the posterior expected power-log loss (PEPLL) (see [12]), E *Lpl*ð Þj *<sup>θ</sup>; <sup>a</sup> <sup>x</sup>* � �, that is,

$$\delta\_{pl}^{\pi}(\mathfrak{x}) = \arg\min\_{a \in \mathcal{A}} \mathbb{E}\left[L\_{pl}(\theta, a)|\mathfrak{x}\right],\tag{35}$$

*The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted…*
*DOI: http://dx.doi.org/10.5772/intechopen.88587*

where 𝒜 = {*a*(*x*) : *a*(*x*) ∈ (0, 1)} is the action space, *a* = *a*(*x*) ∈ (0, 1) is an action (estimator), *L*<sub>pl</sub>(*θ*, *a*) given by (34) is the power-log loss function, and *θ* ∈ (0, 1) is the unknown parameter of interest. The PEPLL is easy to obtain (see [12]):

$$PEPLL(\pi, a|\mathbf{x}) = \mathbb{E}\left[L\_{pl}(\theta, a)|\mathbf{x}\right] = \frac{E\_1(\mathbf{x})}{\mathbf{1} - a} - \log a + E\_2(\mathbf{x}) - E\_3(\mathbf{x}) + \mathbf{1},\tag{36}$$

where

$$E\_1(\mathbf{x}) = \mathbb{E}\left[\theta^{-1}(1-\theta)^2|\mathbf{x}\right] > 0,$$

$$E\_2(\mathbf{x}) = \mathbb{E}[\log \theta | \mathbf{x}] < 0,\tag{37}$$

$$E\_3(\mathbf{x}) = \mathbb{E}\left[\theta^{-1}|\mathbf{x}\right] > 0.$$

It is found in [12] that

$$\delta\_{pl}^{\pi}(\mathfrak{x}) = \frac{2 + E\_1(\mathfrak{x}) - \sqrt{E\_1(\mathfrak{x})(E\_1(\mathfrak{x}) + 4)}}{2} \tag{38}$$

by taking partial derivative of the PEPLL with respect to *a* and setting it to 0. The PEPLLs evaluated at the Bayesian estimators are (see [12])

$$\begin{aligned} PEPLL_{pl}(\pi, x) &= \mathrm{E}\left[L_{pl}(\theta, a)|x\right]\big|_{a = \delta_{pl}^{\pi}(x)}, \\ PEPLL_{2}(\pi, x) &= \mathrm{E}\left[L_{pl}(\theta, a)|x\right]\big|_{a = \delta_{2}^{\pi}(x)}. \end{aligned} \tag{39}$$

Finally, the numerical simulations and a real data example of some monthly magazine exposure data (see [35]) exemplify the theoretical studies of two size relationships about the Bayesian estimators and the PEPLLs in [12].
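The quantities in this subsection are straightforward to compute for the beta-binomial model (30), whose *Be*(*α*, *β*) prior is conjugate. The sketch below is our own illustration, not code from [12]; the simulated data, hyperparameters, and Monte Carlo size are arbitrary choices. It approximates E<sub>1</sub>(*x*) by posterior sampling, evaluates (38), and confirms that the power-log estimator attains a smaller PEPLL than the posterior mean, in line with (39).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical beta-binomial data from model (30): n = 20 observations with
# m = 10 trials each, true theta = 0.4, and an illustrative Be(2, 2) prior.
alpha, beta, m, n = 2.0, 2.0, 10, 20
x = rng.binomial(m, 0.4, size=n)

# Conjugacy: theta | x ~ Be(alpha + sum(x), beta + n*m - sum(x)).
a_post = alpha + x.sum()
b_post = beta + n * m - x.sum()

# Monte Carlo approximations of the posterior expectations in (37).
theta = rng.beta(a_post, b_post, size=200_000)
E1 = np.mean((1.0 - theta) ** 2 / theta)
E2 = np.mean(np.log(theta))
E3 = np.mean(1.0 / theta)

# Bayesian estimator (38) under the power-log loss, and the posterior mean
# under the squared error loss.
delta_pl = (2.0 + E1 - np.sqrt(E1 * (E1 + 4.0))) / 2.0
delta_2 = a_post / (a_post + b_post)

# PEPLL (36) evaluated at both estimators, as in (39); delta_pl attains the minimum.
def pepll(a):
    return E1 / (1.0 - a) - np.log(a) + E2 - E3 + 1.0

pepll_pl, pepll_2 = pepll(delta_pl), pepll(delta_2)
print(delta_pl, delta_2, pepll_pl <= pepll_2)
```

Here *δ*<sup>π</sup><sub>pl</sub>(*x*) falls slightly below the posterior mean *δ*<sup>π</sup><sub>2</sub>(*x*), consistent with the string of inequalities in (49).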

#### **4.2 Zhang's loss function**

Zhang et al. [12] proposed six properties for a good loss function *L*(*θ*, *a*) = *L*(*a*|*θ*) = *L*(*x*)|<sub>*x*=*a*/*θ*</sub> on Θ = (0, 1). Apart from the six properties, [13] proposes a seventh property (balanced convergence rates or penalties for the argument too large and too small) for a good loss function on Θ = (0, 1). Moreover, the seven properties for a good loss function on Θ = (0, 1) are summarized in **Table 1** of [13]. The explanations of the first six properties in **Table 1** of [13] can be found in the previous subsection (see also [12]). In **Table 1** of [13], property (g) (the seventh property) means that *L*(*k*<sub>1</sub>(*θ*)(1/*n*)) and *L*((1/*θ*)(1 − 1/*n*)) tend to ∞ at the same rate, and that *L*(*k*<sub>2</sub>(*θ*)(1/*n*)|*θ*) and *L*(1 − 1/*n*|*θ*) tend to ∞ at the same rate. In other words,

$$\lim\_{n \to \infty} \frac{L\left(k\_1(\theta)\frac{1}{n}\right)}{L\left(\frac{1}{\theta}\left(1-\frac{1}{n}\right)\right)} = \mathbf{1} \text{ and } \lim\_{n \to \infty} \frac{L\left(k\_2(\theta)\frac{1}{n}|\theta\right)}{L\left(1-\frac{1}{n}|\theta\right)} = \mathbf{1}.\tag{40}$$


And they say that *L*(*k*<sub>1</sub>(*θ*)(1/*n*)) and *L*((1/*θ*)(1 − 1/*n*)) are asymptotically equivalent. Similarly, *L*(*k*<sub>2</sub>(*θ*)(1/*n*)|*θ*) and *L*(1 − 1/*n*|*θ*) are said to be asymptotically equivalent. They also say that *L*(*x*) (*L*(*a*|*θ*)) has balanced convergence rates or penalties for *x* (*a*) too large and too small. It is worth noting that *k*<sub>1</sub>(*θ*)(1/*n*) → 0 and (1/*θ*)(1 − 1/*n*) → 1/*θ* at the same order *O*(1/*n*). Analogously, *k*<sub>2</sub>(*θ*)(1/*n*) → 0 and 1 − 1/*n* → 1 at the same order *O*(1/*n*). Finally, property (g) can hold only when properties (c) and (d) hold.
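Property (g) can be checked numerically. The sketch below is our own illustration for Zhang's loss *L*<sub>z</sub>(*a*|*θ*), defined below in (43); in particular, the scaling *k*<sub>2</sub>(*θ*) = (*θ*/(1 − *θ*))² is our own choice, picked so that both penalties grow at the same *O*(*n*) rate, and is not a formula taken from [13].

```python
# Numerical check of the asymptotic equivalence in property (g) / eq. (40) for
# Zhang's loss L_z(a|theta) from eq. (43). The scaling constant k2(theta) below
# is an assumption of ours chosen to balance the two O(n) rates.
def L_z(a, theta):
    c = (1.0 / theta - 1.0) ** 2
    return theta / (c * a) + theta / (1.0 - a) - 1.0 / (theta * c)

theta = 0.3
k2 = (theta / (1.0 - theta)) ** 2

# L_z(k2(theta)/n | theta) and L_z(1 - 1/n | theta) both tend to infinity,
# and their ratio tends to 1 as n grows, as in eq. (40).
for n in [10**3, 10**5, 10**7]:
    print(n, L_z(k2 / n, theta) / L_z(1.0 - 1.0 / n, theta))
```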

Let


$$\mathbf{g}\_x(\mathbf{x}) = \frac{\mathbf{1}}{\left(\frac{1}{\theta} - \mathbf{1}\right)^2 \mathbf{x}} + \frac{\mathbf{1}}{\frac{1}{\theta} - \mathbf{x}} \text{ and } \mathbf{g}\_x(\mathbf{1}) = \frac{\mathbf{1}}{\theta \left(\frac{1}{\theta} - \mathbf{1}\right)^2}. \tag{41}$$

Let

$$L_{z}(x) = g_{z}(x) - g_{z}(1) = \frac{1}{\left(\frac{1}{\theta} - 1\right)^2 x} + \frac{1}{\frac{1}{\theta} - x} - \frac{1}{\theta\left(\frac{1}{\theta} - 1\right)^2}. \tag{42}$$

Thus

$$\begin{split} L_{z}(\theta, a) = L_{z}(a|\theta) = L_{z}(x)\big|_{x = a/\theta} &= \frac{1}{\left(\frac{1}{\theta} - 1\right)^2 \frac{a}{\theta}} + \frac{1}{\frac{1}{\theta} - \frac{a}{\theta}} - \frac{1}{\theta\left(\frac{1}{\theta} - 1\right)^2} \\ &= \frac{\theta}{\left(\frac{1}{\theta} - 1\right)^2 a} + \frac{\theta}{1 - a} - \frac{1}{\theta\left(\frac{1}{\theta} - 1\right)^2}. \end{split} \tag{43}$$

It is easy to check (see the supplement of [13]) that *L*<sub>z</sub>(*θ*, *a*) = *L*<sub>z</sub>(*a*|*θ*) = *L*<sub>z</sub>(*x*)|<sub>*x*=*a*/*θ*</sub>, which is called Zhang's loss function, satisfies all the seven properties listed in **Table 1** of [13]. Consequently, Zhang's loss function is a good loss function, and thus it is recommended for Θ = (0, 1).

The Bayesian estimator under Zhang's loss function, *δ*<sup>π</sup><sub>z</sub>(*x*), minimizes the posterior expected Zhang's loss (PEZL) (see [13]), E[*L*<sub>z</sub>(*θ*, *a*)|*x*], that is,

$$\delta_{z}^{\pi}(x) = \arg\min_{a \in \mathcal{A}} \mathrm{E}[L_{z}(\theta, a)|x], \tag{44}$$

where 𝒜 = {*a*(*x*) : *a*(*x*) ∈ (0, 1)} is the action space, *a* = *a*(*x*) ∈ (0, 1) is an action (estimator), *L*<sub>z</sub>(*θ*, *a*) given by (43) is Zhang's loss function, and *θ* ∈ (0, 1) is the unknown parameter of interest. The PEZL is easy to obtain (see [13]):

$$PEZL(\pi, a|x) = \mathrm{E}[L_{z}(\theta, a)|x] = \frac{E_1(x)}{a} + \frac{E_2(x)}{1 - a} - E_3(x), \tag{45}$$

where

$$\begin{aligned} E_1(x) &= \mathrm{E}\left[\frac{\theta^3}{(1 - \theta)^2}\Big|x\right], \\ E_2(x) &= \mathrm{E}(\theta|x), \\ E_3(x) &= \mathrm{E}\left[\frac{\theta}{(1 - \theta)^2}\Big|x\right]. \end{aligned} \tag{46}$$

It is found in [13] that

$$\delta_{z}^{\pi}(x) = \frac{\sqrt{E_1(x)}}{\sqrt{E_1(x)} + \sqrt{E_2(x)}} \tag{47}$$


by taking partial derivative of the PEZL with respect to *a* and setting it to 0. The PEZLs evaluated at the Bayesian estimators are (see [13])

$$\begin{aligned} PEZL_{z}(\pi, x) &= \mathrm{E}[L_{z}(\theta, a)|x]\big|_{a = \delta_{z}^{\pi}(x)}, \\ PEZL_{2}(\pi, x) &= \mathrm{E}[L_{z}(\theta, a)|x]\big|_{a = \delta_{2}^{\pi}(x)}. \end{aligned} \tag{48}$$

Zhang et al. [13] consider an example of some magazine exposure data for the monthly magazine *Signature* (see [12, 35]) and compare the numerical results with those of [12].
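As a concrete illustration of these quantities, here is our own sketch (not code from [13]); the *Be*(30, 50) posterior and the simulation size are arbitrary choices. It approximates the expectations in (46) by posterior sampling, evaluates (47), and confirms that *δ*<sup>π</sup><sub>z</sub>(*x*) attains a smaller PEZL than the posterior mean, as in (48).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative Be(a, b) posterior for theta (e.g., from a beta-binomial model).
a_post, b_post = 30.0, 50.0
theta = rng.beta(a_post, b_post, size=200_000)

# Monte Carlo approximations of the posterior expectations in (46).
E1 = np.mean(theta ** 3 / (1.0 - theta) ** 2)
E2 = a_post / (a_post + b_post)          # E(theta | x), available in closed form
E3 = np.mean(theta / (1.0 - theta) ** 2)

# Bayesian estimator (47) under Zhang's loss, and the posterior mean.
delta_z = np.sqrt(E1) / (np.sqrt(E1) + np.sqrt(E2))
delta_2 = E2

# PEZL (45) evaluated at both estimators, as in (48); delta_z attains the minimum.
def pezl(a):
    return E1 / a + E2 / (1.0 - a) - E3

pezl_z, pezl_2 = pezl(delta_z), pezl(delta_2)
print(delta_z, delta_2, pezl_z <= pezl_2)
```

Note that *δ*<sup>π</sup><sub>z</sub>(*x*) lands above the posterior mean here, consistent with the inequality *δ*<sup>π</sup><sub>2</sub>(*x*) ≤ *δ*<sup>π</sup><sub>z</sub>(*x*) in (49).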

For the probability parameter *θ* of the beta-negative binomial model (31), [34] recommends and analytically calculates the Bayesian estimator *δ*<sup>π</sup><sub>z</sub>(*x*) with respect to the *Be*(*α*, *β*) prior under Zhang's loss function, which penalizes gross overestimation and gross underestimation equally. This estimator minimizes the PEZL. They also calculate the usual Bayesian estimator *δ*<sup>π</sup><sub>2</sub>(*x*) = E(*θ*|*x*), which minimizes the PESEL. Moreover, they obtain the PEZLs evaluated at the two Bayesian estimators, *PEZL*<sub>z</sub>(*π*, *x*) and *PEZL*<sub>2</sub>(*π*, *x*). After that, they show two theorems about the estimators of the hyperparameters of the beta-negative binomial model (31), when *m* is known or unknown, by the moment method (Theorem 1 in [34]) and the MLE method (Theorem 2 in [34]). Finally, the empirical Bayesian estimator of the probability parameter *θ* under Zhang's loss function is obtained with the hyperparameters estimated by the moment method or the MLE method from the two theorems.

The numerical simulations of [34] illustrate three things: the two inequalities between the Bayesian posterior estimators and between the PEZLs; the moment estimators and the MLEs, which are consistent estimators of the hyperparameters; and the goodness of fit of the beta-negative binomial model to the simulated data. The simulations show that the MLEs are better than the moment estimators when estimating the hyperparameters, in terms of the goodness of fit of the model to the simulated data. However, the MLEs are very sensitive to the initial estimators, and the moment estimators usually prove to be good initial estimators.

In the real data section of [34], they consider an example of some insurance claim data, which are assumed to come from the beta-negative binomial model (31). They consider four cases to fit the real data. In the first case, they assume that *m* = 6 is known for illustrative purposes (of course, one can assume another known *m* value). In the other three cases, they assume that *m* is unknown and provide three approaches to handle this scenario. The first two approaches consider a range of *m* values, for instance, *m* = 1, 2, …, 20. The first approach is to maximize the log-likelihood function. The second approach is to maximize the p-value of the goodness of fit of the model (31) to the real data. The third approach is to determine the hyperparameters *α*, *β*, and *m* from Theorems 1 and 2 in [34] by the moment method and the MLE method, respectively, when *m* is unknown. Four tables, which show the number of claims, the observed frequencies, the expected probabilities, and the expected frequencies of the insurance claims data, are provided to illustrate the four cases.


#### **5. Inequalities among Bayesian posterior estimators**


In this section, we compare the six Bayesian estimators *δ*<sup>π</sup><sub>w2</sub>(*x*), *δ*<sup>π</sup><sub>pl</sub>(*x*), *δ*<sup>π</sup><sub>s</sub>(*x*), *δ*<sup>π</sup><sub>p</sub>(*x*), *δ*<sup>π</sup><sub>2</sub>(*x*), and *δ*<sup>π</sup><sub>z</sub>(*x*). The domains of the loss functions, the six Bayesian estimators, the PELs, and the smallest PELs are summarized in **Table 2** (see **Table 1** in [36]). The six PELs are the PEWSEL, PEPLL, PESL, PEPL, PESEL, and PEZL. In **Table 2**, each Bayesian estimator minimizes its corresponding PEL. Furthermore, the smallest PEL is the PEL evaluated at the corresponding Bayesian estimator.

For the six loss functions, we have the corresponding six Bayesian estimators *δ*<sup>π</sup><sub>w2</sub>(*x*), *δ*<sup>π</sup><sub>pl</sub>(*x*), *δ*<sup>π</sup><sub>s</sub>(*x*), *δ*<sup>π</sup><sub>p</sub>(*x*), *δ*<sup>π</sup><sub>2</sub>(*x*), and *δ*<sup>π</sup><sub>z</sub>(*x*). Interestingly, for these six Bayesian estimators, we discover three strings of inequalities, which are summarized in Theorem 1 (see Theorem 1 in [36]). To our surprise, an order between the two Bayesian estimators *δ*<sup>π</sup><sub>w2</sub>(*x*) and *δ*<sup>π</sup><sub>pl</sub>(*x*) on Θ = (0, 1) does not exist. It is worth noting that the three strings of inequalities depend only on the loss functions. Moreover, the inequalities are independent of the chosen models and the priors used, provided the Bayesian estimators exist; they therefore hold in a general setting, which makes them quite interesting.

It is easy to see that all six loss functions are well defined on $\Theta = (0, 1)$, and thus all six Bayesian estimators are well defined on $\Theta = (0, 1)$. Only four loss functions are defined on $\Theta = (0, \infty)$, since the power-log loss function and Zhang's loss function are defined only on $\Theta = (0, 1)$; hence, only four Bayesian estimators are well defined on $\Theta = (0, \infty)$. Moreover, only the weighted squared error loss function and the squared error loss function are defined on $\Theta = (-\infty, \infty)$, and therefore only two Bayesian estimators are well defined on $\Theta = (-\infty, \infty)$. Among the six Bayesian estimators, there exist three strings of inequalities, which are summarized in the following theorem.

Theorem 1 (Theorem 1 in [36]). *Assume the prior satisfies some regularity conditions so that the posterior expectations involved in the definitions of the six Bayesian estimators exist. Then for* $\Theta = (0, 1)$, *there exists a string of inequalities among the six Bayesian estimators*:

$$\max \left( \delta_{w2}^{\pi}(x), \delta_{pl}^{\pi}(x) \right) \le \delta_{s}^{\pi}(x) \le \delta_{p}^{\pi}(x) \le \delta_{2}^{\pi}(x) \le \delta_{z}^{\pi}(x).\tag{49}$$

*Moreover, for* $\Theta = (0, \infty)$*, there exists a string of inequalities among the four Bayesian estimators*:

$$
\delta_{w2}^{\pi}(x) \le \delta_{s}^{\pi}(x) \le \delta_{p}^{\pi}(x) \le \delta_{2}^{\pi}(x).\tag{50}
$$

*Finally, for* $\Theta = (-\infty, \infty)$*, there exists an inequality between the two Bayesian estimators*:

$$
\delta_{w2}^{\pi}(x) \le \delta_{2}^{\pi}(x).\tag{51}
$$
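Theorem 1 can be checked numerically. Under the standard posterior-moment forms of the four estimators on $(0, \infty)$, namely $\delta_{w2}^{\pi}(x) = E(\theta^{-1} \mid x)/E(\theta^{-2} \mid x)$, $\delta_{s}^{\pi}(x) = 1/E(\theta^{-1} \mid x)$, $\delta_{p}^{\pi}(x) = \sqrt{E(\theta \mid x)/E(\theta^{-1} \mid x)}$, and $\delta_{2}^{\pi}(x) = E(\theta \mid x)$, the string (50) follows from the covariance (Cauchy–Schwarz) inequality and can be verified on any positive Monte Carlo posterior sample. A minimal sketch (not the authors' code), with an arbitrary Gamma distribution standing in for the posterior $\pi(\theta \mid x)$:

```python
import random

# Monte Carlo sample from an arbitrary positive posterior;
# Gamma(shape=3, scale=2) stands in for pi(theta | x).
rng = random.Random(0)
theta = [rng.gammavariate(3.0, 2.0) for _ in range(100_000)]

def mean(v):
    return sum(v) / len(v)

# Posterior-moment forms of the four Bayesian estimators on (0, infinity):
d_w2 = mean([1 / t for t in theta]) / mean([1 / t**2 for t in theta])  # weighted squared error loss
d_s = 1 / mean([1 / t for t in theta])                                 # Stein's loss
d_p = (mean(theta) / mean([1 / t for t in theta])) ** 0.5              # power-power loss
d_2 = mean(theta)                                                      # squared error loss

# The string of inequalities (50): it holds exactly even for the
# empirical measure, by the Cauchy-Schwarz and Jensen inequalities.
assert d_w2 <= d_s <= d_p <= d_2
print(d_w2, d_s, d_p, d_2)
```

Swapping in any other positive posterior sample (with finite $E(\theta^{-2} \mid x)$) leaves the ordering intact, illustrating that the inequalities do not depend on the chosen model or prior.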

The proof of Theorem 1 relies on one key unified tool, the covariance inequality (see Theorem 4.7.9, p. 192, in [16]); the proof can be found in the supplement of [36].

It is worth noting that the six Bayesian estimators and the six smallest PELs are all functions of $\pi$, $x$, and the loss function. Because there exist three strings of inequalities among the six Bayesian estimators, one may wonder whether a string of inequalities also exists among the six smallest PELs, that is, among $PEWSEL_{w2}(\pi, x)$, $PEPLL_{pl}(\pi, x)$, $PESL_{s}(\pi, x)$, $PEPL_{p}(\pi, x)$, $PESEL_{2}(\pi, x)$, and $PEZL_{z}(\pi, x)$. The answer is no: numerical simulations of the smallest PELs exemplify this fact (see [36]).



*The Bayesian Posterior Estimators under Six Loss Functions for Unrestricted and Restricted… DOI: http://dx.doi.org/10.5772/intechopen.88587*

#### **6. Conclusions and discussions**

In this chapter, we have investigated six loss functions: the squared error loss function, the weighted squared error loss function, Stein's loss function, the power-power loss function, the power-log loss function, and Zhang's loss function. We now give some suggestions on the conditions for using each of the six loss functions. Among the six, the first two loss functions are defined on $\Theta = (-\infty, \infty)$ and penalize overestimation and underestimation equally on $(-\infty, \infty)$; thus we recommend using them when the parameter space is $(-\infty, \infty)$. The middle two loss functions are defined on $\Theta = (0, \infty)$ and penalize gross overestimation and gross underestimation equally on $(0, \infty)$; thus we recommend using them when the parameter space is $(0, \infty)$. In particular, if one prefers a loss function with balanced convergence rates, or penalties, as its argument becomes too large or too small, then we recommend the power-power loss function on $(0, \infty)$. The last two loss functions are defined on $\Theta = (0, 1)$ and penalize gross overestimation and gross underestimation equally on $(0, 1)$; thus we recommend using them when the parameter space is $(0, 1)$. In particular, under the same preference for balanced convergence rates or penalties, we recommend Zhang's loss function on $(0, 1)$.

For each of the six loss functions, we can find a corresponding Bayesian estimator that minimizes the corresponding posterior expected loss. Among the six Bayesian estimators, there exist three strings of inequalities, summarized in Theorem 1 (see also Theorem 1 in [36]). However, a string of inequalities among the six smallest PELs does not exist.

We summarize three hierarchical models where the unknown parameter of interest is $\theta \in \Theta = (0, \infty)$, that is, the hierarchical normal and inverse gamma model (9), the hierarchical Poisson and gamma model (10), and the hierarchical normal and normal-inverse-gamma model (11). In addition, we summarize two hierarchical models where the unknown parameter of interest is $\theta \in \Theta = (0, 1)$, that is, the beta-binomial model (30) and the beta-negative binomial model (31).
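For the beta-binomial model (30), the conjugate structure makes the posterior available in closed form: if $\theta \sim \mathrm{Beta}(\alpha, \beta)$ and $x \mid \theta \sim \mathrm{Binomial}(m, \theta)$, then $\theta \mid x \sim \mathrm{Beta}(\alpha + x, \beta + m - x)$. A minimal sketch (the numbers are illustrative, not from the chapter) computing the squared-error Bayesian estimator $\delta_{2}^{\pi}(x)$, the posterior mean:

```python
from fractions import Fraction

def beta_binomial_posterior(alpha, beta, m, x):
    # Conjugate update: theta | x ~ Beta(alpha + x, beta + m - x).
    return alpha + x, beta + m - x

def delta_2(alpha, beta, m, x):
    # Bayesian estimator under squared error loss: the posterior mean.
    a_post, b_post = beta_binomial_posterior(alpha, beta, m, x)
    return Fraction(a_post, a_post + b_post)

# Prior Beta(2, 3), m = 20 trials, x = 8 successes:
print(delta_2(2, 3, 20, 8))  # posterior mean (2 + 8) / (2 + 3 + 20) = 2/5
```

The other estimators on $(0, 1)$ are likewise posterior functionals of this Beta distribution, so they can be evaluated by the same conjugate update.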

Now we give some suggestions on the selection of the hyperparameters. One way to select the hyperparameters is through empirical Bayesian analysis, which relies on conjugate prior modeling: the hyperparameters are estimated from the observations, and the "estimated prior" is then used as a regular prior in the subsequent inference. The marginal distribution can then be used to recover the prior distribution from the observations. In empirical Bayesian analysis, two common methods are used to obtain estimators of the hyperparameters: the moment method and the MLE method. Numerical simulations show that the MLEs are better than the moment estimators when estimating the hyperparameters, in terms of the goodness of fit of the model to the simulated data. However, the MLEs are very sensitive to the initial estimators, and the moment estimators usually prove to be good initial estimators.
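As a sketch of this two-stage recipe (moment estimates feeding an MLE search), the following minimal example fits the beta-binomial hyperparameters by crude hill climbing; it is an illustration under assumed settings, not the chapter's implementation. The binomial coefficient, which is constant in $(\alpha, \beta)$, is dropped from the log-likelihood:

```python
import math
import random

def bb_loglik(a, b, xs, m):
    # Beta-binomial log-likelihood in (a, b) with m known,
    # up to an additive constant (binomial coefficients dropped).
    c = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
         - math.lgamma(m + a + b))
    return sum(math.lgamma(x + a) + math.lgamma(m - x + b) + c for x in xs)

def hill_climb_mle(xs, m, a0, b0, step=0.5, tol=1e-4):
    # Crude coordinate hill climbing started from the moment estimates.
    a, b, best = a0, b0, bb_loglik(a0, b0, xs, m)
    while step > tol:
        moves = [(a + step, b), (a - step, b), (a, b + step), (a, b - step)]
        cand = [(bb_loglik(na, nb, xs, m), na, nb)
                for na, nb in moves if na > 0 and nb > 0]
        ll, na, nb = max(cand)
        if ll > best:
            a, b, best = na, nb, ll   # accept the improving move
        else:
            step /= 2                 # shrink the search step
    return a, b

# Simulated data: theta_i ~ Beta(2, 3), x_i | theta_i ~ Binomial(20, theta_i).
rng = random.Random(1)
m, n = 20, 2000
xs = []
for _ in range(n):
    theta = rng.betavariate(2.0, 3.0)
    xs.append(sum(rng.random() < theta for _ in range(m)))

# Moment estimates as starting values (mean-variance matching).
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / (n - 1)
pi_hat = mean / m
rho_hat = (var / (m * pi_hat * (1 - pi_hat)) - 1) / (m - 1)
s_hat = 1 / rho_hat - 1  # implied alpha + beta

a_hat, b_hat = hill_climb_mle(xs, m, pi_hat * s_hat, (1 - pi_hat) * s_hat)
print(a_hat, b_hat)  # should land near the true (2.0, 3.0)
```

Starting the search at the moment estimates keeps the climb short and stable, which mirrors the observation above that moment estimators are good initial estimators for the MLE.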

#### **Acknowledgements**

The research was supported by the Fundamental Research Funds for the Central Universities (2019CDXYST0016; 2018CDXYST0024), China Scholarship Council (201606055028), National Natural Science Foundation of China (11671060), and MOE project of Humanities and Social Sciences on the west and the border area (14XJC910001).

| **Domain** | **Bayesian estimators** | **PELs** | **Smallest PELs** |
|---|---|---|---|
| $\Theta = (-\infty, \infty)$ | $\delta_{w2}^{\pi}(x) = \dfrac{E(1/\theta \mid x)}{E(1/\theta^{2} \mid x)}$ | $PEWSEL(\pi, a \mid x) = E[L_{w2}(\theta, a) \mid x] = E\left[\dfrac{(\theta - a)^{2}}{\theta^{2}} \mid x\right]$ | $PEWSEL_{w2}(\pi, x) = PEWSEL(\pi, a \mid x)\big|_{a = \delta_{w2}^{\pi}(x)}$ |
| $\Theta = (0, 1)$ | $\delta_{pl}^{\pi}(x)$, a function of $E_{pl,1}(x) = E\left[\dfrac{(1-\theta)^{2}}{\theta} \mid x\right] > 0$ (see [36]) | $PEPLL(\pi, a \mid x) = E[L_{pl}(\theta, a) \mid x]$ | $PEPLL_{pl}(\pi, x) = PEPLL(\pi, a \mid x)\big|_{a = \delta_{pl}^{\pi}(x)}$ |
| $\Theta = (0, \infty)$ | $\delta_{s}^{\pi}(x) = \dfrac{1}{E(1/\theta \mid x)}$ | $PESL(\pi, a \mid x) = E[L_{s}(\theta, a) \mid x] = E\left[\dfrac{a}{\theta} - \log\dfrac{a}{\theta} - 1 \mid x\right]$ | $PESL_{s}(\pi, x) = PESL(\pi, a \mid x)\big|_{a = \delta_{s}^{\pi}(x)}$ |
| $\Theta = (0, \infty)$ | $\delta_{p}^{\pi}(x) = \sqrt{\dfrac{E(\theta \mid x)}{E(1/\theta \mid x)}}$ | $PEPL(\pi, a \mid x) = E[L_{p}(\theta, a) \mid x] = E\left[\dfrac{a}{\theta} + \dfrac{\theta}{a} - 2 \mid x\right]$ | $PEPL_{p}(\pi, x) = PEPL(\pi, a \mid x)\big|_{a = \delta_{p}^{\pi}(x)}$ |
| $\Theta = (-\infty, \infty)$ | $\delta_{2}^{\pi}(x) = E(\theta \mid x)$ | $PESEL(\pi, a \mid x) = E[L_{2}(\theta, a) \mid x] = E[(\theta - a)^{2} \mid x]$ | $PESEL_{2}(\pi, x) = PESEL(\pi, a \mid x)\big|_{a = \delta_{2}^{\pi}(x)}$ |
| $\Theta = (0, 1)$ | $\delta_{z}^{\pi}(x)$, a function of $E_{z,1}(x) = E\left[\dfrac{(1-\theta)^{2}}{\theta^{3}} \mid x\right] > 0$ and $E_{z,2}(x) = E[\theta \mid x]$ (see [36]) | $PEZL(\pi, a \mid x) = E[L_{z}(\theta, a) \mid x]$ | $PEZL_{z}(\pi, x) = PEZL(\pi, a \mid x)\big|_{a = \delta_{z}^{\pi}(x)}$ |

**Table 2.**
*(Table 1 in [36]) The six Bayesian estimators, the PELs, and the smallest PELs.*

### **Conflict of interest**

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.


### **Author details**

Ying-Ying Zhang Department of Statistics and Actuarial Science, College of Mathematics and Statistics, Chongqing University, Chongqing, China

\*Address all correspondence to: robertzhangyying@qq.com; robertzhang@cqu.edu.cn

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**


[1] Robert CP. The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation. 2nd paperback ed. New York: Springer; 2007

[2] James W, Stein C. Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. 1961. pp. 361-380

[3] Brown LD. Inadmissibility of the usual estimators of scale parameters in problems with unknown location and scale parameters. The Annals of Mathematical Statistics. 1968;**39**:29-48

[4] Brown LD. Comment on the paper by maatta and casella. Statistical Science. 1990;**5**:103-106

[5] Parsian A, Nematollahi N. Estimation of scale parameter under entropy loss function. Journal of Statistical Planning and Inference. 1996;**52**:77-91

[6] Petropoulos C, Kourouklis S. Estimation of a scale parameter in mixture models with unknown location. Journal of Statistical Planning and Inference. 2005;**128**:191-218

[7] Oono Y, Shinozaki N. On a class of improved estimators of variance and estimation under order restriction. Journal of Statistical Planning and Inference. 2006;**136**:2584-2605

[8] Ye RD, Wang SG. Improved estimation of the covariance matrix under stein's loss. Statistics & Probability Letters. 2009;**79**:715-721

[9] Bobotas P, Kourouklis S. On the estimation of a normal precision and a normal variance ratio. Statistical Methodology. 2010;**7**:445-463

[10] Zhang YY. The bayes rule of the variance parameter of the hierarchical normal and inverse gamma model under stein's loss. Communications in Statistics-Theory and Methods. 2017;**46**: 7125-7133

[11] Zhang YY. The bayes rule of the positive restricted parameter under the power-power loss with an application. Communications in Statistics-Theory and Methods. 2019; Under review

[12] Zhang YY, Zhou MQ, Xie YH, Song WH. The bayes rule of the parameter in (0, 1) under the power-log loss function with an application to the beta-binomial model. Journal of Statistical Computation and Simulation. 2017;**87**:2724-2737

[13] Zhang YY, Xie YH, Song WH, Zhou MQ. The bayes rule of the parameter in (0, 1) under zhang's loss function with an application to the betabinomial model. Communications in Statistics-Theory and Methods. 2019. DOI: 10.1080/03610926.2019.1565840

[14] Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Annals of the Institute of Statistical Mathematics. 1964;**16**:155-160

[15] Maatta JM, Casella G. Developments in decision-theoretic variance estimation. Statistical Science. 1990;**5**: 90-120

[16] Casella G, Berger RL. Statistical Inference. 2nd ed. USA: Duxbury; 2002

[17] Lehmann EL, Casella G. Theory of Point Estimation. 2nd ed. New York: Springer; 1998

[18] Raiffa H, Schlaifer R. Applied Statistical Decision Theory. Cambridge: Harvard University Press; 1961

[19] Deely JJ, Lindley DV. Bayes empirical bayes. Journal of the

American Statistical Association. 1981; **76**:833-841

[20] Zhang YY, Wang ZY, Duan ZM, Mi W. The empirical bayes estimators of the parameter of the poisson distribution with a conjugate gamma prior under stein's loss function. Journal of Statistical Computation and Simulation. 2019. DOI: 10.1080/ 00949655.2019.1652606

[21] Mao SS, Tang YC. Bayesian Statistics. 2nd ed. Beijing: China Statistics Press; 2012

[22] Chen MH. Bayesian statistics lecture. Statistics Graduate Summer School. China: School of Mathematics and Statistics; Northeast Normal University: Changchun; 2014

[23] Xie YH, Song WH, Zhou MQ, Zhang YY. The bayes posterior estimator of the variance parameter of the normal distribution with a normal-inverse gamma prior under stein's loss. Chinese Journal of Applied Probability and Statistics. 2018;**34**: 551-564

[24] Dey D, Srinivasan C. Estimation of a covariance matrix under stein's loss. The Annals of Statistics. 1985;**13**: 1581-1591

[25] Sheena Y, Takemura A. Inadmissibility of non-order-preserving orthogonally invariant estimators of the covariance matrix in the case of stein's loss. Journal of Multivariate Analysis. 1992;**41**:117-131

[26] Konno Y. Estimation of a normal covariance matrix with incomplete data under stein's loss. Journal of Multivariate Analysis. 1995;**52**:308-324

[27] Konno Y. Estimation of normal covariance matrices parametrized by irreducible symmetric cones under stein's loss. Journal of Multivariate Analysis. 2007;**98**:295-316

[28] Sun XQ, Sun DC, He ZQ. Bayesian inference on multivariate normal covariance and precision matrices in a star-shaped model with missing data. Communications in Statistics-Theory and Methods. 2010;**39**:642-666

[29] Ma TF, Jia LJ, Su YS. A new estimator of covariance matrix. Journal of Statistical Planning and Inference. 2012;**142**:529-536

[30] Xu K, He DJ. Further results on estimation of covariance matrix. Statistics & Probability Letters. 2015; **101**:11-20

[31] Tsukuma H. Estimation of a highdimensional covariance matrix with the stein loss. Journal of Multivariate Analysis. 2016;**148**:1-17

[32] Singh SK, Singh U, Sharma VK. Expected total test time and Bayesian estimation for generalized lindley distribution under progressively type-ii censored sample where removals follow the beta-binomial probability law. Applied Mathematics and Computation. 2013;**222**:402-419

[33] Luo R, Paul S. Estimation for zeroinflated beta-binomial regression model with missing response data. Statistics in Medicine. 2018;**37**:3789-3813

[34] Zhou MQ, Zhang YY, Sun Y, Sun J. The empirical bayes estimators of the probability parameter of the betanegative binomial model under Zhang's loss function. Computational Statistics and Data Analysis. 2019; Under review

[35] Danaher PJ. A markov mixture model for magazine exposure. Journal of the American Statistical Association. 1989;**84**:922-926

[36] Zhang YY, Xie YH, Song WH, Zhou MQ. Three strings of inequalities among six bayes estimators. Communications in Statistics-Theory and Methods. 2018;**47**:1953-1961


### *Edited by Niansheng Tang*

Due to great applications in various fields, such as social science, biomedicine, genomics, and signal processing, and the improvement of computing ability, Bayesian inference has made substantial developments for analyzing complicated data. This book introduces key ideas of Bayesian sampling methods, Bayesian estimation, and selection of the prior. It is structured around topics on the impact of the choice of the prior on Bayesian statistics, some advances on Bayesian sampling methods, and Bayesian inference for complicated data including breast cancer data, cloud-based healthcare data, gene network data, and longitudinal data. This volume is designed for statisticians, engineers, doctors, and machine learning researchers.

Published in London, UK © 2020 IntechOpen © spainter\_vfx / iStock

Bayesian Inference on Complicated Data
