**2. Bayesian estimation**

For statistical models, Bayesian estimates are usually obtained from the posterior distribution via Bayes' theorem. In general, Bayesian estimation covers both parameters and nonparametric functions. For parametric Bayesian estimation, we first specify a prior distribution for the parameter and then evaluate its posterior mean or median (i.e., the Bayesian estimate of the parameter) from the posterior distribution, depending on whether the quadratic or mean-absolute loss function is used. For nonparametric Bayesian estimation, we first approximate the nonparametric function by some proper method such as B-splines or P-splines [2], i.e., we parameterize the nonparametric function, which leads to a parametric model; we then apply the Bayesian machinery of parametric models to evaluate Bayesian estimates of the parameters and the nonparametric function. In what follows, we introduce how to evaluate Bayesian estimates of parameters or nonparametric functions in a relatively complicated model (e.g., a random effects model or latent variable model).
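
As a minimal sketch of the parameterization step, the snippet below builds a B-spline basis with SciPy so that an unknown smooth function $g$ is represented as $g(x) \approx B(x)\gamma$; the knot placement, degree, and the helper name `bspline_basis` are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at points x.
    knots: interior knots; boundary knots are repeated to get a full basis."""
    t = np.concatenate(([x.min()] * (degree + 1), knots, [x.max()] * (degree + 1)))
    n_basis = len(t) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = BSpline(t, c, degree, extrapolate=False)(x)
    return np.nan_to_num(B)   # guard against NaN at the boundary

# g(x) ≈ B(x) @ gamma turns the nonparametric function g into a
# finite-dimensional coefficient vector gamma, so parametric Bayesian
# machinery (a prior on gamma, posterior sampling) applies directly.
x = np.linspace(0, 1, 200)
B = bspline_basis(x, knots=np.linspace(0.1, 0.9, 9))
gamma = np.random.default_rng(0).normal(0, 1, B.shape[1])  # one prior draw
g_draw = B @ gamma                                          # realization of g
```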

In a latent variable model with missing response data, we assume the following form:

$$Y_i = X_i \beta + \Lambda \omega_i + \epsilon_i, \quad \eta_i = \Pi \eta_i + \Gamma \xi_i + \varepsilon_i, \quad i = 1, 2, \dots, n,\tag{1}$$

where $Y_i$ is a $p \times 1$ vector of manifest variables including continuous and categorical variables [3], $X_i$ is a $p \times q$ matrix of covariates, $\omega_i$ is an $r \times 1$ vector of latent variables, $\epsilon_i$ and $\varepsilon_i$ are $p \times 1$ and $r_1 \times 1$ vectors of measurement errors, respectively, $\beta$ is a $q \times 1$ vector of unknown parameters, $\Lambda$ is a $p \times r$ factor loading matrix, and $\omega_i = (\eta_i^T, \xi_i^T)^T$, in which $\eta_i$ and $\xi_i$ are the $r_1 \times 1$ and $r_2 \times 1$ sub-vectors of $\omega_i$ with $r_1 + r_2 = r$, and $\Pi$ and $\Gamma$ are $r_1 \times r_1$ and $r_1 \times r_2$ matrices of unknown parameters. It is also assumed that $|I - \Pi| \neq 0$, that $\xi_i$ follows a normal distribution or an unknown distribution, and that the $y_{ij}$'s are subject to missingness, where $I$ is the $r_1 \times r_1$ identity matrix and $y_{ij}$ is the $j$th component of $Y_i$ for $i = 1, \dots, n$, $j = 1, \dots, p$.
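
To make the model concrete, the following sketch simulates data from Eq. (1) under assumed dimensions and parameter values (all numbers below are illustrative assumptions); it uses the reduced form $\eta_i = (I - \Pi)^{-1}(\Gamma \xi_i + \varepsilon_i)$, which exists because $|I - \Pi| \neq 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r1, r2 = 500, 6, 2, 1, 2
r = r1 + r2

# Hypothetical true values, chosen only to make the sketch runnable.
beta = np.array([1.0, -0.5])              # q x 1 regression coefficients
Lam  = rng.normal(0, 0.5, (p, r))         # factor loading matrix Λ (p x r)
Pi   = np.array([[0.0]])                  # Π (r1 x r1); |I - Π| != 0 holds
Gam  = np.array([[0.6, -0.4]])            # Γ (r1 x r2)

Y = np.zeros((n, p))
for i in range(n):
    X_i = rng.normal(size=(p, q))                    # covariate matrix X_i
    xi  = rng.normal(size=r2)                        # exogenous latent ξ_i
    # Structural equation: η_i = Π η_i + Γ ξ_i + ε_i
    #   =>  η_i = (I - Π)^{-1} (Γ ξ_i + ε_i)
    eps_s = rng.normal(scale=0.3, size=r1)
    eta = np.linalg.solve(np.eye(r1) - Pi, Gam @ xi + eps_s)
    omega = np.concatenate([eta, xi])                # ω_i = (η_i^T, ξ_i^T)^T
    # Measurement equation: Y_i = X_i β + Λ ω_i + ϵ_i
    Y[i] = X_i @ beta + Lam @ omega + rng.normal(scale=0.5, size=p)
```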

In general, a simple and standard assumption is that the distributions of $\epsilon_i$, $\varepsilon_i$, and $\xi_i$ belong to some parametric density family, such as the skew-normal, skew-t, skew-normal-Cauchy, or skew-symmetric-Laplace distribution [2], an exponential family distribution [3], a normal distribution [4], an autoregressive series with unequal time spacings, or their mixtures. But this assumption may be unreasonable or too restrictive. To this end, some alternative methods have been developed to relax the parametric distributional assumption. For example, one can let $\varepsilon_i$ or $\xi_i$ follow an unknown distribution, which is approximated by a Dirichlet process prior [2], a spiked Dirichlet process prior [5], or a truncated centered Dirichlet process prior [6]; a class of smooth densities approximated by the semiparametric approach of Gallant and Nychka [7]; an unknown distribution with its quantiles specified, leading to the well-known quantile regression models; or a Bayesian neural network approach to learning the unknown distribution [8].
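
As an illustration of the Dirichlet process route, the sketch below draws an approximate random distribution $G \sim \mathrm{DP}(\alpha, G_0)$ via Sethuraman's stick-breaking construction truncated at $K$ atoms; the base measure $N(0,1)$, $\alpha = 2$, and $K = 50$ are illustrative assumptions.

```python
import numpy as np

def stick_breaking_dp(alpha, base_draw, K, rng):
    """Draw an approximate G ~ DP(alpha, G0) using Sethuraman's
    stick-breaking construction truncated at K atoms."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                   # force the weights to sum to one
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = base_draw(K)                          # atoms drawn from the base measure G0
    return w, atoms

rng = np.random.default_rng(1)
# Base measure G0 = N(0, 1); alpha controls how closely G tracks G0.
w, atoms = stick_breaking_dp(alpha=2.0,
                             base_draw=lambda k: rng.normal(0, 1, k),
                             K=50, rng=rng)
# A draw of ξ_i from the (random) distribution G:
xi_i = rng.choice(atoms, p=w)
```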

To introduce missing data, let $\delta_{ij}$ be an indicator of the missing status of $y_{ij}$, i.e., $\delta_{ij} = 1$ if $y_{ij}$ is observed and $\delta_{ij} = 0$ if $y_{ij}$ is missing. In this case, we usually need to assume a missingness data mechanism, for example, missing completely at random, missing at random, or missing not at random (also called nonignorable missingness), and then specify a parametric or semiparametric model for the considered missingness data mechanism. For example, see Lee and Tang [4] for the logistic regression model, Kim and Yu [9] and Tang, Zhao, and Zhu [10] for the exponential tilting model, and Wang and Tang [11] for the probit regression model together with latent variables; a sketch of a logistic mechanism is given below. Also, one can train a missingness data mechanism model via a data-driven machine learning method [12], which is a completely new and not yet studied topic, including its implementation, algorithms, and theory. We are working on this new and promising topic, which may lead to a new research field.
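
The sketch below simulates a nonignorable (MNAR) logistic missingness mechanism in the spirit of the logistic regression model of Lee and Tang [4]; the mechanism parameters $(\phi_0, \phi_1)$ and their values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical mechanism parameters ϑ3 = (phi0, phi1); phi1 != 0 makes the
# probability of observing y_ij depend on the (possibly missing) value
# itself, i.e., missing not at random (nonignorable).
phi0, phi1 = 1.0, -0.8

Y = rng.normal(size=(500, 6))                 # complete responses (for illustration)
prob_obs = sigmoid(phi0 + phi1 * Y)           # P(δ_ij = 1 | y_ij)
delta = rng.binomial(1, prob_obs)             # missingness indicators δ_ij
Y_observed = np.where(delta == 1, Y, np.nan)  # observed data, NaN where missing
```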

To make Bayesian inference on the considered model (1), we need to specify a prior distribution for the unknown parameters, or for the coefficients used in approximating unknown nonparametric functions. A standard assumption is that the unknown parameters follow some proper parametric distribution family, such as the normal, gamma, inverse gamma, inverse Gaussian, Wishart, or Beta distribution, whose hyperparameters are prespecified by the user. Misspecification or improper application of these priors may lead to unreasonable, even misleading, parameter estimates. Moreover, Bayesian inference based on these assumptions does not utilize historical data, which limits its popularity in that the use of historical data may improve the efficiency of parameter estimation. To address this issue, some relaxed priors have been considered, for example, the power prior, g-prior, normalized power prior [13], calibrated power prior, dynamic power prior, power-expected-posterior prior, and scale transformed power prior [14]. For a high-dimensional sparse parametric model, we can assume a spike and slab prior for the parameter, which can be hierarchically expressed as a mixture of a normal distribution and an exponential distribution.
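
To illustrate how historical data enter through a power prior, the sketch below works out the conjugate normal-mean case, where $\pi(\theta \mid D_0, a_0) \propto L(\theta \mid D_0)^{a_0}\,\pi_0(\theta)$ remains normal; the historical data, $\pi_0 = N(\mu_0, \tau_0^2)$, and the grid of $a_0$ values are illustrative assumptions.

```python
import numpy as np

def power_prior_normal(y_hist, sigma2, a0, mu0=0.0, tau2=1e6):
    """Power prior for a normal mean θ with known variance sigma2:
    π(θ | D0, a0) ∝ L(θ | D0)^{a0} π0(θ), with π0 = N(mu0, tau2).
    Raising the historical likelihood to a0 in [0, 1] discounts the
    historical data D0; the result is again a normal distribution."""
    n0 = len(y_hist)
    prec = 1.0 / tau2 + a0 * n0 / sigma2           # combined precision
    mean = (mu0 / tau2 + a0 * n0 * np.mean(y_hist) / sigma2) / prec
    return mean, 1.0 / prec                        # prior mean and variance of θ

# a0 = 0 ignores the history (back to π0); a0 = 1 pools it at full weight.
y_hist = np.random.default_rng(3).normal(2.0, 1.0, size=40)
for a0 in (0.0, 0.3, 1.0):
    m, v = power_prior_normal(y_hist, sigma2=1.0, a0=a0)
    print(f"a0={a0:.1f}: power prior N({m:.3f}, {v:.4f})")
```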

Let $\vartheta_1$ be the set of unknown parameters associated with the distribution of $\epsilon_i$, $\vartheta_2$ the set of unknown parameters associated with the distributions of $\varepsilon_i$ and $\xi_i$, and $\vartheta_3$ the set of unknown parameters associated with the distribution of $\delta_i$, and denote $\vartheta = \{\vartheta_1, \vartheta_2, \vartheta_3\}$. Let $\theta$ be the set of unknown parameters in $\{\beta, \Lambda, \Pi, \Gamma, \vartheta\}$. Denote $Y = \{Y_i, i = 1, \dots, n\}$, $Y_{\text{obs}} = \{Y_{i,\text{obs}}, i = 1, \dots, n\}$, $Y_{\text{mis}} = \{Y_{i,\text{mis}}, i = 1, \dots, n\}$, $X = \{X_i, i = 1, \dots, n\}$, and $F = \{\omega_i, i = 1, \dots, n\}$, where $Y_{i,\text{mis}}$ and $Y_{i,\text{obs}}$ are the sub-vectors of $Y_i$ corresponding to missing and observed values, respectively. Let $D = \{Y_{\text{obs}}, X\}$ and $\delta = \{\delta_{ij}, i = 1, \dots, n, j = 1, \dots, p\}$. The marginal posterior distribution of $\theta$ given the dataset $D$ is

$$\pi(\theta|D) = \int \pi(\theta, F, Y_{\text{mis}}, \delta | D)\, dF\, dY_{\text{mis}}\, d\delta = \frac{\int \pi(\theta)\,\pi(Y|\theta, F, X)\,\pi(F|\vartheta_2)\,\pi(\delta|Y, \vartheta_3)\, dF\, dY_{\text{mis}}\, d\delta}{\pi(Y_{\text{obs}}|X)}, \tag{2}$$

and its posterior mean (i.e., the Bayesian estimate) can be evaluated as

$$\hat{\theta} = E(\theta|D) = \int \theta\, \pi(\theta|D)\, d\theta = \frac{\int \theta\, \pi(\theta)\,\pi(Y|\theta, F, X)\,\pi(F|\vartheta_2)\,\pi(\delta|Y, \vartheta_3)\, dF\, dY_{\text{mis}}\, d\delta\, d\theta}{\pi(Y_{\text{obs}}|X)},\tag{3}$$

where $\pi(\theta)$ is the prior distribution of $\theta$, $\pi(Y|\theta, F, X)$ is the probability density function of $Y$ given $(X, \beta, \Lambda, \vartheta_1)$, i.e., the likelihood function of $\theta$ associated with the considered latent variable model, $\pi(F|\vartheta_2) = \prod_{i=1}^{n} \pi(\eta_i|\xi_i, \Pi, \Gamma, \vartheta_2)\,\pi(\xi_i|\vartheta_2)$ is the probability density of $F$, $\pi(Y_{\text{obs}}|X) = \int \pi(\theta)\,\pi(Y|\theta, F, X)\,\pi(F|\vartheta_2)\,\pi(\delta|Y, \vartheta_3)\, dF\, dY_{\text{mis}}\, d\delta\, d\theta$ is the marginal likelihood of $Y_{\text{obs}}$ given $X$, and $\pi(\delta|Y, \vartheta_3)$ is the probability density of $\delta$ given $(Y, \vartheta_3)$.

From Eq. (3), it is easily seen that evaluating $\hat{\theta}$ analytically is almost impossible due to the high-dimensional integrals involved. To address this issue, the well-known Markov chain Monte Carlo (MCMC) algorithm is employed to approximate $\hat{\theta}$ by sequentially drawing observations from the posterior distributions of the components of $\theta$ and $(F, Y_{\text{mis}}, \delta)$ via the Gibbs sampler together with the Metropolis-Hastings algorithm. Denote by $\pi(\theta|Y, F, X, \delta)$, $\pi(Y_{\text{mis}}|Y_{\text{obs}}, F, X, \theta, \delta)$, and $\pi(F|Y, \theta, X)$ the conditional distributions of $\theta$ given $(Y, F, X, \delta)$, of $Y_{\text{mis}}$ given $(Y_{\text{obs}}, F, X, \theta, \delta)$, and of $F$ given $(Y, \theta, X)$, respectively. The Gibbs sampler is implemented as follows. At the $t$th iteration, with current values $\{\theta^{(t)}, F^{(t)}, Y_{\text{mis}}^{(t)}, \delta^{(t)}\}$ of $\{\theta, F, Y_{\text{mis}}, \delta\}$, we sequentially draw i) $\beta^{(t+1)}$ from the conditional distribution $\pi(\beta|Y, F, X, \Lambda, \vartheta_1)$; ii) $\Lambda^{(t+1)}$ from $\pi(\Lambda|Y, F, X, \beta, \vartheta_1)$; iii) $\Pi^{(t+1)}$ from $\pi(\Pi|F, \Gamma, \vartheta_2)$; iv) $\Gamma^{(t+1)}$ from $\pi(\Gamma|F, \Pi, \vartheta_2)$; v) $\vartheta_1^{(t+1)}$ from $\pi(\vartheta_1|Y, X, \beta, \Lambda, F)$; vi) $\vartheta_2^{(t+1)}$ from $\pi(\vartheta_2|F, \Pi, \Gamma)$; vii) $Y_{\text{mis}}^{(t+1)}$ from $\pi(Y_{\text{mis}}|Y_{\text{obs}}, \delta, X, \beta, \Lambda, F, \vartheta_1)$; and viii) $F^{(t+1)}$ from $\pi(F|Y, X, \theta)$. These conditional distributions may be familiar distributions from which observations can be drawn directly. In some cases, however, they are unfamiliar and rather complicated distributions from which observations can only be drawn indirectly. In that case, some alternative approaches, for example, the Metropolis-Hastings algorithm, rejection sampling, acceptance-rejection sampling, importance sampling, hybrid-jump-based sampling, and reversible jump sampling, can be employed to sample observations from these complicated distributions (a toy sketch follows below). The convergence of the Gibbs sampler introduced above can be monitored by the estimated potential scale reduction (EPSR) values of the parameters [15], which are evaluated continuously as the iterations run; the Gibbs sampler is taken to have converged if the EPSR values of all unknown parameters are less than 1.2. Alternatively, convergence can be assessed by plotting, against the iteration number, several parallel sequences of observations generated from different starting values of the unknown parameters.
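
The following toy sketch illustrates MH-within-Gibbs and EPSR monitoring on a deliberately simple two-parameter model (a normal mean drawn from its conjugate conditional, and a variance updated by a random-walk Metropolis-Hastings step on the log scale); the model, proposal scale, chain lengths, and starting values are all illustrative assumptions and not the sampler for model (1) itself.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.5, size=100)    # toy data standing in for model (1)
n = len(y)

def gibbs_mh(n_iter, mu_init, s2_init, seed):
    """MH-within-Gibbs for (mu, sigma2): a toy stand-in for the
    eight-step sampler described in the text."""
    rng = np.random.default_rng(seed)
    mu, s2 = mu_init, s2_init
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Gibbs step: mu | sigma2, y is normal (a familiar conditional).
        mu = rng.normal(y.mean(), np.sqrt(s2 / n))
        # MH step: sigma2 | mu, y via a random walk on log(sigma2)
        # (stands in for an unfamiliar/complicated conditional).
        def log_post(v):      # log p(sigma2 | mu, y) + const, prior ∝ 1/sigma2
            return -0.5 * n * np.log(v) - np.sum((y - mu) ** 2) / (2 * v) - np.log(v)
        prop = s2 * np.exp(0.3 * rng.normal())
        # log(prop/s2) is the log proposal ratio of the log-scale random walk
        if np.log(rng.uniform()) < log_post(prop) - log_post(s2) + np.log(prop / s2):
            s2 = prop
        draws[t] = mu, s2
    return draws

# Several parallel chains from dispersed starting values, burn-in discarded.
chains = np.stack([gibbs_mh(2000, mu0, s20, seed)[1000:]
                   for seed, (mu0, s20) in enumerate([(-5, 0.1), (0, 1.0), (5, 10.0)])])

def epsr(chains):
    """Estimated potential scale reduction (Gelman-Rubin) per parameter;
    chains has shape (n_chains, n_draws, n_params)."""
    m, n_draws, _ = chains.shape
    W = chains.var(axis=1, ddof=1).mean(axis=0)            # within-chain variance
    B = n_draws * chains.mean(axis=1).var(axis=0, ddof=1)  # between-chain variance
    var_hat = (n_draws - 1) / n_draws * W + B / n_draws
    return np.sqrt(var_hat / W)

print("EPSR:", epsr(chains))   # values below 1.2 suggest convergence
```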

Let $\{\theta^{(m)}, m = 1, \dots, M\}$, $\{F^{(m)}, m = 1, \dots, M\}$, and $\{Y_{\text{mis}}^{(m)}, m = 1, \dots, M\}$ be $M$ observations sampled from their corresponding conditional distributions via the aforementioned Gibbs sampler after it has converged. Bayesian estimates of $\theta$, $F$, and $Y_{\text{mis}}$ can then be computed as

$$\hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \theta^{(m)}, \quad \hat{F} = \frac{1}{M} \sum_{m=1}^{M} F^{(m)}, \quad \hat{Y}_{\text{mis}} = \frac{1}{M} \sum_{m=1}^{M} Y_{\text{mis}}^{(m)}, \tag{4}$$

respectively. Their corresponding standard deviations can be computed from the sample covariance matrices of the drawn observations (a short computational sketch is given after this paragraph); the details can be found in the literature [7]. The Bayesian inference described above is a classical method. However, for a high-dimensional parametric or nonparametric model, new approaches are needed to address the computing time, efficiency, and stability of the algorithm. In fact, when the dimension of the covariate matrix is large and the sample size is relatively small, i.e., the well-known "large $p$ and small $n$" problem, the Gibbs sampler is computationally expensive and has poor stability.
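
Returning to Eq. (4), the posterior summaries are straightforward to compute from the stored draws; in the sketch below, `theta_draws` stands for the $M$ post-convergence draws $\{\theta^{(m)}\}$, simulated here purely as a placeholder.

```python
import numpy as np

# Placeholder draws standing in for the M post-convergence Gibbs draws,
# stored row-wise (one row per draw of θ).
M, dim = 4000, 3
theta_draws = (np.random.default_rng(5).normal(size=(M, dim))
               * [0.1, 0.2, 0.05] + [1.0, -0.5, 2.0])

theta_hat = theta_draws.mean(axis=0)            # posterior mean, Eq. (4)
post_cov  = np.cov(theta_draws, rowvar=False)   # sample covariance of the draws
post_sd   = np.sqrt(np.diag(post_cov))          # posterior standard deviations
print("theta_hat:", theta_hat)
print("posterior sd:", post_sd)
```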

To solve this issue for a high-dimensional regression model, some novel approaches have been developed for parameter/nonparametric function estimation in the Bayesian framework, for example, the Bayesian Lasso, Bayesian adaptive Lasso, Bayesian elastic net, and Bayesian $L_{1/2}$. These approaches can be utilized to estimate model parameters or nonparametric functions and simultaneously select important variables.
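
As one concrete instance, the sketch below implements a stripped-down Gibbs sampler for the Bayesian Lasso in the spirit of Park and Casella (2008), writing the Laplace prior on $\beta$ as a scale mixture of normals; fixing $\sigma^2 = 1$, the choice of $\lambda$, and the toy data are simplifying assumptions made for illustration.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=3000, seed=0):
    """Gibbs sampler for the Bayesian Lasso (Park & Casella style):
    a Laplace prior on β expressed as a normal scale mixture with
    Exp(lam^2/2) mixing; σ² is fixed at 1 here for brevity (the full
    sampler also updates σ²)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    inv_tau2 = np.ones(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # β | τ², y ~ N(A^{-1} X^T y, A^{-1}), with A = X^T X + diag(1/τ²)
        A = XtX + np.diag(inv_tau2)
        L = np.linalg.cholesky(np.linalg.inv(A))
        beta = np.linalg.solve(A, Xty) + L @ rng.normal(size=p)
        # 1/τ_j² | β ~ InverseGaussian(mean = lam/|β_j|, shape = lam²)
        inv_tau2 = rng.wald(lam / np.abs(beta), lam**2)
        draws[t] = beta
    return draws[n_iter // 2:]                   # discard burn-in

# Sparse toy problem: only two of ten coefficients are nonzero.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = X @ np.r_[2.0, -1.5, np.zeros(8)] + rng.normal(size=100)
draws = bayesian_lasso_gibbs(X, y, lam=2.0)
print("posterior means:", draws.mean(axis=0).round(2))  # near-zero entries shrink
```

The shrinkage of the near-zero posterior means toward zero is what makes these priors usable for variable selection as well as estimation.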
