**2. Autoencoders and variants**

**2.1. Autoencoder and regularized variants**

The variational autoencoder has a close relationship with the autoencoder. An autoencoder is a neural network that consists of an encoder and a decoder. The encoder maps its input into a representation, and the decoder reconstructs the input from that representation; that is, a perfect autoencoder can reproduce the training data approximately by being forced to prioritize those aspects of the input that are helpful for reconstruction and to discard the others. In this regard, the autoencoder learns the useful properties of the training data. Comparatively, VAE shares the same character with AE besides some specialties of its own.

An autoencoder can be used to extract useful features from the encoder output. Generally, in view of the feature dimension, autoencoders fall into two categories: undercomplete and overcomplete. Undercomplete means the dimension of the feature is less than that of the input, and more salient features can be learned well in this scenario. Conversely, in the overcomplete case, the dimension of the feature is greater than that of the input, and sparser features might be drawn in this setting. Additionally, the objective function is another core topic for an autoencoder. It is designed to give the autoencoder capabilities such as linear regression or logistic regression, which limit the model to some useful properties of the training data. The general form of the objective function can be depicted as follows:

$$\tilde{J}(X, \theta) = J(X; \theta) + \alpha\,\Omega(\theta). \tag{1}$$

where X is the training data for a given autoencoder, θ = {W<sup>e</sup>, b<sup>e</sup>, W<sup>d</sup>, b<sup>d</sup>} are the parameters of the model, and *α* is a nonnegative hyperparameter that controls the weight of the penalty term Ω relative to the standard objective function J. Numerically, setting *α* to 0 means no regularization, and larger values of *α* result in more regularization. Conceptually, an autoencoder with a penalty term is usually called a regularized autoencoder; it is encouraged to have small derivatives of the representation, which makes convergence faster than for autoencoders trained without any regularization.
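As a toy illustration of Eq. (1), the following sketch, assuming PyTorch, a mean-squared reconstruction loss for J, and an L2 penalty for Ω (all illustrative choices, not the chapter's), builds the regularized objective for a one-layer encoder and decoder:

```python
import torch
import torch.nn as nn

# Sketch of Eq. (1): reconstruction loss J plus a weighted penalty alpha * Omega(theta).
# Layer sizes and the L2 choice of Omega are illustrative assumptions.
encoder = nn.Linear(784, 64)   # W_e, b_e
decoder = nn.Linear(64, 784)   # W_d, b_d
alpha = 1e-4                   # nonnegative regularization weight

def objective(x):
    recon = decoder(torch.relu(encoder(x)))
    J = ((recon - x) ** 2).mean()                     # standard objective J(X; theta)
    Omega = sum(p.pow(2).sum() for p in
                list(encoder.parameters()) + list(decoder.parameters()))
    return J + alpha * Omega                          # regularized objective, Eq. (1)
```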

Varied forms of regularizer terms give the autoencoder different properties and bring us different variants of the regularized autoencoder. These variants primarily include the sparse autoencoder (SAE), denoising autoencoder (DAE) [3], contractive autoencoder (CAE), and variational autoencoder (VAE). Theoretically, the VAE combines variational inference (VI) and neural networks. As a generative model, one of the prominent successes of the VAE is that it realizes effective random sampling using back-propagation (BP) technology. This will be described in detail in Section 3.

Different from the VAE, the SAE makes the majority of the neurons in its hidden layers inactive, since the activation functions on these neurons are easily saturated for most inputs. This results in the sparsity of features, where many of the elements of the features are zero (or close to zero). Mathematically, the sparsity of the SAE is accomplished by the penalty term KL(p̃‖p), which penalizes deviation of the average hidden activation p̃ from a small target activation p.
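Reading p̃ and p as Bernoulli parameters, the penalty can be sketched as below; the target activation p = 0.05 and the assumption of sigmoid-range activations in [0, 1] are illustrative choices, not the chapter's:

```python
import torch

# Sketch of the SAE sparsity penalty KL(p_tilde || p): per-unit Bernoulli KL
# between the batch-average activation p_tilde and a small target p.
def sparsity_penalty(hidden, p=0.05):
    pt = hidden.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # p_tilde: average activation per unit
    return torch.sum(pt * torch.log(pt / p)
                     + (1 - pt) * torch.log((1 - pt) / (1 - p)))
```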

**3. Variational inference and variational autoencoder**

As the central problem in inference, posterior distribution computation faces two computational challenges: marginal likelihood computation and predictive distribution computation. Both are intractable, since they often require computing high-dimensional integrals. Therefore, approximate inference approaches such as Gibbs sampling, based on the Markov chain Monte Carlo (MCMC) principle, are appealing. However, Gibbs sampling and its variants are often excluded from some applications because of their inefficiency, especially in high-dimensional scenarios. This awkward situation did not change until the VAE was proposed [36]. To get an understanding of the VAE, we first start from the relevant foundations, including variational inference (VI), the evidence lower bound (ELBO), mean field, and the Kullback–Leibler (KL) divergence.

To describe the problem mathematically, let X = {*x*1, *x*2, …, *x*N} be a set of N observations and Z = {z1, z2, …, zm} be the m latent variables. P(Z, X; θ) denotes the joint distribution of X and Z given the parameters θ of the model. P(*X*|*Z*) and P(*Z*|*X*) are called the likelihood of *Z* and the posterior distribution of *Z*, respectively.

#### **3.1. Variational inference**

Theoretically, the motivation of variational inference [33, 35] is to find a tractable distribution that approximates the desired posterior distribution, which is itself intractable. To measure how close these two distributions are, the Kullback–Leibler (KL) divergence [34] is introduced. Let *P*(*X*) and *Q*(*X*) denote two different distributions of the continuous random variable *X*; their KL divergence is defined as:


$$\mathrm{KL}\left(Q \,\|\, P\right) = \int Q(X)\,\log \frac{Q(X)}{P(X)}\, dX = E_{Q(X)}\!\left[\log \frac{Q(X)}{P(X)}\right]. \tag{2}$$
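As a concrete check of Eq. (2), the following sketch approximates KL(Q‖P) for two one-dimensional Gaussians on a grid and compares it with the known closed form; the grid bounds and the Gaussian parameters are arbitrary choices for illustration:

```python
import numpy as np

# Numerical illustration of Eq. (2): KL(Q||P) for two 1-D Gaussians,
# approximated by summing q * log(q/p) over a discrete grid.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

q = gaussian_pdf(x, mu=0.0, sigma=1.0)   # Q(X)
p = gaussian_pdf(x, mu=1.0, sigma=2.0)   # P(X)

kl_numeric = np.sum(q * np.log(q / p)) * dx

# Closed form for Gaussians: log(s2/s1) + (s1^2 + (m1-m2)^2)/(2*s2^2) - 1/2
kl_closed = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

print(kl_numeric, kl_closed)             # both are approximately 0.4431
```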


Intuitively, the KL divergence is nonnegative and decreases monotonically with the similarity of the distributions; that is, the more similar the two distributions are, the smaller the KL divergence. It equals zero exactly when *Q*(*X*) is identical to P(*X*). However, the KL divergence is non-symmetric, as KL(Q‖P) ≠ KL(P‖Q). The definition implies two further properties: the contribution to the divergence vanishes wherever Q(X) goes to zero regardless of P(X), and grows toward infinity wherever P(X) goes to zero while Q(X) does not. Hence, we can approximate the distribution P(X) with Q(X) by minimizing KL(Q(X)‖P(X)).

#### *3.1.1. Evidence lower bound*

In the context of Bayesian statistics, "evidence" is an alternative term for the marginal likelihood of the observations. Formula (3) reveals the relationship between the KL divergence and the logarithm of the evidence *P*(*X*). The difference between them equals the expectation of log(*P*(*X*, *Z*)) − log(*Q*(*Z*)), which is called the evidence lower bound (ELBO). As the KL divergence is nonnegative, this expectation is indeed a lower bound on the log evidence, which explains the name literally. Jordan et al. [1] originally obtained the same result using Jensen's inequality. We define the expectation of log(*P*(*X*, *Z*)) − log(*Q*(*Z*)) as *L*(*Q*), a function of the distribution Q(Z):

$$\log P(X) - \mathrm{KL}\left(Q(Z)\,\|\,P(Z \mid X)\right) = E_{Q}\left[\log P(Z, X) - \log Q(Z)\right] \triangleq L(Q). \tag{3}$$

Intuitively, maximizing the ELBO is equivalent to minimizing the KL divergence: as KL(*Q*(*Z*)‖*P*(*Z*|*X*)) decreases to zero, *Q*(*Z*) approaches the posterior distribution *P*(*Z*|*X*). Hence, we can use *Q*(*Z*) to approximate the posterior distribution *P*(*Z*|*X*) by maximizing the ELBO, which is realized by optimizing the objective *L*(*Q*) as in formula (4): finding an optimal distribution Q∗(*Z*) within a specified family 𝒬 of densities over the latent variables. The expectation maximization (EM) algorithm [2] is one of the successful approaches designed for finding the optimal solution Q∗(*Z*) within the family 𝒬. It alternates between an expectation step (E-step), in which the posterior distribution P(Z|X; θ<sub>old</sub>) is calculated, and a maximization step (M-step), in which the expectation of the complete-data likelihood with respect to the posterior distribution P(Z|X; θ<sub>old</sub>) is maximized by optimizing the parameters θ<sub>new</sub>; the parameters θ<sub>old</sub> are then updated with θ<sub>new</sub>:

$$Q^{*} = \operatorname*{argmax}_{Q \in \mathcal{Q}} L(Q). \tag{4}$$
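As a numerical sanity check of the identity in Eq. (3), the toy discrete model below (invented for illustration) verifies that the ELBO plus the KL divergence recovers the log evidence for an arbitrary Q(Z):

```python
import numpy as np

# Toy check of Eq. (3) with one binary latent variable Z:
# log P(X) - KL(Q(Z)||P(Z|X)) equals E_Q[log P(Z,X) - log Q(Z)] (the ELBO).
p_z = np.array([0.3, 0.7])            # prior P(Z)
p_x_given_z = np.array([0.9, 0.2])    # likelihood P(X=x|Z) for one fixed observation x

p_joint = p_z * p_x_given_z           # P(Z, X=x)
p_x = p_joint.sum()                   # evidence P(X=x)
p_post = p_joint / p_x                # posterior P(Z|X=x)

q = np.array([0.5, 0.5])              # an arbitrary variational distribution Q(Z)

elbo = np.sum(q * (np.log(p_joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(p_post)))

print(np.log(p_x), elbo + kl)         # identical up to floating-point error
```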

#### *3.1.2. Mean field*


To simplify the optimization of the ELBO, it is necessary to make an assumption about the family 𝒬, as the choice of family strongly affects the complexity of the optimization algorithm. This assumption concerns how Q(Z) is factorized:

$$Q(Z) = \prod_{i=1}^{m} Q_i(Z_i). \tag{5}$$

where Q<sub>i</sub>(Z<sub>i</sub>) denotes the individual factors, which are mutually independent over the latent variables of the model. According to the chain rule of probability, the joint distribution P(X, Z) can be decomposed as:

$$P(Z, X) = P(X) \prod_{i=1}^{m} P\left(Z_i \mid Z_{1:(i-1)}, X\right). \tag{6}$$

Then, the ELBO can be written as Eq. (7):

$$L(Q) = \log P(X) + \sum_{i=1}^{m}\left( E_{Q}\left[\log P\left(Z_i \mid Z_{1:(i-1)}, X\right)\right] - E_{Q_i}\left[\log Q_i(Z_i)\right]\right). \tag{7}$$

where log(P(X)) is constant with respect to Q(Z). Maximizing the ELBO is therefore equivalent to maximizing the summation term. Furthermore, we can derive the optimal solution Q<sub>i</sub><sup>∗</sup> by the method of Lagrange multipliers:

$$Q_i^{*}(Z_i) \propto \exp\left(E_{Q_{-i}}\left[\log P\left(Z_i, Z_{-i}, X\right)\right]\right). \tag{8}$$

Formula (8) indicates that each optimal factor is proportional to the exponential of the expected log joint distribution, where the expectation is taken over all variational factors except the *i*-th. This is also the gist of coordinate ascent variational inference (CAVI) [37]. However, as the ELBO is not necessarily a convex function, there is no guarantee that the solution Q<sup>∗</sup> is a global optimum.
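To make Eq. (8) concrete, the following sketch runs CAVI on a toy bivariate Gaussian target, for which the mean-field coordinate update is available in closed form; the target parameters are arbitrary, and this is an illustration of the update rule rather than the chapter's algorithm:

```python
import numpy as np

# Sketch of CAVI, Eq. (8), for a toy target: a bivariate Gaussian "posterior"
# with precision matrix Lam. The mean-field optimum q_i(z_i) = N(m_i, 1/Lam[i,i])
# has a closed-form coordinate update for the means m_i.
mu = np.array([1.0, -1.0])                 # target mean
Lam = np.array([[2.0, 0.8], [0.8, 1.5]])   # target precision (inverse covariance)

m = np.zeros(2)                            # variational means, initialized at 0
for sweep in range(50):                    # coordinate ascent sweeps
    for i in range(2):
        j = 1 - i
        # Expectation in Eq. (8) over the other factor q_j fixes m_i:
        m[i] = mu[i] - (Lam[i, j] / Lam[i, i]) * (m[j] - mu[j])

print(m)   # converges to the true mean [1, -1]; each factor's variance is 1/Lam[i,i]
```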

#### **3.2. Variational autoencoder**

As a deterministic model, a general regularized autoencoder knows nothing about how to create a latent vector until a sample is input. Conversely, as a generative model, the variational autoencoder (VAE) [36] emerges as a successful example of the combination of variational inference and neural networks. The VAE forces the latent vector to follow a specified distribution. These characteristics not only retain the properties of general regularized autoencoders but also add some new ones. For example, a VAE can generate data points even without any encoding input. It is this specialty that distinguishes the VAE from the other regularized autoencoders. To explore the VAE further, it is necessary to understand the neural network structure, the loss function, and the optimization algorithm.

In view of the hierarchy, the neural network structure of the VAE is mainly composed of three parts. The first part is the encoder, which encodes the signals from the input layer. The second part is the decoder, located on the right side as shown in **Figure 2**. The third part is the sampling unit, located between the other two. Apart from the encoder and the decoder, which are similar to those of a traditional autoencoder, the additional sampling unit is responsible for sampling from the latent variable space.
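A minimal sketch of this three-part structure, assuming PyTorch; the layer sizes (784, 400, 20) and activation choices are illustrative assumptions rather than the chapter's configuration:

```python
import torch
import torch.nn as nn

# Sketch of the three-part VAE structure: encoder, sampling unit, decoder.
class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # encoder head for mu(X)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # encoder head for log sigma^2(X)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # sampling unit: Z = mu + sigma * eps (the reparameterization trick below)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar
```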

Another issue in training the structure is the loss function, shown in formula (9), which is essentially the same as the negative of L(Q) in formula (7). From the training point of view, the losses of a VAE come from two aspects. The first part comes from the neural network and measures the difference between the reconstructed data and the original input; it encourages the decoder to learn to reconstruct the input, since otherwise this part would grow and increase the total loss. The second part comes from the KL divergence, which indicates how close the encoder's distribution Q(Z|*X*<sub>i</sub>) is to the latent variable distribution. This part can be taken as a regularizer, as in the traditional regularized autoencoder: it forces the encoder's distribution Q(Z|X<sub>i</sub>) to go as close to the latent variable distribution P(Z) as possible by minimizing their KL divergence. In other words, if the encoder's output representations deviate from the specified distribution, the regularizer term penalizes the loss function; otherwise, the penalty vanishes:

$$L_{VAE} = -\sum_{i=1}^{N} \left( E_{Z \sim Q(Z \mid X_i)} \left[ \log P(X_i \mid Z) \right] - \mathrm{KL}\left( Q(Z \mid X_i) \,\|\, P(Z) \right) \right). \tag{9}$$
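A sketch of Eq. (9) for the VAE sketched above, assuming a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy) and the standard normal prior adopted below in Eq. (10); both modeling choices are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Sketch of the two-part VAE loss in Eq. (9).
def vae_loss(x, x_recon, mu, logvar):
    # reconstruction part: -E_{Z~Q}[log P(X_i|Z)], one-sample Monte Carlo estimate
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # regularizer part: KL(Q(Z|X_i) || N(0, I)) in closed form, cf. Eq. (10)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```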

The last idea for the VAE is how to minimize the loss function of Eq. (9) when working with neural networks, where algorithms based on gradient descent are popularly adopted. The first term in Eq. (9) is feasible to compute, as the expectation indicates the reconstruction difference; we can calculate it by the mean squared error between the decoder's output and the original input, similar to traditional autoencoders. However, it is more difficult to compute the second term, the KL divergence, directly, as P(Z) and P(*X*<sub>i</sub>|Z) are intractable. Fortunately, an effective solution was proposed by Kingma et al. [36] under the assumption that Q(Z|*X*<sub>i</sub>) follows a normal distribution Q(Z|X<sub>i</sub>) ~ 𝒩(Z; θ), where θ = {μ<sub>i</sub>, Σ<sub>i</sub>} and μ<sub>i</sub> and Σ<sub>i</sub> are the mean and variance parameters, respectively. For simplicity, here we assume P(Z) = 𝒩(Z; 0, I), where I is the identity matrix. The advantage of this choice is that it makes the computation of the KL divergence manageable; we can compute it in closed form as:

$$\mathrm{KL}\left( Q(Z \mid X_i) \,\|\, P(Z) \right) = \frac{1}{2}\left( \log\frac{1}{|\Sigma_i|} - D + \mathrm{tr}(\Sigma_i) + \mu_i^{T}\mu_i \right). \tag{10}$$

where D is the dimensionality of the latent distribution.
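The closed form in Eq. (10) can be checked against a Monte Carlo estimate of E<sub>Q</sub>[log Q − log P]; the diagonal covariance and the randomly sampled parameters below are illustrative assumptions:

```python
import numpy as np

# Sanity check of Eq. (10) for a diagonal Gaussian Q = N(mu_i, Sigma_i)
# against a Monte Carlo estimate of E_Q[log Q - log P], with P = N(0, I).
rng = np.random.default_rng(0)
D = 4
mu = rng.normal(size=D)
var = rng.uniform(0.5, 2.0, size=D)       # diagonal of Sigma_i

closed = 0.5 * (np.log(1.0 / np.prod(var)) - D + var.sum() + mu @ mu)

z = mu + np.sqrt(var) * rng.normal(size=(200000, D))   # samples from Q
log_q = -0.5 * (((z - mu) ** 2 / var).sum(1) + np.log(var).sum() + D * np.log(2 * np.pi))
log_p = -0.5 * ((z ** 2).sum(1) + D * np.log(2 * np.pi))
print(closed, (log_q - log_p).mean())     # the two estimates agree closely
```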

Additionally, to train a VAE neural structure, attention must be paid to gradient descent when the error back-propagates through the sampling layer. We cannot differentiate the loss function with respect to the distribution *Q*(*Z*|*X*<sub>i</sub>) directly, as sampling is a non-continuous operation and has no gradient. To clarify the problem, suppose we could take the derivative of L<sub>VAE</sub> with respect to Q(Z|*X*<sub>i</sub>); then we would get the following gradient expression:

$$\frac{\partial L_{VAE}}{\partial Q} = \log P(X_i \mid Z) + \log Q - \log P(Z) + \mathrm{const}. \tag{11}$$

It is clear that the gradient depends not only on the decoder's distribution P(X<sub>i</sub>|Z) but also on the encoder's distribution Q(Z|X<sub>i</sub>). Besides the non-continuity of the encoder's distribution, the neural network contains no stochastic unit through which the error could back-propagate. Kingma et al. [36] presented a method named the "reparameterization trick" to solve this problem successfully. Instead of drawing from the encoder's representations directly, the sampling unit first generates μ and σ from the input X. Given μ(X) and σ(X), we can sample from 𝒩(μ(X), σ<sup>2</sup>(X)) by computing Z = μ(X) + σ(X) ∗ ε, where ε ~ 𝒩(0, I). Consequently, for a fixed X and ε, L<sub>VAE</sub> becomes continuous and deterministic in P and Q, which means that the derivative of L<sub>VAE</sub> over Q is computable. Then algorithms based on gradient descent (GD) become effective on VAE neural networks. Compared with the time-consuming Gibbs sampling methods, GD-based algorithms are much more effective and efficient.
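A minimal sketch of the trick in isolation, assuming PyTorch and an arbitrary downstream loss: because Z is a deterministic function of μ, σ, and the external noise ε, well-defined gradients flow back to the encoder parameters:

```python
import torch

# Minimal illustration of the reparameterization trick:
# with Z = mu + sigma * eps, eps ~ N(0, I), gradients reach mu and sigma.
mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)

eps = torch.randn(1)                      # external noise, fixed per sample
z = mu + torch.exp(log_sigma) * eps       # differentiable in mu and log_sigma
loss = (z ** 2).sum()                     # any downstream loss
loss.backward()

print(mu.grad, log_sigma.grad)            # well-defined gradients for GD
```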
**4. ECG preprocessing and enhancement**

In this section, we introduce our method of ECG preprocessing and enhancement. The task in this procedure is to split the ECG waves into segments according to the cardiac cycle [28] and then take them as data points for training our models. As described in Section 1, the QRS complex corresponds to the activities of ventricular depolarization and repolarization; it has a morphologically higher amplitude and a sharper peak than other components such as the P-wave and the T-wave. Therefore, it is much more convenient to detect and locate Q peaks (or R, S peaks) than any other components in these ECG segments. Algorithm 1 describes the procedure of how to split ECG waveforms in detail. The templates selected in Algorithm 1 are produced by the contours of most ECG R-wave peaks.

The critical step in Algorithm 1 is how to evaluate the similarity between the selected area on the ECG waveform and the given template. Generally, the mean squared error (MSE) is adopted in some ECG recognition applications. However, the main disadvantage of this method is that aligning the selected area with the given template is time-consuming. For example, for two pictures with the same curve, the similarity value may be tiny if the template aligns extremely well, or very large if they do not cover each other at all. Another reasonable approach, the correlation coefficient, is currently used instead [21, 26]. Rather than directly computing the difference between the ECG waveform and the template as the MSE method does, it solves an optimization problem that minimizes the sum of the squares of the offsets of the selected ECG data points to the corresponding points on the template; see the sketch at the end of this section.

We introduce a parameter h<sub>step</sub> for the length of the segments of the ECG waveform. It is important to keep h<sub>step</sub> within a proper range; otherwise, a segment may contain more than one R peak, or none at all. To avoid these awkward situations, a useful trick is to let h<sub>step</sub> be proportional to, and somewhat less than, the distance between two adjacent R peaks, that is,

$$h_{step} \le \frac{\text{sampling rate}}{\text{heart rate}}.$$

For instance, suppose the sampling rate is 250 Hz and the heart rate equals 75 beats per minute; then h<sub>step</sub> ≤ 200. As the heart rate is not constant during the sampling procedure, the distance can be calculated from this inequality. For this reason, in all of our experiments, the distance is set empirically as the average over the previous three cardiac periods. The searching step can be initialized as a constant value, as there are no variations in the vertical direction; we keep v<sub>step</sub> equal to 1 in this chapter.
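As an illustration of the similarity step discussed above, the hypothetical sketch below (the function names and the unit search step are assumptions, not the chapter's Algorithm 1) scores candidate windows against an R-peak template with the correlation coefficient:

```python
import numpy as np

# Hypothetical sketch of the template-matching step: slide a window the
# length of the template across the ECG signal and score each position with
# the correlation coefficient, which, unlike MSE, is insensitive to
# vertical offset and scale.
def correlation(segment, template):
    s = segment - segment.mean()
    t = template - template.mean()
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-12))

def locate_r_peak(ecg, template):
    n = len(template)
    scores = [correlation(ecg[i:i + n], template)
              for i in range(len(ecg) - n + 1)]   # horizontal search, step 1
    return int(np.argmax(scores))                  # start index of the best match
```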
