**2. Autoencoders and variants**

The variational autoencoder is closely related to the autoencoder. An autoencoder is a neural network composed of an encoder and a decoder: the encoder maps its input to a representation, and the decoder reconstructs the input from that representation. A well-trained autoencoder therefore reproduces the training data approximately, because it is forced to prioritize those aspects of the input that are most useful for reconstruction and to discard the rest. In this sense, the autoencoder learns the useful properties of the training data. The VAE shares this character with the AE while adding some specialties of its own.

#### **2.1. Autoencoder and regularized variants**

An autoencoder can be used to obtain useful features from the encoder output. In terms of feature dimension, autoencoders fall into two categories: undercomplete and overcomplete. Undercomplete means the feature dimension is smaller than that of the input, a setting in which the most salient features tend to be learned well. Conversely, in the overcomplete case the feature dimension is larger than that of the input, and sparser features may be obtained. The objective function is another core issue for an autoencoder. It is designed so that the autoencoder, much like linear or logistic regression, is constrained to capture some useful properties of the training data. The general form of the objective function can be written as follows:

$$
\tilde{J}(\mathbf{X}; \boldsymbol{\theta}) = J(\mathbf{X}; \boldsymbol{\theta}) + \alpha\,\Omega(\boldsymbol{\theta}), \tag{1}
$$

where X is the training data for a given autoencoder, *θ* = {*W*<sup>e</sup>, *b*<sup>e</sup>, *W*<sup>d</sup>, *b*<sup>d</sup>} are the parameters of the model, and *α* is a nonnegative hyperparameter that controls the weight of the penalty term Ω relative to the standard objective function J. Numerically, setting *α* to 0 means no regularization, and larger values of *α* result in stronger regularization. Conceptually, an autoencoder with a penalty term is usually called a regularized autoencoder; it is encouraged to have small derivatives of its representation, which makes convergence during training faster than for an autoencoder without any regularization.
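To make Eq. (1) concrete, the sketch below implements a small fully connected autoencoder whose loss is a reconstruction term J plus a weighted penalty Ω. The chapter does not specify a framework or a particular penalty; here PyTorch is assumed, J is taken as the mean squared reconstruction error, and Ω as an L1 norm on the hidden code, so everything beyond the structure of Eq. (1) itself is an illustrative assumption.

```python
# Minimal regularized autoencoder sketch for Eq. (1): J_tilde = J + alpha * Omega.
# Framework (PyTorch), layer sizes, and the concrete choice of J (MSE) and
# Omega (L1 norm of the hidden code) are illustrative assumptions, not from the chapter.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=64):
        super().__init__()
        # Encoder parameters (W_e, b_e) and decoder parameters (W_d, b_d).
        self.encoder = nn.Linear(n_in, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))       # representation
        x_hat = torch.sigmoid(self.decoder(h))   # reconstruction
        return x_hat, h

def regularized_loss(x, x_hat, h, alpha=1e-3):
    J = nn.functional.mse_loss(x_hat, x)   # standard reconstruction objective J
    Omega = h.abs().mean()                  # penalty term Omega (here: L1 on the code)
    return J + alpha * Omega                # Eq. (1)

# Usage sketch: one gradient step on a random stand-in batch X.
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)
x_hat, h = model(x)
loss = regularized_loss(x, x_hat, h)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Setting `alpha=0` recovers the plain autoencoder objective, mirroring the role of *α* described above.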

Different forms of the regularizer give the autoencoder different properties and lead to different variants of the regularized autoencoder. These variants primarily include the sparse autoencoder (SAE), the denoising autoencoder (DAE) [3], the contractive autoencoder (CAE), and the variational autoencoder (VAE). Theoretically, the VAE combines variational inference (VI) with neural networks. As a generative model, one of its prominent successes is that it makes random sampling trainable with back-propagation (BP). This will be described in detail in Section 3.

Different from the VAE, the SAE keeps the majority of the neurons in its hidden layers inactive, since the activation functions of these neurons are saturated for most inputs. This results in sparse features, in which many of the elements are zero (or close to zero). Mathematically, the sparsity of the SAE is obtained through the penalty term KL(*p̃*‖*p*), where *p̃* is the given sparsity value; the parameter *p* is adjusted gradually toward *p̃* during training until a satisfactory sparsity is achieved. Analogous to the SAE, the CAE [4, 26] obtains its contractive property from a penalty term, the Jacobian matrix consisting of the partial derivatives of the encoder activations with respect to the input vector, so that input perturbations are resisted during training. Consequently, a neighborhood of sample points is encouraged to be mapped into a smaller region, which can be regarded as the contracting capability of the CAE. The motivation of the DAE is to be insensitive to noise. Instead of adding an additional penalty term to the objective function, the DAE is trained on noise-corrupted data *x̃* (*x̃* = *x* + *βτ*) [27, 30]. The DAE has been very successful in many cases, especially under the manifold assumption: because the corrupted data *x̃* lie farther from the data manifold than the uncorrupted data, the DAE tends to pull points that are far from the manifold back toward it, and the larger the distance from the manifold, the bigger the step the DAE takes toward it.
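The three regularization mechanisms can be summarized in a few lines of code. The sketch below reuses the hypothetical `Autoencoder` from the previous listing and shows (a) an SAE-style KL sparsity penalty between a target sparsity *p̃* and the average hidden activation, (b) DAE-style input corruption *x̃* = *x* + *βτ* with Gaussian *τ*, and (c) a CAE-style penalty on the Jacobian of the hidden code with respect to the input. The specific formulas for (a) and (c) follow common practice and are assumptions rather than the chapter's exact definitions.

```python
# Sketches of the SAE, DAE, and CAE regularizers discussed above.
# The concrete formulas (Bernoulli KL for sparsity, Gaussian corruption noise,
# squared Jacobian norm) follow common practice and are assumptions here.
import torch

def sparsity_penalty(h, p_target=0.05, eps=1e-8):
    # SAE: KL(p_target || p_hat) summed over hidden units, where p_hat is the
    # average activation of each unit over the batch (activations in (0, 1)).
    p_hat = h.mean(dim=0).clamp(eps, 1 - eps)
    kl = p_target * torch.log(p_target / p_hat) \
         + (1 - p_target) * torch.log((1 - p_target) / (1 - p_hat))
    return kl.sum()

def corrupt(x, beta=0.1):
    # DAE: train on x_tilde = x + beta * tau, with tau drawn from N(0, I).
    return x + beta * torch.randn_like(x)

def contractive_penalty(x, model):
    # CAE: squared norm of the Jacobian dh/dx of the hidden code w.r.t. the input.
    # `model` is any module returning (reconstruction, hidden code), like the
    # Autoencoder sketch above. Summing h over the batch before differentiating
    # is valid because cross-sample derivatives are zero, so this equals the sum
    # of the per-sample squared Jacobian norms.
    jac = torch.autograd.functional.jacobian(lambda inp: model(inp)[1].sum(dim=0), x)
    return (jac ** 2).sum()
```

During training, an SAE would add `sparsity_penalty(h)` to the reconstruction loss, a DAE would feed `corrupt(x)` to the encoder while still reconstructing the clean `x`, and a CAE would add `contractive_penalty(x, model)`, each scaled by its own hyperparameter.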

Generally, these autoencoders share some properties. The DAE and the CAE are both able to learn the manifold structure of the samples, and the SAE and the CAE produce similarly sparse representations. Nevertheless, their implementations are quite different. For example, the DAE reaches its goal by training on noise-corrupted data, so that the learned parameters can reconstruct the original samples without any noise. The CAE, by contrast, takes the Jacobian matrix as part of the loss function and encourages robustness of the representation by contracting the samples during training.

**3. Variational inference and variational autoencoder**

As the central problem of inference, computing the posterior distribution faces two computational challenges: computing the marginal likelihood and computing the predictive distribution. Both are intractable because they typically require high-dimensional integrals. Approximate inference approaches such as Gibbs sampling, which is based on the Markov chain Monte Carlo (MCMC) principle, are therefore appealing. However, Gibbs sampling and its variants are often ruled out in practice by their inefficiency, especially in high-dimensional settings. This awkward situation did not change until the VAE was proposed [36]. To understand the VAE, we first review the relevant foundations, including variational inference (VI), the evidence lower bound (ELBO), the mean field approximation, and the Kullback–Leibler (KL) divergence.

#### **3.1. Variational inference**

To describe the problem mathematically, let *X* = {*x*<sub>1</sub>, *x*<sub>2</sub>, …, *x*<sub>*N*</sub>} be a set of *N* observations and *Z* = {*z*<sub>1</sub>, *z*<sub>2</sub>, …, *z*<sub>*m*</sub>} be *m* latent variables. P(*Z*, *X*; *θ*) denotes the joint distribution of *X* and *Z* given the model parameters *θ*, and P(*X*|*Z*) and P(*Z*|*X*) are called the likelihood and the posterior distribution, respectively. Theoretically, the motivation of variational inference [33, 35] is to find a tractable distribution that approximates the desired posterior distribution, which is itself intractable. To measure how close the approximating distribution is to the true posterior, the KL divergence is used.
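As a pointer to how these pieces fit together, the display below gives the standard textbook formulation of this idea (it is not an equation stated explicitly in the chapter): variational inference replaces the intractable posterior by the closest member of a tractable family of distributions, and the intractability comes from the marginal likelihood integral in the denominator of Bayes' rule.

$$
q^{*}(Z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(Z)\,\|\,P(Z \mid X)\big),
\qquad
P(Z \mid X) = \frac{P(X \mid Z)\,P(Z)}{\int P(X \mid Z)\,P(Z)\,\mathrm{d}Z}.
$$

The denominator is exactly the marginal likelihood whose high-dimensional integral motivates the approximate inference methods discussed above.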