1. Introduction

Artificial intelligence is best known for data analysis and classification. Generative models, on the other hand, are suitable for the generation (synthesis), completion, or even removal of data. A recent application in image processing, which combines both classification and generation, is image captioning. One advantage of automatic image captioning is indexing, which is important in content-based image retrieval [1]. Supervised learning relies on labeled datasets, which take time to prepare and clean. A dataset's context should be bias-free for a model to train properly, and this can be crucial in domains like criminal justice, infrastructure, or finance [2], or even in clinical settings [3]. Another example is the labeled maps used for road planning: these maps are satellite pictures whose labels are used in navigation systems, and the labels need to be prepared [4]. Generative models can help to expand datasets or generate new examples from the abstraction of features of different classes.

A class is the assigned label given to a data value. Another area involved in classification and generation is natural language processing (NLP), with techniques used directly in captioning [5]. A successful method in automatic image processing is inpainting [6]; this model learns a distribution by sampling and learning from incomplete data. It is a supervised learning model with good results at data completion of images: it finds the distribution of the areas surrounding the pixels to complete the missing information. The downside of this method is that it is constrained to work only around existing information; it cannot generate new data from conditions and specific constraints, or without surrounding information for context. It has a simple architecture, but it is constrained to analyze pixel by pixel, which makes it slow. It uses a dataset with images as targets and the same images with occlusions in different spots, creating gaps of information. Models like the restricted Boltzmann machine [7] or maximum likelihood [8] work by finding the probability density function of the data and then maximizing its probability, but this is intractable, so they produce only approximations. When these models are conditioned, generation can be specified with queries like text descriptions [9].

A Simplified Generative Model Based on Gradient Descent and Mean Square Error
DOI: http://dx.doi.org/10.5772/intechopen.90099

Generative adversarial networks (GANs) are an efficient solution for data generation [10]. They consist of two neural networks competing against each other in a min-max game. One network takes Gaussian noise as input and generates data; the second trains to discriminate real data from generated data. Both networks compete constantly and improve at each epoch. Training continues until it reaches the Nash equilibrium, the point at which the generator starts producing realistic data. GANs improve on the image generation quality of variational autoencoders (VAEs) but are unstable, and finding training parameters is not trivial. Wasserstein GAN is a variation that implements techniques to find hyperparameters in an iterative manner [11].

Variational autoencoders (VAEs) are models that learn a Bayesian distribution from examples in a dataset. These distributions, learned in an unsupervised way, enable the generation of new data by interpolating between point sets in the latent space [12]. PixelRNN is a model that uses recurrent neural networks to generate images pixel by pixel, sampling each pixel from the previous ones; the problem with this kind of generation is the processing time at training and at generation runtime [9]. PixelCNN is a model that conditions the generation process with a one-hot vector [13] and uses convolutional layers; generation is also pixel by pixel, which is incremental and has a high processing runtime [14]. This chapter presents a One Network Generator (ONG) model, tested on the MNist dataset of handwritten digits: a simplified generative model based on the traditional mean square error [15] and gradient descent [16] for stable training. The rest of this chapter is organized as follows. Section 2 details the proposed methodology for data generation using one conditioned neural network. Section 3 presents the results obtained by testing the proposal. Section 4 describes the conclusions.

## 2. Methodology and the proposed ONG

The model consists of a neural network with two fully connected layers with sigmoid activation. It was tested with image dataset samples that condition both sides of the network. At the output layer, the result is measured with the mean square error (Eq. (1)), and the network is trained with gradient descent (Eq. (2)) and backpropagation [16]. It also updates the inputs using the delta errors of the input layer.


$$MSE = \frac{1}{n} \sum\_{k=1}^{n} \left( Y\_k - \widehat{Y\_k} \right)^2 \tag{1}$$
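As a quick illustration, Eq. (1) can be written directly in code. This is a minimal Python sketch, not the chapter's implementation:

```python
import numpy as np

def mse(y_target: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean square error of Eq. (1): the average squared difference
    between the target outputs Y_k and the network outputs."""
    return float(np.mean((y_target - y_pred) ** 2))
```

For example, with targets [1, 0] and outputs [0.5, 0.5], each squared error is 0.25, so the MSE is 0.25.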

Here $Y\_k$ is the kth output target and $\widehat{Y\_k}$ is the output computed by the neural network.

$$
\Theta = \Theta - \eta \ast \nabla\_{\Theta} J(\Theta) \tag{2}
$$

where Θ are the weights updated at each epoch of training, η is the learning rate, which scales the adjustment of the error, and $\nabla\_{\Theta} J(\Theta)$ is the gradient of the error, giving the direction of update of the function. All samples have associated x and y values, which are initialized to the same point at the center of the learning space, (0.5, 0.5). The learning space is known as the latent space, which is the corresponding abstraction of the information; it represents different aspects of the data [12]. In this example, the latent space is a two-dimensional continuous space between zero and one, represented graphically on the x- and y-axes for visualization. The initial weights of the layers are set from a Gaussian distribution with a mean of zero and variance of one.
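The update rule of Eq. (2) can be sketched as follows, assuming the gradient has already been computed by backpropagation:

```python
import numpy as np

def gradient_descent_step(theta: np.ndarray, grad: np.ndarray,
                          lr: float = 0.1) -> np.ndarray:
    """Eq. (2): Theta = Theta - eta * grad_Theta J(Theta).
    `lr` is the learning rate eta scaling the adjustment."""
    return theta - lr * grad
```

For instance, weights [1.0, 1.0] with gradient [0.5, -0.5] and η = 0.1 become [0.95, 1.05].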

The model was trained with a dataset of 10,000 labeled samples of handwritten numbers. The model creates a binary vector of 10 indexes from 0 to 9; this value indicates the class the image belongs to. For example, for the image of a number two, the only position with value one in the vector is $\widehat{X}[i = 2] = 1$; the rest are set to zero. The vector is then extended with two additional floating-point values related to the learning space. The image database used is the MNist csv file [17], a comma-separated value format with 10,000 examples of images of handwritten digits of a 28 by 28 pixel size. MNist was chosen to show results visually in the same way they are shown in the papers on GANs [10] and VAEs [12].

At the training stage, the network iterates over a random subset of the dataset. It keeps updating the weights inside the layers to improve the output, as well as the variables associated with each sample of the training data (the latent space) from the input. The values associated with the samples of the dataset form a representation vector created at the beginning of training; this vector has 12 dimensions and is part of each example. When training with each sample on every epoch, the model compares the generated output of the network with the target image and updates the input value, which is the conditioner vector, learning the representation of the latent space inside that vector. The vector starts as a one-hot vector [13] with a slight difference: it is explicitly extended by two positions matching the dimensionality of the latent space, initialized with the starting latent values (0.5, 0.5). The first 10 positions describe, with a binary value, which number from 0 to 9 the image shows. For example, to query an image of the number two, the initial vector is [0,0,1,0,0,0,0,0,0,0,0.5,0.5], and the two floating-point values are used to learn the abstraction of the data (see Figure 1).
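The 12-dimensional conditioner vector described above (a one-hot class code plus the two latent coordinates) can be sketched in Python; the function name is illustrative, not from the chapter:

```python
def make_conditioner(digit: int, x: float = 0.5, y: float = 0.5) -> list:
    """Build the 12-dimensional input vector: positions 0-9 are a
    one-hot encoding of the digit class, and positions 10-11 hold
    the (x, y) latent coordinates, initialized at the center
    (0.5, 0.5) of the learning space."""
    vec = [0.0] * 10
    vec[digit] = 1.0
    return vec + [x, y]
```

`make_conditioner(2)` produces exactly the example vector [0,0,1,0,0,0,0,0,0,0,0.5,0.5].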

This determines which values the network encodes in the forward pass (Eq. (3)).

$$X = f\left(W' f(WX + b) + b'\right) \tag{3}$$

where X is the input vector, W and W' are the weight matrices of each layer, b and b' are the bias vectors, and f is the activation function of each layer, e.g., sigmoid or ReLU. When backpropagation runs, the deltas of the first layer update both input dimensions (x, y), which are the reference values of every example in the dataset. Each input of the dataset keeps its reference to the updated values computed from the delta error propagated from the MSE (see Figure 2).

Technology, Science and Culture - A Global Vision, Volume II

Figure 1.
Conditioning the neural network with a target image and a target incomplete vector.

Figure 2.
Architecture of the ONG model. At each epoch, the output and input are updated.
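To make the mechanism concrete, here is a hypothetical NumPy sketch of one training step: the forward pass of Eq. (3), followed by backpropagation of the MSE deltas through both layers, with the first-layer delta also updating the two latent coordinates of the input. Dimensions, the learning rate, and function names are illustrative; the chapter's actual implementation is in C#:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W, b, W2, b2, lr=0.5):
    """One ONG step: forward pass out = f(W' f(W x + b) + b') (Eq. (3)),
    backpropagation of the MSE deltas (Eq. (2)), and an update of the
    latent coordinates stored in the last two positions of x."""
    # forward pass
    h = sigmoid(W @ x + b)        # hidden layer
    out = sigmoid(W2 @ h + b2)    # generated image
    # backward pass: deltas of the squared error through the sigmoids
    d_out = (out - target) * out * (1.0 - out)
    d_h = (W2.T @ d_out) * h * (1.0 - h)
    d_x = W.T @ d_h               # delta reaching the input layer
    # gradient descent on the weights and biases
    W2 -= lr * np.outer(d_out, h)
    b2 -= lr * d_out
    W -= lr * np.outer(d_h, x)
    b -= lr * d_h
    # only the two latent coordinates of the conditioner are updated;
    # the one-hot class positions stay fixed
    x[10:12] -= lr * d_x[10:12]
    return out
```

Repeated calls on the same sample drive the output toward the target while the sample's (x, y) reference drifts to its place in the latent space.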

At each epoch, the input values start to distribute the reference values (x, y) that describe the images from the dataset across the learning space (latent space). The resulting distributions along the (x, y) axes of the latent space should serve as a guide to sample from it and retrieve an interpolation in a continuous space.

## 3. Results and discussion

The experiments ran on a computer with 24 GB of RAM, an Intel(R) Xeon(R) CPU E5-2603 v3 at 1.60 GHz, and 64-bit Windows 10. A program implementing the model was developed in C# on the .NET Framework. The ONG model conditions the neural network on both the inputs and the generated data. As each epoch runs, it updates the weights, and the error deltas from the first layer (the input layer) compute the updated values of the axes (dimensions) of the learned space. The x, y dimensions inside the input values of each image represent the latent space, and the latent space represented in these two dimensions is the abstraction of the data in that area. The model learns the latent space as it keeps evolving, learning to generate the data by adjusting similar references together (see Figure 3).

The latent space keeps a correlation between its two reference dimensions; this means that the distribution depends on both dimensions and is not complete, i.e., it does not fill the entire space of both axes. In addition, the abstraction of the data in two dimensions makes the sampled results still blurry, as with VAEs.

Even when the distribution of the learned space has gaps in certain areas, the images generated from the given distribution are an interpolation from the learned dataset. The model controls the results by sampling the learned space with input vectors: again, the first 10 binary values represent the requested image number, and the extra two values are sampled from the latent space area with values in the range (0, 1) (see Figure 4).
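Sampling thus reduces to building a query vector with the desired class bit set and a point (x, y) inside the learned distribution, then running it through the trained forward pass. A hypothetical sketch (the `forward` callable stands in for the trained network of Eq. (3)):

```python
import numpy as np

def sample_digit(digit, x, y, forward):
    """Query the trained network for an image of `digit` at latent
    point (x, y) in (0, 1). `forward` is the trained network's
    forward-pass function taking the 12-dimensional conditioner."""
    query = np.zeros(12)
    query[digit] = 1.0          # one-hot class request
    query[10], query[11] = x, y  # latent coordinates to sample
    return forward(query)
```

Sweeping (x, y) over the valid region of the latent space while keeping the class bit fixed produces the interpolated variants of a single digit.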

The model can sample in a continuous manner from the learned space by drawing from the valid areas to get interpolated samples at continuous values along both axes. The result shows a soft transition at each change from the selected class, with different features related to the position of the x, y values (see Figure 5).

Figure 3.
Evolution of the latent space learned with the updated input.

Figure 4.
Sampled images at specific x, y values from the latent space. The sampling vectors each have a different conditioner position set to one. The ONG network trained with these representations, and the results were obtained by picking at that same point in space.

Figure 5.
Sampled digits at continuous x, y values from the latent space interpolation. The conditioner of the vector changes only on the dimension that represents each class: 9, 5, and 2, respectively, top to bottom.

We were able to generate the image of the requested number; the training took up to 30,000 epochs with a learning error rate no lower than 1%, computed from the MSE (Eq. (1)) at the neural network training stage. The generation was an interpolation of a continuous value sampled from the latent space. The latent space given by the reference values showed a correlation between both dimensions, which causes a problem when sampling from outside the distribution of the latent space: the generated information is incomplete or incorrect. In addition, the abstraction of the data in two dimensions makes the sampled results still blurry, as with VAEs. If more dimensions are used to abstract the information, the output exhibits black gaps inside the distributions of the latent space; when sampling from the empty space, the generation yields no information or incomplete images.

A possible solution for this completion of the information, or latent space management, could be training a K-autoencoder [18] to learn the sample space and generate a secondary latent space based on the incomplete dimensions, so that only valid data mapped from the control space is retrieved. This would increase complexity, but it still needs to be tested. To improve generation, the input update should stop after covering the visible (x, y) axes of the latent space. Because the encoded information and the resembled images are projected into only a two-dimensional space, the projection of the inputs keeps updating, and so the learning keeps shifting; it is therefore necessary to stop changing the inputs. To keep conditioning different aspects of a dataset, we need those features already labeled, as in supervised learning; otherwise, the network tries to cluster the information in a loop.

This technique of keeping the input in a fixed learned state helps to increase the quality of the generated images. The variation of the classes in the dataset is another problem of the dimensionality reduction: by reducing the variation or increasing the dimensionality of the representation, the quality of the generated data increases, and the learning error of the network decreases. The model was also able to learn without the labels, but it was slower, and the error rate settled at a higher level.

## 4. Conclusions

This method is a simplified generative neural network: the usual two-network designs are reduced to a single network architecture, which can learn to generate interpolations from given samples as GAN and VAE architectures do. Both of those models are based on two neural networks, and their parameters are optimized with extensions of the MSE. These differences increase the complexity of implementation and training of those methods, which is reduced in our proposed method. That makes this model a better fit as a generative model in simpler cases. In all these tests, we tried abstraction in only two dimensions, whereas the mentioned methods commonly use a range of 100-120 dimensions for the abstraction; we were also capable of conditioning the network to generate a requested image.

With the MNist csv file [17], we tested our one-network architecture to generate learned abstractions of the different handwritten numbers of the dataset and obtained new versions of the images which contained interpolated characteristics of the queried data.

The main difference in complexity with other methods is that the model proposed in this chapter uses only a one-network, two-layered architecture. The mentioned architectures commonly have two networks with more than two layers each to cluster the information and generate new information from the abstraction of the features in a given dataset.

Author details

Omar López-Rincón* and Oleg Starostenko*

Department of Computing, Electronics and Mechatronics, Universidad de las Américas Puebla, Cholula, Puebla, Mexico

*Address all correspondence to: omar.lopezrn@udlap.mx and oleg.starostenko@udlap.mx

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
