**4. Architecture design**

Although numerous variants of CNN architectures for solving different tasks are proposed by the deep learning community every year, their essential components and overall structures are very similar. We group the recent classic network structures into three main types: encoder, encoder-decoder, and GANs.

#### **4.1 Encoder**

In 1990, LeCun et al. proposed a seminal network called LeNet [2], which helped establish the modern CNN structure. Since then, many new methods and compositions have been proposed to handle the difficulties encountered in training deep networks for challenging tasks such as object detection and recognition in computer vision. Some representative works in recent years are AlexNet [18], ZFNet [7], VGGNet [19], GoogLeNet [13], ResNet [11], and Inception [14]. As mentioned earlier, new methods for constructing convolutional layers have been proposed in these networks, e.g., shortcut connections [11] and mixed kernels [14, 20].

In general, the above-mentioned networks can all be regarded as encoders, in which each input, such as an image, is encoded into a high-level feature representation, as shown on the left in **Figure 4**. This encoded representation can then be used for downstream tasks such as image classification and object detection. In some of the literature, an encoder is also called a feature extractor. Specifically, the basic convolutional layers are the main components for constructing an encoder network; by stacking multiple layers, each layer in the network can learn higher-level abstractions from the previous layers [1]. More formally, an encoder network can be written as

$$Z = \mathcal{F}\_{encoder}(X; \Theta) \tag{6}$$

where *X* is the input, Θ denotes the parameters to learn (e.g., kernels and biases) in the network, and *Z* denotes the encoded representation, such as a vector.
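As an illustration of Eq. (6), a minimal encoder of this kind can be sketched with a few stacked convolutional layers. The sketch below assumes a PyTorch-style implementation (the chapter does not prescribe a framework), and all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# A minimal encoder F_encoder(X; Theta): stacked convolutions that map an
# image X to a feature vector Z (Eq. 6). All sizes are illustrative.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learnable kernels + biases (Theta)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128),                   # encoded representation Z
)

X = torch.randn(1, 3, 32, 32)   # a dummy input image
Z = encoder(X)                  # Z has shape (1, 128)
```

The resulting vector *Z* can then be fed to a classifier head or a detection head, which is what distinguishes the downstream task rather than the encoder itself.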

#### **4.2 Encoder-decoder**

In some specific tasks such as image segmentation [20], the goal is to map an input image to a segmented output image rather than to an abstract representation. An encoder-decoder structure is specifically designed for solving this type of task.

**Figure 4.** *Left: An encoder network. Middle: An encoder-decoder network. Right: Generative adversarial networks.*

There are many possible ways to implement an encoder-decoder structure, and many variants have been proposed in the last few years to address its drawbacks. A naive version of the encoder-decoder network introduced in [20] can be denoted as

$$Z = \mathcal{F}\_{encoder}(X; \theta\_{encoder}) \tag{7}$$

$$
\hat{X} = \mathcal{F}\_{decoder}(Z; \theta\_{decoder}) \tag{8}
$$

where $\mathcal{F}\_{encoder}$ denotes an encoder CNN that maps an input sample to a representation *Z*, and $\mathcal{F}\_{decoder}$ represents a decoder CNN that reconstructs the input sample from *Z*. Specifically, CNN encoders usually conduct basic convolution operations (i.e., Section 2.1) and CNN decoders perform transposed convolution operations (i.e., Section 2.2).
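A bare-bones realization of Eqs. (7) and (8) might look as follows; this is only a sketch under the assumption of a PyTorch-style setup, with `ConvTranspose2d` standing in for the transposed convolution of Section 2.2, and the shapes chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Encoder: ordinary strided convolutions for downsampling (Eq. 7).
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(),
)

# Decoder: transposed convolutions that upsample Z back to the input size (Eq. 8).
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),     # 14x14 -> 28x28
    nn.Sigmoid(),
)

X = torch.randn(8, 1, 28, 28)
Z = encoder(X)        # encoded representation
X_hat = decoder(Z)    # reconstruction with the same shape as X
assert X_hat.shape == X.shape
```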

As shown in the middle of **Figure 4**, an encoder-decoder network is still one complete network, and we can train it end-to-end. Note that there are generally many convolutional layers in each sub-network, so a deep encoder-decoder network can be challenging to train directly. Recall that shortcut connections are often adopted to address such problems in deep CNNs; naturally, we can also add connections between the encoder and the decoder. An influential network based on this idea is U-Net [6], which is widely applied in many challenging domains such as medical image segmentation. The above two equations can also be rewritten as a composition of two functions.

$$
\hat{X} = \mathcal{F}\_{decoder} \circ \mathcal{F}\_{encoder}(X; \Theta) = \mathcal{F}\_{autoencoder}(X; \Theta) \tag{9}
$$

Specifically, in unsupervised learning, an encoder-decoder network is also well known as an autoencoder. Many variants of autoencoders have been proposed in recent years, some famous ones being the variational autoencoder [21], the denoising variational autoencoder [22], and the conditional variational autoencoder [23, 24].
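To make the shortcut-connection idea behind U-Net [6] concrete, the toy sketch below (again an assumed PyTorch-style implementation, not the original U-Net) concatenates an encoder feature map with the corresponding decoder feature map before the final convolution.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A toy two-level U-Net-style network: one encoder/decoder stage
    joined by a skip (shortcut) connection between encoder and decoder."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                         # 32x32 -> 16x16
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # 16x16 -> 32x32
        # after concatenation the decoder sees 16 (skip) + 16 (upsampled) channels
        self.dec = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        e = self.enc(x)                        # encoder features kept for the skip
        b = self.bottleneck(self.down(e))
        u = self.up(b)
        return self.dec(torch.cat([u, e], dim=1))  # skip connection: concatenate encoder features

x = torch.randn(2, 1, 32, 32)
y = TinyUNet()(x)   # segmentation-style output at the input resolution
```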

#### **4.3 GANs**

Since generative adversarial networks (GANs) were first proposed by Goodfellow et al. [25] in 2014, this type of architecture, which plays a two-player minimax game, has been studied extensively, partly because it is an unsupervised learning method that yields a generator network able to produce fake examples from a latent space (i.e., a vector of random noise). The right of **Figure 4** shows the basic structure of GANs, in which a generator network maps some input noise to a fake example and tries to make it look as real as possible, while a discriminator network tries to identify the fake samples among its inputs. By iteratively training the two players, both can improve. More formally, we have

$$\hat{Y} = \mathcal{D}(\mathcal{G}(L; \theta\_{\mathcal{G}}), X\_{real}; \theta\_{\mathcal{D}}) \tag{10}$$

where $\mathcal{G}$ denotes the generator function and $\mathcal{D}$ represents the discriminator function. *L* is the latent-space input to the generator, whose output is a fake example, and $X\_{real}$ denotes the real samples we have collected. $\hat{Y} \in [0, 1]$ is the predicted result of the discriminator, indicating whether the input is real or fake.
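The two-player training just described can be sketched as follows. This is a simplified illustration assuming a PyTorch setup, fully connected networks, and the usual binary cross-entropy formulation of the minimax game; it is not the exact procedure of [25].

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

X_real = torch.randn(32, data_dim)   # stand-in for a batch of real samples
L = torch.randn(32, latent_dim)      # latent input (random noise)

# 1) Discriminator step: push D(X_real) -> 1 and D(G(L)) -> 0.
fake = G(L).detach()
loss_D = bce(D(X_real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# 2) Generator step: push D(G(L)) -> 1 so that fakes look real to D.
loss_G = bce(D(G(L)), torch.ones(32, 1))
opt_G.zero_grad()
loss_G.backward()
opt_G.step()
```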

As shown in **Table 1**, numerous variants of GAN architectures can be found in the recently published literature, and we broadly summarize these representative networks according to their publication time. Note that the fundamental methods behind these architectures are very similar.


**Table 1.**
*Representative architectures of GANs in recent years.*

**5. Loss functions**

Note that there are numerous variants of loss functions used in the deep learning literature. However, the fundamental theories behind them are very similar. We group them into two categories, namely divergence loss functions and margin loss functions, and we introduce six typical and classic loss functions that are commonly used for training neural networks.

#### **5.1 Divergence loss functions**

Divergence loss functions denote a family of loss functions based on computing the divergence between the predicted results and the true labels, mainly including the Kullback-Leibler divergence, the log loss, and the mean squared error.

*5.1.1 Kullback-Leibler divergence*

Before introducing the Kullback-Leibler divergence, we need to understand that the fundamental goal of deep learning is to learn a data distribution *Q* over the training dataset so that *Q* is close to the true data distribution *P*. Back in 1951, the Kullback-Leibler divergence was proposed to measure the difference between two distributions on the same probability space [41]. It is defined as

$$D\_{KL}(P \| Q) = \sum\_{X} P(X) \log \frac{P(X)}{Q(X)} \tag{11}$$

$$= \sum\_{X} P(X) \log P(X) - \sum\_{X} P(X) \log Q(X) \tag{12}$$

where $D\_{KL}(P \| Q)$ denotes the Kullback-Leibler divergence from *Q* to *P*, $-\sum\_{X} P(X) \log P(X)$ is the entropy of *P*, and $-\sum\_{X} P(X) \log Q(X)$ is the cross entropy of *P* and *Q*. There is also a symmetrized form of the Kullback-Leibler divergence, which is known as the Jensen-Shannon divergence. It is a measure of the similarity between *P* and *Q*:

$$JSD(P \| Q) = \frac{1}{2} D\_{KL}\left(P \| \frac{P+Q}{2}\right) + \frac{1}{2} D\_{KL}\left(Q \| \frac{P+Q}{2}\right) \tag{13}$$

Specifically, $JSD(P \| Q) = 0$ means the two distributions are the same. Therefore, if we minimize the Jensen-Shannon divergence, we can make the distribution *Q* close to the distribution *P*. More specifically, if *Q* denotes the distribution of the data and *P* represents the distribution learned by a CNN model, then by minimizing the divergence we can learn a model that is close to the true data distribution.
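As a small numerical illustration of Eqs. (11)-(13) (not taken from the chapter), the snippet below computes both divergences for two discrete distributions with NumPy.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions (Eq. 11)."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence (Eq. 13): the symmetrized, smoothed KL divergence."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.5, 0.3, 0.2])   # model distribution

print(kl(P, Q))   # ~0.085 nats; note kl(Q, P) differs because KL is asymmetric
print(jsd(P, Q))  # ~0.022 nats; 0 would mean P and Q are identical
```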

This is the main idea of GANs. The loss function of GANs is defined as

$$\min\_{\mathcal{G}} \max\_{\mathcal{D}} \; \mathbb{E}\_{X \sim P(X)} \left[ \log \mathcal{D}(X; \Theta\_{\mathcal{D}}) \right] + \mathbb{E}\_{L \sim Q(L)} \left[ \log \left( 1 - \mathcal{D}(\mathcal{G}(L; \Theta\_{\mathcal{G}}); \Theta\_{\mathcal{D}}) \right) \right] \tag{14}$$

where $\mathcal{G}$ denotes the generator and $\mathcal{D}$ denotes the discriminator. Our goal is to make the distribution of the generated samples $\mathcal{G}(L)$ close to the real data distribution $P(X)$. In other words, when the generative distribution of the fake examples is close to the distribution of the real samples, the discriminator cannot distinguish the fake from the real.
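The connection asserted above between the Jensen-Shannon divergence and Eq. (14) can be made explicit with a well-known result from [25]: for a fixed generator, the optimal discriminator is $\mathcal{D}^{\ast}(X) = P(X) / (P(X) + Q\_{\mathcal{G}}(X))$, where $Q\_{\mathcal{G}}$ denotes the distribution of the generated samples. Substituting it back into Eq. (14), the generator's objective reduces, up to a constant, to

$$2 \, JSD(P \| Q\_{\mathcal{G}}) - \log 4$$

so minimizing the generator's loss is equivalent to minimizing the Jensen-Shannon divergence between the real and generated distributions.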

*5.1.2 Log loss*

Log loss is widely used in current deep neural networks due to its simplicity and power. The binary log loss function is defined as
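$$-\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$

where $y \in \{0, 1\}$ is the true label and $\hat{y}$ is the predicted probability of the positive class; this is the standard textbook form of the binary cross-entropy.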
