**5. Loss functions**

Before introducing the loss functions, we need to understand that the ultimate goal of training a neural network $\mathcal{F}(X; \Theta)$ is to find a suitable set of parameters $\Theta$ so that the model achieves good performance on unseen samples (i.e., the test dataset). The typical way to search for $\Theta$ in machine learning is to use loss functions as a criterion during training; in other words, training neural networks is equivalent to optimizing the loss functions by back-propagation. More precisely, a loss function outputs a scalar value that measures the difference between the predicted result and the true label for one sample, and during training our goal is to minimize this value averaged over $m$ training samples (i.e., the cost function):

$$\mathcal{J} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_i \tag{11}$$

where $\mathcal{L}_i$ denotes the loss for training sample $i$, and $\mathcal{J}$ is often known as the cost function, which is simply the mean of the losses over the $m$ training samples (usually, a batch of $m$ training samples is fed into a CNN during each iteration of training). Therefore, as shown in **Figure 1**, loss functions play a significant role in constructing CNNs.

#### **5.1 Divergence loss functions**

#### *5.1.1 Kullback-Leibler divergence*

Before introducing the Kullback–Leibler divergence, we need to understand that a fundamental goal of deep learning is to learn a data distribution *Q* over the training dataset so that *Q* is close to the true data distribution *P*. Back in 1951, the Kullback–Leibler divergence was proposed to measure the difference between two distributions on the same probability space [41]. It is defined as

$$\begin{split} D_{\text{KL}}(P||Q) &= \sum_{X} P(X) \log P(X) - \sum_{X} P(X) \log Q(X) \\ &= \sum_{X} P(X) \log \frac{P(X)}{Q(X)} \end{split} \tag{12}$$

where $D_{\text{KL}}(P||Q)$ denotes the Kullback–Leibler divergence from $Q$ to $P$. Here $-\sum_{X} P(X) \log P(X)$ is the entropy of $P$ and $-\sum_{X} P(X) \log Q(X)$ is the cross entropy of $P$ and $Q$, so the divergence is exactly the cross entropy minus the entropy.
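To make Eq. (12) concrete, the short sketch below (plain NumPy; the two distributions are made up for illustration) computes the divergence both directly and via the cross-entropy-minus-entropy decomposition.

```python
import numpy as np

def kl_divergence(p, q):
    """Eq. (12): D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
Q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

entropy = -np.sum(P * np.log(P))         # H(P)
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
print(kl_divergence(P, Q))               # ~0.0851
print(cross_entropy - entropy)           # same value, per Eq. (12)
```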

There is also a symmetrized form of the Kullback–Leibler divergence, known as the Jensen–Shannon divergence, which measures the similarity between *P* and *Q*:

$$JSD(P||Q) = \frac{1}{2} D_{KL}\left( P \,||\, \frac{P+Q}{2} \right) + \frac{1}{2} D_{KL}\left( Q \,||\, \frac{P+Q}{2} \right) \tag{13}$$

Specifically, $JSD(P||Q) = 0$ means the two distributions are identical. Therefore, if we minimize the Jensen–Shannon divergence, we can make the distribution *Q* close to the distribution *P*. More specifically, if *P* denotes the true data distribution and *Q* denotes the distribution learned by a CNN model, then by minimizing the divergence we learn a model that is close to the true data distribution. This is the main idea of GANs. The loss function of GANs is defined as

$$\min_{\mathcal{G}} \; \max_{\mathcal{D}} \quad \mathbb{E}_{X \sim P(X)} \log \mathcal{D}(X; \Theta_{\mathcal{D}}) + \mathbb{E}_{L \sim Q(L)} \log \left( 1 - \mathcal{D}(\mathcal{G}(L; \Theta_{\mathcal{G}}); \Theta_{\mathcal{D}}) \right) \tag{14}$$

where $\mathcal{G}$ denotes the generator and $\mathcal{D}$ denotes the discriminator. Our goal is to make $Q(\mathcal{G}(L))$, the distribution of generated samples, close to the real data distribution $P(X)$. In other words, when the generative distribution of fake examples is close to the distribution of real samples, the discriminator cannot distinguish between the fake and the real. **Table 1** summarizes representative GAN architectures proposed in recent years.
| Name | Year | Summary |
|---|---|---|
| GANs [25] | 2014 | The original version of GANs, where G and D are implemented with fully connected neural networks. |
| Conditional GANs [26] | 2014 | Labels are included in G and D. |
| Laplacian Pyramid GANs [27] | 2015 | CNNs with the Laplacian pyramid method. |
| Deep Convolutional GANs [28] | 2015 | Transposed convolutional layers are used to construct G. |
| Bidirectional GANs [29] | 2016 | An extra encoder was adopted based on the traditional GANs. |
| Semi-supervised GANs [30] | 2016 | The D can also classify the real samples while distinguishing the real and fake. |
| InfoGANs [31] | 2016 | An extra classifier was added into the GANs. |
| Energy-based GANs [32] | 2016 | The D was replaced with an encoder-decoder network. |
| Auxiliary Classifier GANs [33] | 2017 | An auxiliary classifier was used in the D. |
| Progressive GANs [34] | 2017 | Progressive steps are adopted to expand the networks. |
| BigGANs [35] | 2018 | A large GANs with self-attention module and hinge loss. |
| Self-attention GANs [36] | 2019 | The self-attention mechanism is proposed to build G and D. |
| Label-noise Robust GANs [37] | 2019 | A noise transition model is included in D. |
| AutoGANs [38] | 2019 | The neural architecture search algorithm is used to obtain G and D. |
| Your Local GANs [39] | 2020 | A new local sparse attention layer was proposed. |
| MSG-GANs [40] | 2020 | There are connections from G to D. |

**Table 1.**
*Representative architectures of GANs in recent years.*
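As a concrete illustration of Eq. (14), the following PyTorch sketch alternates a discriminator ascent step and a generator descent step. The fully connected architectures, latent dimension, and learning rates are placeholder assumptions for illustration, not a prescribed implementation.

```python
import torch
import torch.nn as nn

latent_dim = 64  # dimension of L ~ Q(L); illustrative choice
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(x_real):
    """One alternating update of the minimax objective in Eq. (14)."""
    z = torch.randn(x_real.size(0), latent_dim)  # sample L ~ Q(L)

    # Discriminator step: maximize log D(X) + log(1 - D(G(L)))
    opt_d.zero_grad()
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1 - D(G(z).detach())).mean())
    d_loss.backward()
    opt_d.step()

    # Generator step: minimize log(1 - D(G(L)))
    opt_g.zero_grad()
    g_loss = torch.log(1 - D(G(z))).mean()
    g_loss.backward()
    opt_g.step()

train_step(torch.randn(16, 784))  # e.g., a batch of 16 flattened images
```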

#### *5.1.2 Log loss*

Log loss is widely used in current deep neural networks due to its simplicity and power. The binary log loss function is defined as


$$\mathcal{L}_{binary} = -Y \log\left(\hat{Y}\right) - (1 - Y) \log\left(1 - \hat{Y}\right) \quad \left(Y \in \{0, 1\}\right) \tag{15}$$


where $Y \in \{0, 1\}$ denotes the binary label for a sample and $\hat{Y}$ is the predicted result (i.e., given a training sample with its corresponding label $\{X, Y\}$, we obtain the predicted result from an encoder network as $\hat{Y} = \mathcal{F}_{encoder}(X, \Theta)$).

When the learning task is multi-class classification, each sample label is normally encoded in the one-hot-encoding format, which can be denoted as $Y = \left(y_1, y_2, \ldots, y_{nclass}\right)^{T}$, i.e., if the label is 3, then only $y_3 = 1$ and all the others are 0. Therefore, the log loss for one sample can be written as

$$\mathcal{L}_{\log} = -\sum_{i=1}^{nclass} \mathbf{1}\{y_i = 1\} \log\left(\hat{y}_i\right) \tag{16}$$

where $\hat{y}_i$ is the predicted result for label $y_i$, and $\mathbf{1}\{y_i = 1\}$ denotes the indicator function, whose output is 1 if $y_i = 1$ and 0 otherwise.
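As a quick illustration of Eqs. (15) and (16), the sketch below evaluates both losses on made-up predictions (NumPy; all values are illustrative).

```python
import numpy as np

def binary_log_loss(y, y_hat):
    # Eq. (15): -Y log(Y_hat) - (1 - Y) log(1 - Y_hat)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def multiclass_log_loss(y_onehot, y_hat):
    # Eq. (16): only the true class survives the indicator function
    return -np.sum(y_onehot * np.log(y_hat))

print(binary_log_loss(1, 0.9))           # -log(0.9) ~ 0.105
y = np.array([0, 0, 1, 0])               # one-hot label, class 3
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # e.g., a softmax output
print(multiclass_log_loss(y, y_hat))     # -log(0.6) ~ 0.511
```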

We may wonder why the log loss is a reasonable choice. Informally, let $Y$ denote the data distribution and $\hat{Y}$ denote the distribution learned by our model; then, based on the Kullback–Leibler divergence, we have

$$D_{KL}\left(Y||\hat{Y}\right) = \sum Y \log Y - \sum Y \log \hat{Y} \tag{17}$$

Our goal is to minimize the divergence between $Y$ and $\hat{Y}$ so that the distribution obtained by our model is close to the true data distribution. Because the term $\sum Y \log Y$ depends only on the data and not on the model, we only need to optimize the cross-entropy term $-\sum Y \log \hat{Y}$. Therefore, log loss is also well known as cross-entropy loss.

#### *5.1.3 Mean squared error*

The mean squared error is probably one of the most familiar loss functions, as it closely resembles the classical least-squares objective. It directly calculates the difference between the predicted result and the true label, and is denoted as

$$\mathcal{L}_{mean} = \frac{1}{2} \left( Y - \hat{Y} \right)^2 \tag{18}$$

One example that can help us understand the mean squared error more deeply is that minimizing the mean squared loss of a linear regression model is equivalent to maximum-likelihood estimation under the assumption of Gaussian noise. In other words, it is a method to optimize the parameters of our model so that the observed training data are most probable under the distribution learned by the model. Therefore, the fundamental goal is still the same as above, which is to make the model distribution and the data distribution as close as possible.
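A minimal numerical sketch of Eq. (18) follows (NumPy; values are illustrative). The $\frac{1}{2}$ factor is conventional because it cancels when differentiating, giving the clean gradient $\hat{Y} - Y$.

```python
import numpy as np

def mse_loss(y, y_hat):
    # Eq. (18): 0.5 * (Y - Y_hat)^2
    return 0.5 * (y - y_hat) ** 2

# Gradient w.r.t. the prediction: d/dy_hat [0.5 (y - y_hat)^2] = y_hat - y
y, y_hat = 3.0, 2.5
print(mse_loss(y, y_hat))   # 0.125
print(y_hat - y)            # gradient = -0.5
```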

#### **5.2 Margin loss functions**

Margin loss functions represent a family of margin-maximizing loss functions. Typical examples include the Hinge loss, the Contrastive loss, and the Triplet loss. Unlike the divergence loss functions, margin loss functions calculate relative distances between outputs, which makes them more flexible with respect to the training data.


#### *5.2.1 Hinge loss*


Hinge loss is well known for training Support Vector Machine classifiers. Specifically, there are two main types of hinge losses. The first type, for samples with only one correct label, is denoted as

$$\mathcal{L}_{hinge} = \sum_{i \neq k}^{nclass} \max\left(0, \Delta + \hat{y}_i - \hat{y}_k\right)^p \quad \left(y_k = 1, \; \sum_{i \neq k}^{nclass} y_i = 0\right) \tag{19}$$

where $y_i$ denotes each element in the one-hot-encoding label and $y_k$ is the correct class; $\hat{y}_i$ represents the predicted result of our neural network for each class. $\Delta = 1$ is the standard choice for the margin. If $p = 1$, the above loss is the standard Hinge loss, and if $p = 2$, it is the Squared Hinge loss.

However, in real tasks such as attribute classification, each sample can have multiple correct labels; e.g., a photo posted on Facebook may include a set of hashtags. Therefore, the second type, for multiple labels, is

$$\mathcal{L}_{hinge} = \sum_{i=1}^{nclass} \max\left(0, \Delta - \delta(y_i = 1)\,\hat{y}_i\right)^p \tag{20}$$

where $\delta(y_i = 1) = +1$ if $y_i = 1$; otherwise $\delta(y_i = 1) = -1$. $\Delta = 1$ is still the common choice for the margin, and $p = 1$ or $p = 2$.
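The sketch below evaluates both variants, Eqs. (19) and (20), on made-up scores (NumPy; the scores and labels are purely illustrative).

```python
import numpy as np

def hinge_single_label(scores, k, delta=1.0, p=1):
    # Eq. (19): sum over i != k of max(0, delta + s_i - s_k)^p
    margins = np.maximum(0.0, delta + scores - scores[k])
    margins[k] = 0.0  # exclude the correct class from the sum
    return np.sum(margins ** p)

def hinge_multi_label(scores, labels, delta=1.0, p=1):
    # Eq. (20): delta(y_i = 1) is +1 for positive labels, -1 otherwise
    sign = np.where(labels == 1, 1.0, -1.0)
    return np.sum(np.maximum(0.0, delta - sign * scores) ** p)

scores = np.array([2.0, 0.5, 1.8])
print(hinge_single_label(scores, k=0))                 # 0.8
print(hinge_multi_label(scores, np.array([1, 0, 1])))  # 1.5
```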

#### *5.2.2 Contrastive loss*

Contrastive loss is specially designed for measuring the similarity of a pair of training samples. Consider two pairs of samples $\{X_a, X_p\}$ and $\{X_a, X_n\}$, where $X_a$ is known as the anchor sample, $X_p$ denotes the positive sample, and $X_n$ represents the negative sample. Specifically, if the pair $\{X_a, X_p\}$ is matched, the loss for the pair is the distance between their outputs from the network, $d(Z_a, Z_p)$. If the pair $\{X_a, X_n\}$ is unmatched and the distance between their outputs from the model is smaller than the pre-defined margin (i.e., $\Delta - d(Z_a, Z_n) > 0$), then we also need to calculate the loss. Formally, we have

$$\mathcal{L}_{contrastive} = \begin{cases} d(Z_a, Z_p) & \text{if matched pair} \\ \max\left(0, \Delta - d(Z_a, Z_n)\right) & \text{if unmatched pair} \end{cases} \tag{21}$$

where $d$ can be the Euclidean distance (i.e., $d(Z_a, Z_p) = \|Z_a - Z_p\|_2$). Alternatively, the above equation can be rewritten as

$$\mathcal{L}_{contrastive} = y\, d\left(Z_a, Z_p\right) + \left(1 - y\right) \max\left(0, \Delta - d(Z_a, Z_n)\right) \tag{22}$$

where $y = 1$ if the given pair is matched, and otherwise $y = 0$. $\Delta$ is the margin, which affects how the loss is calculated for unmatched pairs.
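A small sketch of Eq. (22) on random placeholder embeddings (NumPy; the embeddings stand in for network outputs $Z$):

```python
import numpy as np

def contrastive_loss(z1, z2, y, delta=1.0):
    # y = 1 for a matched pair, y = 0 for an unmatched pair (Eq. 22)
    d = np.linalg.norm(z1 - z2)  # Euclidean distance
    return y * d + (1 - y) * max(0.0, delta - d)

rng = np.random.default_rng(0)
z_a, z_p, z_n = rng.normal(size=(3, 8))    # placeholder embeddings
print(contrastive_loss(z_a, z_p, y=1))     # pulls matched pairs together
print(contrastive_loss(z_a, z_n, y=0))     # pushes unmatched pairs apart
```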

#### *5.2.3 Triplet loss*

Triplet loss looks similar to the contrastive loss, but it measures the difference between the matched pair and the unmatched pair. Considering three samples $\{X_a, X_p, X_n\}$, the Triplet loss is denoted as

$$\mathcal{L}_{triplet} = \max\left(0, \Delta + d\left(Z_a, Z_p\right) - d(Z_a, Z_n)\right) \tag{23}$$

Note that minimizing this loss function is equivalent to minimizing the distances of matched pairs while maximizing the distances of unmatched pairs.
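A corresponding sketch of Eq. (23), again on placeholder embeddings (NumPy):

```python
import numpy as np

def triplet_loss(z_a, z_p, z_n, delta=1.0):
    d_ap = np.linalg.norm(z_a - z_p)      # anchor-positive distance
    d_an = np.linalg.norm(z_a - z_n)      # anchor-negative distance
    # Eq. (23): zero once d_an exceeds d_ap by at least the margin
    return max(0.0, delta + d_ap - d_an)

rng = np.random.default_rng(1)
z_a, z_p, z_n = rng.normal(size=(3, 8))   # placeholder embeddings
print(triplet_loss(z_a, z_p, z_n))
```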

… utilizes a set of default boxes with different aspect ratios, and each box outputs the shape offsets and the class confidences [44].

#### *6.1.3 Pose estimation*

The multiple levels of representations learned in the multiple layers of CNNs can also be used for solving the task of human-body pose estimation. Specifically, there are mainly two types of approaches: regression of body-joint coordinates and heat-maps for each body part. In 2014, a framework called DeepPose [45] was introduced to learn pose estimation with a deep CNN, in which estimating human-body pose is treated as regressing the body-joint coordinates. There are also some extensions of this method, such as a process called iterative error feedback [46], which encompasses both the input and output spaces of a CNN to enhance performance. In 2014, Tompson et al. [47] proposed a hybrid architecture consisting of a CNN and a Markov Random Field, in which the output of the CNN for an input image is a heat-map. There are also some recent works based on the heat-map method, such as [48], in which a multi-context attention mechanism was proposed to incorporate with CNNs.

#### **6.2 Applications with encoder-decoders**

#### *6.2.1 Image restoration*

The goal of image restoration is to recover a clean image from a damaged or corrupted one, as in image denoising and super-resolution. A natural way to implement this idea is to utilize a pre-trained encoder-decoder network, where the encoder maps a noisy image into a high-level representation and the decoder transforms the representation back into a clean image. For example, Mao et al. [49] apply a deep convolutional encoder-decoder network for image restoration, in which the shortcut connection method is adopted between the encoder and decoder, as demonstrated in Section 3.2, and the transposed convolution is used to construct the decoder network, as mentioned in Section 2.2. Similar work in [50] has also been introduced for image restoration, in which a residual method is used in the network (i.e., in Section 3.2).

#### *6.2.2 Image segmentation*

The task of image segmentation is to map an input image into a segmented output image. Encoder-decoder networks have developed dramatically in recent years and have had a significant impact on computer vision. Specifically, there are mainly two types of tasks: semantic segmentation and instance segmentation. In 2015, Long et al. [20] first showed that an end-to-end fully convolutional network can achieve state-of-the-art results in image segmentation tasks. Similar work was also introduced in [6] in 2015, in which a U-Net architecture was proposed for medical image segmentation; the main advance in this architecture is that the shortcut connection method is also used between the encoder and decoder network. Since then, a series of papers based on these two methods have been published, and nowadays U-Net-based architectures are widely used for medical image diagnosis. A minimal sketch of such an encoder-decoder with a shortcut connection is given below.
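The following PyTorch sketch illustrates the shortcut-connection idea; the channel counts, depth, and input size are arbitrary illustrative choices, not the architecture of [6].

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal U-Net-style encoder-decoder with one shortcut connection."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1),
                                  nn.ReLU())
        # Transposed convolution upsamples back to the input resolution
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Conv2d(32, 1, 3, padding=1)  # 32 = 16 (up) + 16 (shortcut)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        u = self.up(e2)
        # Shortcut connection: concatenate encoder and decoder features
        return self.dec(torch.cat([u, e1], dim=1))

out = TinyUNet()(torch.randn(1, 1, 32, 32))  # output shape: (1, 1, 32, 32)
```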

#### *6.2.3 Image captioning*

One of the exciting applications achieved by CNNs is image captioning, which is to describe the content of an input image with natural language. The basic idea is as …

