**1. Introduction**

Convolutional Neural Networks (CNNs) are specially designed to handle data that consists of multiple arrays/matrices, such as an image composed of three matrices in the RGB channels [1]. The key idea behind CNNs is the convolution operation, which uses multiple small kernels/filters to extract local features by sliding over the same input. Each kernel outputs a feature map, and all the feature maps are concatenated together; this is known as a convolutional layer and it is the core component of a CNN. Note that these concatenated maps can be further processed by the next layer. To reduce the computational cost, a pooling operation such as max pooling is usually applied to the feature maps. A typical CNN is structured as a series of layers, including multiple convolutional layers and a few fully connected layers. For example, the famous LeNet [2] consists of two convolutional layers and three fully connected layers, with a pooling operation after each convolutional layer.
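The sliding-kernel operation and max pooling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not an efficient implementation; the function names and the toy input sizes are our own choices, not from the chapter.

```python
import numpy as np

def conv2d(image, kernels):
    """Slide each kernel over the input and stack the resulting feature maps.

    image:   (H, W, C) array, e.g. an RGB image with C = 3 channels.
    kernels: (K, k, k, C) array of K small filters.
    Returns: (H - k + 1, W - k + 1, K) array (stride 1, no padding).
    """
    H, W, C = image.shape
    K, k, _, _ = kernels.shape
    out = np.zeros((H - k + 1, W - k + 1, K))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k, :]                   # local receptive field
            out[i, j, :] = np.tensordot(kernels, patch, axes=3)  # one response per kernel
    return out

def max_pool(fmaps, size=2):
    """Non-overlapping max pooling applied to each feature map."""
    H, W, K = fmaps.shape
    H2, W2 = H // size, W // size
    return fmaps[:H2 * size, :W2 * size, :].reshape(H2, size, W2, size, K).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))        # toy "RGB" input
kernels = rng.random((4, 3, 3, 3))   # 4 kernels of size 3 x 3 x 3
fmaps = conv2d(image, kernels)       # shape (6, 6, 4): one map per kernel, concatenated
pooled = max_pool(fmaps)             # shape (3, 3, 4): pooling halves the spatial size
print(fmaps.shape, pooled.shape)
```

Note how the depth of the output (4) equals the number of kernels, while pooling shrinks only the spatial dimensions.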

In addition to building a neural network, a loss function is essential for measuring model performance. The process of training a CNN is therefore transformed into an optimization problem, which normally seeks to minimize the value of the loss function over the training data. Specifically, a gradient-descent-based algorithm is usually adopted to iteratively optimize the parameters of the CNN.
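The loss-minimization loop can be illustrated on the simplest possible model. The sketch below runs plain gradient descent on a mean-squared-error loss for a one-parameter-per-weight linear model; the data, learning rate, and step count are illustrative choices, not values from the chapter, but the update rule is exactly the gradient-descent step described above.

```python
import numpy as np

# Toy data: learn w, b such that y = w * x + b (true values: w = 2, b = 1).
rng = np.random.default_rng(0)
x = rng.random(50)
y = 2.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.5
for step in range(500):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2)       # MSE loss over the training data
    grad_w = np.mean(2 * (pred - y) * x)  # dL/dw
    grad_b = np.mean(2 * (pred - y))      # dL/db
    w -= lr * grad_w                      # gradient-descent update
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # converges to w ≈ 2.0, b ≈ 1.0
```

In a real CNN the parameters are the kernel weights, the gradients are computed by backpropagation, and stochastic variants (SGD, Adam, etc.) replace the full-batch update, but the structure of the loop is the same.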

**Figure 1** shows the high-level abstraction of CNNs in this survey. Specifically, we first introduce the two types of convolution operations in Section 2. Then, four methods for constructing convolutional layers in CNNs are summarized in Section 3. In Section 4, we group current CNN architectures into three types: encoders, encoder-decoders and GANs. Next, we discuss the two main types of loss functions in Section 5. In Section 6, we present advanced applications based on the three types of CNN structures. Finally, in Section 7, we conclude this research and discuss future trends.

#### **Figure 1.**

*High-level abstraction of convolutional neural networks in this survey.*

*Advances in Convolutional Neural Networks DOI: http://dx.doi.org/10.5772/intechopen.93512*

**2.2 Transposed convolution and dilated kernels**

If we directly enlarge a kernel to size *K* = 9 × 9 × *DI*, where *DI* is the depth of the input, the total number of parameters will increase dramatically and the computational cost will be prohibitive. In practice, as shown on the right in **Figure 2**, we can instead insert zeros between the elements of a kernel to obtain a dilated kernel, which enlarges the receptive field without adding parameters. Dilated kernels have been applied in many tasks such as image segmentation [3], translation tasks [4] and speech recognition [5].

Normally, the output feature map generated by the basic convolution is smaller than the input space (e.g., in **Figure 2** the input *I* has dimension 5 × 5 × *DI* while the output *O* has dimension 3 × 3), which yields increasingly high-level abstractions as multiple convolutional layers are stacked. Transposed convolution can be seen as the reverse idea: its primary purpose is to obtain an output feature map that is larger than the input space. As shown on the left in **Figure 3**, the input *I* has size 2 × 2 × *DI*; after transposed convolution, we obtain a 4 × 4 feature map *O*. Specifically, during transposed convolution, each output field in *O* is just the kernel multiplied by the scalar value of one element in *I*.

#### **Figure 3.**

*Left: a demonstration of transposed 2D convolution with a 3 × 3 × DI kernel (stride = 1, padding = 0), where I ∈ ℝ<sup>2×2×DI</sup> is the spatial input and O ∈ ℝ<sup>4×4</sup> is the 1-channel output feature map. Note that the receptive fields in O can overlap, and we normally sum the values where outputs overlap. Right: a dilated kernel for increasing the receptive field in the input, where the empty space between elements represents 0.*

Similarly, dilated kernels can also be used in transposed convolution. The main reason we need transposed convolution is that it is the fundamental idea behind decoder networks, which map a latent space to an output image, such as the decoders in U-Net [6] and GANs. Specifically, transposed convolution is widely used in tasks such as model visualization [7], image segmentation [6], image classification [8] and image super-resolution [9].

**3. Convolutional layers**

The core components of CNNs are convolutional layers. In the last section, we demonstrated two types of convolution operations, which are the main ideas used to construct convolutional layers. In this part, we summarize the main methods in deep learning for building convolutional layers: basic convolutional layers, convolutional layers with shortcut connections, convolutional layers with mixed kernels, and convolutional capsule layers.

**3.1 Basic convolutional layers**

Recall that there are normally *DO* kernels in one convolutional layer, where *DO* also denotes the depth of the output feature map. In other words, the number of kernels determines the depth of the output feature map.
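The relation above between the number of kernels *DO*, the output depth, and the layer's parameter count can be checked with a short sketch. The concrete sizes below (16 kernels of size 3 × 3 over a 3-channel, 32 × 32 input) are illustrative values, not taken from the chapter.

```python
# A basic convolutional layer holds DO kernels of size k x k x DI, each with one bias.
DI, DO, k = 3, 16, 3   # input depth, number of kernels, kernel size (illustrative)
H, W = 32, 32          # spatial size of the input

params = DO * (k * k * DI + 1)       # each kernel: k*k*DI weights + 1 bias
out_h, out_w = H - k + 1, W - k + 1  # spatial output size for stride 1, no padding

print(params)            # 16 * (3*3*3 + 1) = 448
print(out_h, out_w, DO)  # output feature map: 30 x 30 with depth DO = 16
```

Changing *DO* changes only the depth of the output (and the parameter count linearly); the spatial size is governed by the kernel size, stride, and padding.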
