
methods are summarized for constructing convolutional layers in CNNs in Section 3. In Section 4, we group the current CNN architectures into three types: encoder, encoder-decoder and GANs. Next, we discuss two main types of loss functions in Section 5. In Section 6, we give the advanced applications based on the three types of CNN structures. Finally, in Section 7, we conclude this research and give future trends.

**Figure 1.**
*High-level abstraction of convolutional neural networks in this survey.*

**2. Convolution operations**

The main reason why CNNs are so successful on a variety of problems is that kernels (also known as filters) with fixed numbers of parameters are adopted to handle spatial data such as images. In particular, the weight-sharing mechanism helps reduce the number of parameters, keeping the computational cost low while preserving spatial invariance. In general, there are two main types of convolution operations: basic convolution and transposed convolution.
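To make the effect of weight sharing concrete, here is a rough sketch (assuming PyTorch; the channel counts and the $32 \times 32$ spatial size are illustrative, not from the text) comparing the parameter count of a convolutional layer with that of a fully connected mapping between feature maps of the same sizes.

```python
# A rough sketch (assuming PyTorch; sizes are illustrative) of why weight sharing keeps
# the parameter count small: a convolutional layer reuses one kernel at every spatial
# location, while a fully connected mapping needs a weight per input-output pair.
import torch.nn as nn

in_channels, out_channels, kernel_size = 64, 128, 3
conv = nn.Conv2d(in_channels, out_channels, kernel_size)

# Convolution parameters: 128 * 64 * 3 * 3 weights + 128 biases = 73,856,
# independent of the spatial size of the input.
conv_params = sum(p.numel() for p in conv.parameters())

# A dense mapping between two 32 x 32 feature maps of these depths would need
# (64 * 32 * 32) * (128 * 32 * 32) weights, which is several orders of magnitude more.
dense_weights = (in_channels * 32 * 32) * (out_channels * 32 * 32)

print(conv_params, dense_weights)
```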

### **2.1 Basic convolution and dilated kernels**

As shown on the left in **Figure 2**, the convolution operation is essentially a linear model applied to a local spatial patch of the input. Specifically, it computes the sum of element-wise products between the local input patch and the kernel (usually plus a bias), and outputs a single value after an activation function. Each kernel slides over all spatial locations in the input with a fixed stride, producing a 1-channel feature map. Note that there are generally many kernels in one convolutional layer, and all of their output feature maps are concatenated together; e.g., if the number of kernels in the layer is $D_O$, we obtain a feature map $O \in \mathbb{R}^{3 \times 3 \times D_O}$. While a kernel size of $3 \times 3$ is widely used in current CNNs, we may need larger receptive fields in the input to observe more information during each convolution operation. However, if we directly increase the kernel size, e.g., to $K = 9 \times 9 \times D_I$, where $D_I$ is the depth of the input, the total number of parameters increases dramatically and the computational cost becomes prohibitive. In practice, as shown on the right in **Figure 2**, we can instead insert zeros between the elements of the kernel to obtain a dilated kernel. Dilated kernels have been applied in many tasks such as image segmentation [3], translation [4] and speech recognition [5].

**Figure 2.**
*Left: A demonstration of basic 2D convolution with a $3 \times 3 \times D_I$ kernel (stride = 1, padding = 0), where $I \in \mathbb{R}^{5 \times 5 \times D_I}$ is the spatial input and $O \in \mathbb{R}^{3 \times 3}$ is the 1-channel output feature map. Right: A dilated kernel for increasing the receptive field in the input, where the empty space between the elements represents zeros.*
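As a concrete illustration (a minimal sketch assuming PyTorch; the depth $D_I$ is arbitrary), the snippet below applies an ordinary $3 \times 3$ convolution and a dilated one to a $5 \times 5$ input, showing that dilation enlarges the receptive field without adding parameters.

```python
# A minimal sketch (assuming PyTorch; the depth D_I is arbitrary) of the basic
# convolution on the left of Figure 2 and of a dilated kernel as on the right.
import torch
import torch.nn as nn

D_I = 16
x = torch.randn(1, D_I, 5, 5)             # a 5 x 5 x D_I input in NCHW layout

conv = nn.Conv2d(D_I, 1, kernel_size=3, stride=1, padding=0)
print(conv(x).shape)                       # torch.Size([1, 1, 3, 3]): the 3 x 3 map O

# dilation=2 inserts one zero between kernel elements, so the 3 x 3 kernel covers a
# 5 x 5 region of the input while keeping the same number of parameters.
dilated = nn.Conv2d(D_I, 1, kernel_size=3, stride=1, padding=0, dilation=2)
print(dilated(x).shape)                    # torch.Size([1, 1, 1, 1])
```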





### **2.2 Transposed convolution and dilated kernels**

Normally, the output feature map produced by basic convolution is smaller than the input space (e.g., in **Figure 2** the input $I$ has dimension $5 \times 5 \times D_I$ while the output $O$ has dimension $3 \times 3$), which yields increasingly high-level abstractions when multiple convolutional layers are stacked. Transposed convolution can be seen as the reverse idea of basic convolution: its primary purpose is to obtain an output feature map that is larger than the input space. As shown on the left in **Figure 3**, the input $I$ has size $2 \times 2 \times D_I$; after transposed convolution, we obtain a $4 \times 4$ feature map $O$. Specifically, during transposed convolution, each output field in $O$ is simply the kernel multiplied by the scalar value of one element of $I$, and the values are summed wherever these fields overlap.

**Figure 3.**
*Left: A demonstration of transposed 2D convolution with a $3 \times 3 \times D_I$ kernel (stride = 1, padding = 0), where $I \in \mathbb{R}^{2 \times 2 \times D_I}$ is the spatial input and $O \in \mathbb{R}^{4 \times 4}$ is the 1-channel output feature map. Note that the receptive fields in $O$ can overlap, and we normally sum the values where the outputs overlap. Right: A dilated kernel for increasing the receptive field in the input, where the empty space between the elements represents zeros.*
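The scatter-and-sum view described above can be written out directly; the following NumPy sketch (single channel only, for clarity) reproduces the $2 \times 2$ to $4 \times 4$ case of **Figure 3** with stride 1 and no padding.

```python
# A minimal NumPy sketch of transposed convolution as in Figure 3 (stride = 1,
# padding = 0, single channel): each element of the 2 x 2 input scales the 3 x 3
# kernel, and the scaled copies are summed where they overlap in the output.
import numpy as np

I = np.random.randn(2, 2)          # spatial input (the depth D_I is omitted here)
K = np.random.randn(3, 3)          # kernel

O = np.zeros((4, 4))               # output size: (2 - 1) * 1 + 3 = 4 per spatial dimension
for i in range(2):
    for j in range(2):
        O[i:i + 3, j:j + 3] += I[i, j] * K   # place kernel * scalar, summing overlaps

print(O.shape)                     # (4, 4)
```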

Similarly, dilated kernels can also be used in transposed convolution. The main reason we need transposed convolution is that it is the fundamental building block of a decoder network, which maps a latent space back to an output image, as in the decoders of U-Net [6] and GANs. Transposed convolution is widely used in tasks such as model visualization [7], image segmentation [6], image classification [8] and image super-resolution [9].
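As a minimal sketch of this decoder idea (assuming PyTorch; the layer sizes below are illustrative and not taken from any particular network), two stacked transposed convolutions upsample a small latent feature map into an image-shaped output; `nn.ConvTranspose2d` also exposes a `dilation` argument when dilated kernels are wanted.

```python
# A sketch (assuming PyTorch; the layer sizes are illustrative) of a tiny decoder built
# from stacked transposed convolutions, mapping a latent feature map to a larger output.
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 8x8 -> 16x16
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
)

z = torch.randn(1, 128, 8, 8)      # latent feature map
print(decoder(z).shape)            # torch.Size([1, 3, 32, 32])
```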
