**3. Convolutional layers**

The core components of CNNs are convolutional layers. In the previous section, we demonstrated two types of convolution operations, which are the main ideas used to construct convolutional layers. In this part, we summarize the main methods in deep learning for building convolutional layers, including basic convolutional layers, convolutional layers with shortcut connections, convolutional layers with mixed kernels and convolutional capsule layers.

#### **3.1 Basic convolutional layers**

Recall that there are normally $D_O$ kernels in one convolutional layer, where $D_O$ also denotes the depth of the output feature map. In other words, the number of channels in the output map depends on the number of kernels used in the convolutional layer. More formally, we can denote it as

$$O = \sum\_{i=1}^{D\_O} I \ast K\_i \tag{1}$$


where ∗ represents the convolution operation addressed above, $\sum$ denotes the concatenation operation (stacking the $D_O$ response maps along the depth dimension) and $O \in \mathbb{R}^{W_O \times H_O \times D_O}$ is the output feature map. After the convolution operation, a non-linear activation function is applied to each element of the concatenated feature map, which can be denoted as

$$O = \sigma(O) \tag{2}$$

While there are many variants of activation functions, the typical ones which are widely adopted are ReLU $\sigma(x) = \max(0, x)$, tanh $\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ and sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$. Note that non-linear activation functions are essential for building multi-layer networks: without them, stacked layers collapse into a single linear transformation, whereas a two-layer network with a non-linear activation and enough neurons can uniformly approximate any continuous function, which is known as the universal approximation theorem [10].
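As a concrete illustration of Eqs. (1) and (2), the following minimal sketch builds a basic convolutional layer followed by a ReLU activation. PyTorch and the channel sizes $D_I = 3$, $D_O = 16$ are assumptions for illustration only, since the chapter does not prescribe a framework.

```python
import torch
import torch.nn as nn

D_I, D_O = 3, 16                                 # illustrative input/output depths
conv = nn.Conv2d(in_channels=D_I, out_channels=D_O,
                 kernel_size=3, padding=1)       # D_O kernels of size 3 x 3
sigma = nn.ReLU()                                # non-linear activation, Eq. (2)

I = torch.randn(1, D_I, 32, 32)                  # a dummy input I of shape (N, D_I, H_I, W_I)
O = sigma(conv(I))                               # stack of D_O response maps, Eq. (1)
print(O.shape)                                   # torch.Size([1, 16, 32, 32])
```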

Note that after the convolution operation, the width and height of the output feature map $O \in \mathbb{R}^{W_O \times H_O \times D_O}$ are usually close to the width and height of the input $I \in \mathbb{R}^{W_I \times H_I \times D_I}$. To further reduce the dimensions of the output feature maps and thus the computational cost, the pooling operation is widely used in current CNNs. Specifically, a 2D pooling operation involves two main hyper-parameters: the filter size $F \times F$ and the stride $S$. After pooling, the width of the feature map $O$ is reduced to $W_O = (W_O - F)/S + 1$ and its height to $H_O = (H_O - F)/S + 1$. In brief, we can have

$$O = pool(O) \tag{3}$$

where $pool(\cdot)$ denotes the pooling operation discussed above. Typical pooling operations include max-pooling and average-pooling. A common choice is to use a $2 \times 2$ filter with stride $S = 2$, which means that every 4 pixels in the 2D feature map $O$ are compressed into 1 pixel. As a toy example, suppose that there are only four pixels $O = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$; then $pool_{max}(O) = [4]$ and $pool_{avg}(O) = [2.5]$.
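To verify the toy example above, here is a short sketch (again assuming PyTorch) that applies $2 \times 2$ max-pooling and average-pooling with stride 2:

```python
import torch
import torch.nn.functional as F

# The 2 x 2 toy feature map O = [[1, 2], [3, 4]], shaped as (N, C, H, W)
O = torch.tensor([[[[1., 2.], [3., 4.]]]])

print(F.max_pool2d(O, kernel_size=2, stride=2))  # tensor([[[[4.]]]])
print(F.avg_pool2d(O, kernel_size=2, stride=2))  # tensor([[[[2.5000]]]])
```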

#### **3.2 Convolutional layers with shortcut connection**

It is true that deep neural networks can normally learn better representations from the data than shallow neural networks. However, stacking more layers in a CNN can lead to vanishing or exploding gradients, which make the network hard to optimize. A simple and effective way to address this problem is to use shortcut connections, which directly pass information from a previous layer to the current layer in a network.

$$O = \sigma\left(\left(\sum\_{i=1}^{D\_O} I\_{\text{current}} \ast K\_i\right) \oplus I\_{\text{previous}}\right) \tag{4}$$

Note that ⊕ can denote two types of operations:

• **Element-Wise Sum:** Each element in $I_{current}$ is added to the corresponding element in $I_{previous}$, which means that the dimensions of $I_{current}$ and $I_{previous}$ must be the same; the result is an output $O$ of the same size. This type of operation is well known as the **identity shortcut connection** and is the core idea in ResNet [11, 12]. The main advantage is that it does not add any extra parameters or computational complexity. The disadvantage is that the dimension constraint makes it inflexible.

• **Concatenation:** We can concatenate the current output and the previous input together. Suppose the size of the current output feature map is $W \times H \times D_O$ and the size of the previous input is $W \times H \times D_I$; after concatenation, we obtain a concatenated feature map $O$ of size $W \times H \times (D_O + D_I)$. Note that the widths and heights of the input and output must be the same. The advantage is that we retain the information from the previous layers. The disadvantage is that we need extra parameters to handle the concatenated feature map $O$ (i.e., the depth of the kernels that process feature map $O$ is $D_O + D_I$). This type of convolutional layer is broadly adopted in networks for image segmentation such as U-Net [6].
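A minimal sketch of the two variants of ⊕ in Eq. (4), assuming PyTorch and an illustrative depth of 16 channels:

```python
import torch
import torch.nn as nn

D = 16
conv = nn.Conv2d(D, D, kernel_size=3, padding=1)

I_current = torch.randn(1, D, 32, 32)      # input to the current layer
I_previous = torch.randn(1, D, 32, 32)     # feature map carried over by the shortcut

# (a) identity shortcut (element-wise sum), as in ResNet: shapes must match
O_sum = torch.relu(conv(I_current) + I_previous)

# (b) concatenation along the depth dimension, as in U-Net-style skip connections
O_cat = torch.relu(torch.cat([conv(I_current), I_previous], dim=1))

print(O_sum.shape, O_cat.shape)  # torch.Size([1, 16, 32, 32]) torch.Size([1, 32, 32, 32])
```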


#### **3.3 Convolutional layers with mixed kernels**

So far we have demonstrated that we normally use many convolutional kernels of the same size, such as $3 \times 3$, in one convolutional layer. To enlarge the receptive field, we may adopt dilated kernels instead. However, it is difficult to know in advance what kernel size we should use in a CNN. Naturally, we may apply several kernel sizes in each convolutional layer, e.g., $1 \times 1$, $3 \times 3$ and $5 \times 5$ kernels all adopted in one convolutional layer. More formally, we define a convolutional layer with mixed kernels as

$$O = \sigma \left( \sum\_{i=1}^{D\_O^1} I \ast K\_i^{1 \times 1} + \sum\_{i=1}^{D\_O^2} I \ast K\_i^{3 \times 3} + \sum\_{i=1}^{D\_O^3} I \ast K\_i^{5 \times 5} + pool(I) \right) \tag{5}$$

where $pool(I)$ denotes a pooling operation such as max-pooling. Therefore, the size of the output feature map is $W_O \times H_O \times (D_O^1 + D_O^2 + D_O^3 + D_I)$.

However, if we directly add different sizes of kernels to one convolutional layer, the computational cost increases sharply. In the inception module [13, 14], a $1 \times 1$ convolutional layer is therefore applied before the $3 \times 3$ and $5 \times 5$ convolutional layers in order to reduce the computational cost.
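The sketch below shows one possible inception-style implementation of Eq. (5), with $1 \times 1$ bottleneck convolutions placed before the larger kernels as in [13, 14]; the class name and all channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedKernelLayer(nn.Module):
    """Inception-style layer: 1x1, 3x3, 5x5 and pooling branches, concatenated."""
    def __init__(self, d_in, d1, d2, d3, d_pool):
        super().__init__()
        self.branch1 = nn.Conv2d(d_in, d1, kernel_size=1)
        self.branch3 = nn.Sequential(              # 1x1 bottleneck before the 3x3 kernel
            nn.Conv2d(d_in, d2, kernel_size=1),
            nn.Conv2d(d2, d2, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(              # 1x1 bottleneck before the 5x5 kernel
            nn.Conv2d(d_in, d3, kernel_size=1),
            nn.Conv2d(d3, d3, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(d_in, d_pool, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the depth dimension, then apply sigma
        out = torch.cat([self.branch1(x), self.branch3(x),
                         self.branch5(x), self.branch_pool(x)], dim=1)
        return torch.relu(out)

layer = MixedKernelLayer(d_in=64, d1=32, d2=32, d3=16, d_pool=16)
print(layer(torch.randn(1, 64, 28, 28)).shape)     # torch.Size([1, 96, 28, 28])
```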

#### **3.4 Convolutional capsule layers**

*"The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster."—Geoffrey Hinton.*

In general, the pooling operation is essential to reduce the size of the output feature maps so that we can obtain high-level abstractions from the input by stacking multiple convolutional layers in a CNN. However, the cost is that some information in the feature maps is discarded, for example when conducting max-pooling.

In 2017 [15], Hinton et al. proposed an alluring variant of convolutional architectures known as capsule networks, followed by updated versions in 2018 [16] and 2019 [17]. The convolutional capsule layers in capsule networks are very similar to traditional convolutional layers. The main difference is that each capsule (i.e., an element in the convolutional feature maps) has a weight matrix $W_{ij}$ (the sizes are $8 \times 16$ in [15] and $4 \times 4$ in [16], respectively).
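The following sketch illustrates only this capsule-specific step, computing prediction vectors $\hat{u}_{j|i} = W_{ij} u_i$ with $8 \times 16$ weight matrices as in [15]; the routing procedure is omitted and all capsule counts are assumptions for illustration.

```python
import torch

num_in_caps, num_out_caps = 32, 10       # assumed numbers of lower/higher-level capsules
in_dim, out_dim = 8, 16                  # pose-vector dimensions as in [15]

# One weight matrix W_ij (8 x 16) per (input capsule i, output capsule j) pair
W = torch.randn(num_in_caps, num_out_caps, in_dim, out_dim)
u = torch.randn(num_in_caps, in_dim)     # pose vectors of the lower-level capsules

# Prediction vectors u_hat_{j|i} = u_i W_ij, shape (num_in_caps, num_out_caps, out_dim)
u_hat = torch.einsum('id,ijdo->ijo', u, W)
print(u_hat.shape)                       # torch.Size([32, 10, 16])
```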

There are many possible ways to implement an encoder-decoder structure, and many variants have been proposed in the last few years to address its drawbacks. A naive version of an encoder-decoder network, which was introduced in [20], can be denoted as

$$Z = \mathcal{F}_{encoder}(X; \theta_{encoder}) \tag{7}$$

$$\hat{X} = \mathcal{F}_{decoder}(Z; \theta_{decoder}) \tag{8}$$

where $\mathcal{F}_{encoder}$ denotes an encoder CNN that maps an input sample $X$ to a representation $Z$, and $\mathcal{F}_{decoder}$ represents a decoder CNN that reconstructs the input sample from $Z$. Specifically, CNN encoders usually conduct basic convolution operations (i.e., Section 2.1) and CNN decoders perform transposed convolution operations (i.e., Section 2.2).

As shown in the middle of **Figure 4**, an encoder-decoder network is still one complete network and we can train it in an end-to-end manner. Note that there are generally many convolutional layers in each coder network, which means that training a deep encoder-decoder network directly can be challenging. Recall that the shortcut connection is often adopted to address such problems in deep CNNs. Naturally, we can add connections between the encoder and the decoder. An influential network based on this idea is U-Net [6], which is widely applied in many challenging domains such as medical image segmentation. The above two equations can also be rewritten as a composition of two functions:

$$\hat{X} = \mathcal{F}_{decoder} \circ \mathcal{F}_{encoder}(X; \Theta) = \mathcal{F}_{autoencoder}(X; \Theta) \tag{9}$$
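A minimal sketch of Eqs. (7)-(9), assuming PyTorch, a single convolutional layer per coder and arbitrary channel sizes; real encoder-decoder networks such as the one in [20] stack many such layers.

```python
import torch
import torch.nn as nn

# Encoder: basic (strided) convolution, Eq. (7)
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU())

# Decoder: transposed convolution, Eq. (8)
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 3, kernel_size=3, stride=2,
                       padding=1, output_padding=1),
    nn.Sigmoid())

X = torch.randn(1, 3, 32, 32)
Z = encoder(X)                 # representation Z
X_hat = decoder(Z)             # reconstruction, Eq. (9): decoder(encoder(X))
print(Z.shape, X_hat.shape)    # torch.Size([1, 16, 16, 16]) torch.Size([1, 3, 32, 32])
```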

Specifically, in unsupervised learning, an encoder-decoder network is also well known as an autoencoder. Many variants of autoencoders have been proposed in recent years, some famous ones being the variational autoencoder [21], the denoising variational autoencoder [22] and the conditional variational autoencoder [23, 24].

#### **4.3 GANs**

Since generative adversarial networks (GANs) were first proposed by Goodfellow et al. [25] in 2014, this type of architecture, which plays a two-player minimax game, has been extensively studied. This is partly because it is an unsupervised learning method from which we can obtain a generator network that produces fake examples from a latent space (i.e., a vector with some random noise). The right of **Figure 4** shows the basic structure of GANs, in which a generator network maps some input noise into a fake example and tries to make it look as real as possible, while a discriminator network tries to identify the fake sample from its input. By iteratively training the two players, they can both improve. More formally, we can have

$$\hat{Y} = D\left(G(L; \theta_G), X_{real}; \theta_D\right) \tag{10}$$

where $G$ denotes the generator function and $D$ represents the discriminator function. $L$ is the latent-space input to the generator, and its output is a fake example. $X_{real}$ denotes the real samples we have collected, and $\hat{Y} \in [0, 1]$ is the predicted result of the discriminator, indicating whether the input is real or fake.
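A bare-bones sketch of Eq. (10), with tiny fully connected networks standing in for $G$ and $D$; the architectures, sizes and variable names are illustrative assumptions rather than the models used in [25].

```python
import torch
import torch.nn as nn

latent_dim = 100
G = nn.Sequential(                      # generator: latent vector L -> fake example
    nn.Linear(latent_dim, 784), nn.Tanh())
D = nn.Sequential(                      # discriminator: example -> Y_hat in [0, 1]
    nn.Linear(784, 1), nn.Sigmoid())

L = torch.randn(8, latent_dim)          # a batch of latent-space inputs
X_fake = G(L)                           # fake examples from the generator
X_real = torch.rand(8, 784)             # stand-in for collected real samples

Y_hat_fake = D(X_fake)                  # the discriminator tries to push these toward 0
Y_hat_real = D(X_real)                  # ... and these toward 1
print(Y_hat_fake.shape, Y_hat_real.shape)   # torch.Size([8, 1]) torch.Size([8, 1])
```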

As shown in **Table 1**, numerous variants of GAN architectures can be found in the recently published literature, and we broadly summarize these representative networks according to their publication time. Note that the fundamental methods behind these architectures are very similar.

