**2. Convolution operations**

The main reason why CNNs are so successful on such a variety of problems is that kernels (also known as filters) with a fixed number of parameters are used to process spatial data such as images. In particular, the weight-sharing mechanism reduces the number of parameters, keeping the computational cost low while retaining spatial invariance. In general, there are two main types of convolution operations: basic convolution and transposed convolution.
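To make the savings from weight sharing concrete, here is a small back-of-the-envelope comparison in Python (the layer sizes are hypothetical, chosen only for illustration): a fully connected layer mapping a flattened image to a same-resolution output needs one weight per input-output pair, whereas a convolutional layer reuses the same $3 \times 3$ kernels at every spatial location.

```python
# Hypothetical sizes for illustration: a 32x32x3 input and 64 output channels.
H, W, C_in, C_out = 32, 32, 3, 64
k = 3  # kernel size

# Fully connected layer from the flattened input to a same-resolution output:
fc_params = (H * W * C_in) * (H * W * C_out)

# Convolutional layer with shared 3x3 kernels (one bias per output channel):
conv_params = (k * k * C_in + 1) * C_out

print(f"fully connected: {fc_params:,} parameters")  # 201,326,592
print(f"convolutional:   {conv_params:,} parameters")  # 1,792
```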

## **2.1 Basic convolution and dilated kernels**

As shown on the left in **Figure 2**, a convolution operation is essentially a linear model applied to a local spatial region of the input. Specifically, it computes the sum of the elementwise products between the local input patch and the kernel (usually plus a bias), and outputs a single value after an activation function. Each kernel slides over all spatial locations in the input with a fixed step (the stride), yielding a 1-channel feature map. Note that a convolutional layer generally contains many kernels, and all of the output feature maps are concatenated together: e.g., if the number of kernels used in this convolutional layer is $D_O$, we obtain a feature map $O \in \mathbb{R}^{3 \times 3 \times D_O}$.
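The following is a minimal NumPy sketch of the basic convolution just described (the function and variable names are ours, and the activation function is omitted for brevity); it reproduces the left-panel setting of Figure 2:

```python
import numpy as np

def conv2d(inp, kernels, bias, stride=1):
    """Naive basic convolution: `inp` is H x W x D_I, `kernels` is
    k x k x D_I x D_O, `bias` has D_O entries. Returns one output
    channel per kernel (no padding)."""
    H, W, _ = inp.shape
    k, _, _, D_O = kernels.shape
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    out = np.empty((H_out, W_out, D_O))
    for i in range(H_out):
        for j in range(W_out):
            # Local patch covered by the kernel at this spatial location.
            patch = inp[i * stride:i * stride + k,
                        j * stride:j * stride + k, :]
            for d in range(D_O):
                # Sum of elementwise products plus a bias: one value per kernel.
                out[i, j, d] = np.sum(patch * kernels[..., d]) + bias[d]
    return out

# The setting of Figure 2 (left): a 5x5xD_I input, a single 3x3xD_I kernel,
# stride 1 and no padding, giving a 3x3 one-channel feature map.
D_I = 4
I = np.random.randn(5, 5, D_I)
K = np.random.randn(3, 3, D_I, 1)
O = conv2d(I, K, bias=np.zeros(1))
print(O.shape)  # (3, 3, 1)
```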

While the kernel size of $3 \times 3$ is widely used in current CNNs, we may need larger receptive fields in the input to observe more information during each convolution operation. However, if we directly increase the size of the kernels such as

#### **Figure 2.**

*Left: A demonstration of basic 2D convolution with a $3 \times 3 \times D_I$ kernel (stride = 1, padding = 0), where $I \in \mathbb{R}^{5 \times 5 \times D_I}$ is the spatial input and $O \in \mathbb{R}^{3 \times 3}$ is the 1-channel output feature map. Right: A dilated kernel for increasing the receptive field in the input, where the empty space between elements represents zeros.*
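One way to realize the dilated kernel on the right of Figure 2 is to insert zeros between the kernel elements, which enlarges the receptive field without adding parameters. A minimal NumPy sketch (the dilation rate of 2 is an illustrative choice, and `dilate_kernel` is our own helper):

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between kernel elements along both
    spatial axes, as in the right panel of Figure 2."""
    k = kernel.shape[0]
    size = rate * (k - 1) + 1  # effective receptive field
    dilated = np.zeros((size, size) + kernel.shape[2:])
    dilated[::rate, ::rate] = kernel
    return dilated

K = np.random.randn(3, 3)
K_dilated = dilate_kernel(K, rate=2)
print(K_dilated.shape)  # (5, 5): a 5x5 receptive field, still only 9 parameters
```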
