**2. Convolutional neural network**

Neural networks have been successfully applied to many real-world problems. The most general type of neural network is the Multilayer Perceptron (MLP). While MLPs can effectively classify small images, they are impractical for large ones: because every input pixel is connected to every neuron of the first hidden layer, an MLP requires a huge number of weights (on the order of millions). Not only are MLPs computationally expensive to train, in both time and memory, but the large number of weights also gives them high variance [22, 23].
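To make the weight count concrete, the following sketch tallies the parameters of a fully connected layer on a modest-size image; the input resolution (224 × 224 RGB) and layer widths are illustrative assumptions, not figures from the text:

```python
# Illustrative parameter count for an MLP on a 224x224 RGB image.
# The image size and layer widths below are assumed for the example.
input_units = 224 * 224 * 3      # flattened image: 150,528 values
hidden_units = 1000              # a single, fairly small hidden layer
classes = 10

weights_hidden = input_units * hidden_units   # input -> hidden
weights_output = hidden_units * classes       # hidden -> output
total = weights_hidden + weights_output
print(total)  # over 150 million weights for just one hidden layer
```

Even this small configuration already exceeds 150 million weights, which is why fully connected architectures scale poorly to large images.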

Convolutional Neural Networks (CNNs) have been at the heart of computer vision in recent years. The key idea behind CNNs is to extract local features from an input (usually an image) at the lower layers and combine them into more complex features at the higher layers [24, 25]. CNNs are very good at extracting patterns from an input image, such as lines, gradients, circles, or even eyes and faces, and it is this property that makes them so powerful for computer vision. Unlike earlier computer vision algorithms, CNNs can operate directly on a raw image and need no preprocessing. In the medical field, CNNs have been used to improve image quality in low-light images from high-speed video endoscopy [26], and have also been applied to recognizing the nature of pulmonary nodules in CT images and to identifying pediatric pneumonia in chest X-ray images [27].

A CNN comprises several layers in which each neuron of a higher layer connects to a subset of neurons in the previous, lower layer. This allows the receptive field of a neuron in a higher layer to cover a greater part of the image than that of a neuron in a lower layer, so the higher layer is capable of learning more abstract features of the image by considering the spatial relationships between different receptive fields. Because of this sparse connectivity, CNNs have far fewer weights than MLPs, which in turn reduces the variance.

Like MLPs, CNNs use fully connected (FC) layers and non-linearities, but they introduce two new types of layers: convolutional and pooling layers. A convolutional layer takes a W × H × D dimensional input I and convolves it with a w × h × D dimensional filter (or kernel) G. The weights of the filter can be hand-designed, but in the context of machine learning they are tuned automatically, just as the weights of an FC layer are.

Whereas convolutional layers reduce the variance by reducing the number of weights, pooling layers directly reduce the number of neurons. The sole purpose of a pooling layer is to downsample (also known as pool, gather, or consolidate) the previous layer, by sliding a fixed window across it and choosing one value that effectively "represents" all of the units captured by the window. There are two common implementations of pooling: in max-pooling, the representative value is the largest of all the units in the window, while in average-pooling, it is the average of all the units in the window. In practice, the pooling window is strided across the image with a stride equal to the window size. Unlike fully connected and convolutional layers, pooling layers involve no weights.
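The two operations above can be sketched in NumPy as follows. This is a minimal single-channel illustration (D = 1); the input values and the hand-designed vertical-edge filter are made up for the example, whereas in a trained CNN the filter weights would be learned:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding): slide the w x h kernel over
    the W x H image and take the elementwise product-sum at each
    position, producing a (H-h+1) x (W-w+1) feature map."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Max-pooling with stride equal to the window size, as in the
    text: each window is represented by its largest unit."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            out[i, j] = window.max()
    return out

# Toy 6x6 input and a 3x3 vertical-edge filter (hand-designed here
# purely for illustration).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

features = conv2d(image, kernel)   # shape (4, 4)
pooled = max_pool(features)        # shape (2, 2)
print(features.shape, pooled.shape)
```

Note that neither function carries any trainable parameters of its own beyond the kernel passed in, and the pooling step has none at all, which mirrors the point made in the text.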
