*6.2.3 Image captioning*

One of the exciting applications achieved with CNNs is image captioning, i.e., describing the content of an input image in natural language. The basic idea is as follows: first, a pre-trained CNN encoder extracts high-level features from the input image; second, these features are fed into a recurrent neural network that generates a sentence. For example, Li et al. [51] proposed a fully convolutional localization network for extracting representations from images, with an LSTM decoder for generating the captions. Recently, the attention mechanism has been widely used for sequence processing and has brought significant improvements in tasks such as machine translation; Huang et al. [52] introduce an encoder-decoder framework in which an attention module is used in both the encoder and the decoder. Specifically, the encoder is a CNN-based network.
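The two-stage pipeline above (CNN encoder, recurrent decoder) can be sketched as a minimal, runnable toy. Everything here is a hypothetical stand-in: the vocabulary, the dimensions, and the randomly initialized weights are for illustration only; a real system would use a pre-trained CNN (e.g., a ResNet) as `encode` and a trained LSTM, possibly with attention, as the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and dimensions (illustration only).
VOCAB = ["<start>", "<end>", "a", "bird", "red", "with", "feathers"]
FEAT_DIM, HID_DIM = 16, 8

# Hypothetical decoder parameters, randomly initialized for illustration;
# in practice these would be learned.
W_feat = rng.normal(size=(FEAT_DIM, HID_DIM))   # image features -> initial hidden state
W_hh = rng.normal(size=(HID_DIM, HID_DIM))      # hidden -> hidden recurrence
W_xh = rng.normal(size=(len(VOCAB), HID_DIM))   # previous-token embedding -> hidden
W_out = rng.normal(size=(HID_DIM, len(VOCAB)))  # hidden -> vocabulary logits

def encode(image):
    """Stand-in for a pre-trained CNN encoder: map an image to a feature vector."""
    return image.reshape(-1)[:FEAT_DIM]

def caption(image, max_len=5):
    """Greedy decoding: the image features initialize the hidden state,
    then the arg-max token is emitted at every step until <end>."""
    h = np.tanh(encode(image) @ W_feat)
    token, words = VOCAB.index("<start>"), []
    for _ in range(max_len):
        h = np.tanh(h @ W_hh + W_xh[token])
        token = int(np.argmax(h @ W_out))
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

fake_image = rng.normal(size=(4, 4, 3))  # hypothetical 4x4 RGB "image"
print(caption(fake_image))
```

With random weights the emitted tokens are meaningless, but the control flow — encode once, then decode token by token conditioned on the previous token and the hidden state — mirrors the captioning pipeline described above.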

**Image to Image:** The task of image-to-image translation is to learn a mapping G(·): X → Y from a source image domain X to a target image domain Y. E.g., Isola et al. [59] apply conditional GANs to image-to-image tasks and achieve impressive results, such as mapping sketches to photographs and black-and-white photographs to color. Another representative work is CycleGAN [60], which can transfer the style of one image onto another.

**Text to Image:** One of the interesting applications of GANs is to synthesize a realistic image from a text description, e.g., "There is a little bird with red feathers." Some representative works include: Reed et al. [61], who introduce a text-conditional convolutional GAN, and Zhang et al. [62], who apply StackGANs to synthesize high-quality images from text.

**Super Resolution:** The task of super-resolution is to map a low-resolution image to a high-resolution one. In 2017, Ledig et al. [63] proposed a framework named SRGAN, which is regarded as the first work able to generate photo-realistic images for 4× upscaling factors. Specifically, the loss function used in their framework consists of an adversarial loss and a content loss. In particular, the content loss helps retain the original content of the input images.

*6.3.3 Image editing*

Image editing is regarded as a fundamental problem in computer vision. The emergence of GANs has also brought new opportunities for this task. In the past few years, GANs have been developed for image editing tasks such as image inpainting and image matting.

**Image inpainting:** The task of image inpainting is to recover an arbitrary damaged region in an image. Specifically, an algorithm can learn the content and style of the image and generate the damaged part based on the input image, as in [64], which introduces a context encoder for natural image inpainting, and in [65, 66], whose works mainly focus on human face completion.

**Image matting:** The goal of image matting is to separate the foreground object from the background in an image. This technique can be used for a wide range of applications such as photo editing and video post-production. Representative works include [67, 68].

**7. Summary and future trends**

In this research, we have conducted a hierarchically structured survey of the main components in CNNs from the low level to the high level, namely convolution operations, convolutional layers, architecture design, and loss functions. In addition to introducing the recent advances in these aspects of CNNs, we have also discussed advanced applications based on three types of architectures, including encoder, encoder-decoder, and GANs, from which we can see that CNNs have made numerous breakthroughs and achieved state-of-the-art results in computer vision, natural language processing, and speech recognition, especially the impressive results based on GANs.

From the above analyses, we can conclude that current development in CNNs mainly focuses on designing new architectures and loss functions, because these two aspects are the core parts when applying CNNs to various types of tasks. On the other hand, the fundamental ideas behind these various applications are very similar, as summarized above.

However, there are still many disadvantages in current deep learning. The first problem is the requirement for large-scale datasets; in particular, constructing a labeled dataset is very time-consuming and expensive, such as in the medical

*Advances in Convolutional Neural Networks DOI: http://dx.doi.org/10.5772/intechopen.93512*
