#### 1. Introduction

One of the central goals of computer vision is object recognition and scene understanding. Humans possess a remarkable ability to (1) parse an image simply by looking at it, (2) analyze an image and separate all the components present in it, and (3) recognize a new object they have never seen before based on observing a set of related objects. The Holy Grail of a computer vision system is to perform the same tasks as humans can, across huge variations in pose, appearance, viewpoint, illumination, occlusion, etc. This is obviously a very difficult computational problem, central to the eventual goal of building intelligent machines. An image understanding system can be roughly divided into four subproblems, i.e., image classification, object detection, semantic segmentation, and semantic instance segmentation, according to its level of complexity.


4. Instance segmentation: assign a label as well as a pixel-level segmentation to each object instance in the image. This subproblem is a higher-level form of semantic segmentation and can be considered a combination of the second and third subproblems.

This chapter focuses on the problem of semantic instance segmentation by reformulating classic LS contour evolution in a new learnable deep framework, recurrent level set (RLS), which employs a gated recurrent unit under the energy minimization of a LS functional. In the proposed RLS framework, the curve deformation process of the classic LS acts as a hidden-state evolution procedure and is updated by minimizing an energy functional composed of fitting forces and contour length. By sharing convolutional features and collaborating with region proposal detection in a fully end-to-end trainable framework, we extend RLS to contextual RLS (CRLS) to address semantic instance segmentation in the wild.

This book chapter is organized as follows: Section 2 introduces the basic concepts of both LS and deep learning, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The next two sections focus on the contextual recurrent level set for object instance segmentation. The last section presents experimental results conducted on the PASCAL VOC and COCO datasets.

#### 2. Background

This section starts by introducing convolutional neural networks (CNNs) and recurrent neural networks (RNNs), specialized artificial neural network (ANN) architectures particularly useful for computer vision problems. Then, we introduce active contours (AC) and the classic variational level set (LS) mechanism, which have been used in image segmentation.

#### 2.1 Convolutional neural networks

Convolutional neural networks (CNNs) [1–4] are a special case of fully connected multi-layer perceptrons that rely on weight sharing to process data with a grid-like topology, e.g., an image. In CNNs, the spatial correlation of the signal is used to constrain the architecture in a more sensible way. There are two key properties in their architecture, i.e., spatially shared weights and spatial pooling. These properties are inspired by the biological visual system and make CNNs extremely useful for image applications. In such networks, the pooling layers aim at producing features that are shift-invariant, whereas the convolution layers learn visual representations. Since 2012, one of the most notable results in deep learning has been the use of convolutional neural networks to obtain a remarkable improvement in object recognition in the ImageNet classification challenge.

A typical convolutional network has multiple stages. Each stage contains one or more convolution, activation, and pooling layers. The output of each stage is a set of feature maps, each a 2D array. In its most general form, a convolutional network can be written as:

$$\begin{aligned} \mathbf{h}^0 &= \mathbf{x} \\ \mathbf{h}^l &= pool^l \left( \sigma_l \left( \mathbf{w}^l \mathbf{h}^{l-1} + \mathbf{b}^l \right) \right), \forall l \in \{1, 2, \dots, L\} \\ \mathbf{o} &= \mathbf{h}^L = f(\mathbf{x}, \theta) \end{aligned} \tag{1}$$

where $\mathbf{w}^l, \mathbf{b}^l$ are trainable parameters, as in MLPs, at layer $l$. $\mathbf{x} \in \mathbb{R}^{c \times h \times w}$ is vectorized from an input image with $c$ color channels, image height $h$, and image width $w$. $\mathbf{o} \in \mathbb{R}^{n \times h^0 \times w^0}$ is vectorized from an array of dimension $h^0 \times w^0$ of output vectors (each of dimension $n$). $pool^l$ is an (optional) pooling function at layer $l$.
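As a minimal sketch of one stage of Eq. (1), the following NumPy code applies a convolution, a point-wise non-linearity, and a pooling function to a single-channel input. The kernel, bias, and pooling size here are illustrative choices, not values from the chapter:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2D cross-correlation of a single-channel image x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def relu(x):
    """Point-wise non-linearity sigma_l."""
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling (input dims assumed divisible by k)."""
    H, W = x.shape
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))

# One stage: h^l = pool^l( sigma_l( w^l h^{l-1} + b^l ) )
x = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel (assumed)
h = max_pool(relu(conv2d(x, w, b=0.0)))        # 6x6 -> 4x4 -> 2x2
print(h.shape)  # (2, 2)
```

Stacking several such stages and flattening the final feature maps into a classifier yields the function $f(\mathbf{x}, \theta)$ of Eq. (1).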

The main difference between multi-layer perceptrons (MLPs) and CNNs lies in the parameter matrices $\mathbf{w}^l$. In MLPs, the matrices can take any general form, while in CNNs these matrices are constrained to be Toeplitz matrices: each row is a shifted copy of the kernel, so that the spatial correlation operation can be implemented using matrix algebra. Furthermore, these matrices are very sparse, because the image size is much bigger than the kernel size. Thus, the hidden unit $\mathbf{h}^l$ can be expressed as a discrete-time convolution between the kernel $\mathbf{w}^l$ and the previous hidden unit $\mathbf{h}^{l-1}$. A point-wise non-linearity and pooling are then applied to the hidden unit. There are numerous variants of CNN architectures in the literature; however, their basic components are very similar, consisting of convolutional, pooling, activation, and fully connected layers.
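The Toeplitz view can be demonstrated concretely. In this hedged 1D sketch (the signal and kernel are made-up toy values), each row of the matrix is a shifted copy of the kernel, and multiplying by it reproduces the result of a direct correlation:

```python
import numpy as np

def conv_as_toeplitz(kernel, n):
    """Build the Toeplitz matrix implementing 'valid' 1D correlation of a
    length-n signal with the given kernel (dense here for clarity; in
    practice it is very sparse since n >> len(kernel))."""
    k = len(kernel)
    T = np.zeros((n - k + 1, n))
    for row in range(n - k + 1):
        T[row, row:row + k] = kernel  # shifted copy of the kernel per row
    return T

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy signal (assumed)
kernel = np.array([1.0, 0.0, -1.0])           # toy difference kernel (assumed)

T = conv_as_toeplitz(kernel, len(signal))
direct = np.correlate(signal, kernel, mode="valid")
print(np.allclose(T @ signal, direct))  # True
```

For 2D images the same idea holds with doubly block Toeplitz matrices, which is why the constrained matrices $\mathbf{w}^l$ in a CNN implement exactly the convolution of the kernel with the previous hidden unit.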

