2.2 Recurrent neural networks

Recurrent Neural Networks (RNNs) are an extremely powerful family of sequence models and were introduced in the early 1990s [6]. A typical RNN contains three parts, namely the sequential input data (x_t), the hidden state (h_t), and the sequential output data (o_t).

An RNN contains a sequence of elements, and each element performs a similar task. The input of the next element depends on the output of the current element. An RNN can have one or many inputs and one or many outputs, depending on the application. Some examples of RNN architectures are given in Figure 1.

In RNNs, the activation of the hidden state at timestep t is computed as a function F of the current input symbol x_t and the previous hidden state h_{t-1}. The output at time t is calculated as a function G of the current hidden state h_t as follows:

$$\begin{aligned} \mathbf{h}\_t &= \mathbf{F}(\mathbf{U}\mathbf{x}\_t + \mathbf{W}\mathbf{h}\_{t-1}) \\ \mathbf{o}\_t &= \mathbf{G}(\mathbf{V}\mathbf{h}\_t) \end{aligned} \tag{2}$$

where W is the state-to-state recurrent weight matrix, U is the input-to-hidden weight matrix, V is the hidden-to-output weight matrix. F is usually a logistic sigmoid function or a hyperbolic tangent function and G is defined as a softmax function.

Most work on RNNs trains the parameter set (U, V, W) with backpropagation through time (BPTT) [7], which propagates the error backward through time. In classic backpropagation, the error or loss function is defined as:

$$E(\mathbf{o}, \mathbf{y}) = \sum\_{t} \left\| \mathbf{o}\_{t} - \mathbf{y}\_{t} \right\|^{2} \tag{3}$$

where o_t is the prediction and y_t is the labeled ground truth.

For a specific weight W, the gradient descent update rule is defined as W_new = W − γ ∂E/∂W, where γ is the learning rate. In the RNN model, the gradients of the error with respect to the parameters U, V, and W are computed using the chain rule of differentiation, and the parameters are updated by stochastic gradient descent (SGD).
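As a concrete illustration of Eq. (2), the forward pass of a vanilla RNN can be sketched in a few lines of NumPy. The dimensions and the random parameters U, W, V below are illustrative assumptions, not values from this chapter:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, h0):
    """Roll a vanilla RNN over the sequence xs.

    h_t = tanh(U x_t + W h_{t-1})   (Eq. 2, F = tanh)
    o_t = softmax(V h_t)            (Eq. 2, G = softmax)
    """
    h = h0
    hs, os = [], []
    for x in xs:
        h = np.tanh(U @ x + W @ h)   # hidden state update
        o = softmax(V @ h)           # output distribution
        hs.append(h)
        os.append(o)
    return hs, os

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 4, 6   # illustrative sizes
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
V = rng.normal(scale=0.1, size=(d_out, d_h))
xs = [rng.normal(size=d_in) for _ in range(T)]

hs, os = rnn_forward(xs, U, W, V, np.zeros(d_h))
# each o_t is a valid probability distribution over d_out classes
assert all(abs(o.sum() - 1.0) < 1e-9 for o in os)
```

In a full implementation, BPTT would differentiate the loss of Eq. (3) through this loop to obtain the gradients w.r.t. U, V, and W.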

There are two popular improvements of RNNs that address the exploding and vanishing gradient problems when dealing with long-term dependencies.

• Long short-term memory: to address the issue of learning long-term dependencies, Hochreiter and Schmidhuber [8] proposed Long Short-Term Memory (LSTM), which maintains a separate memory cell that updates and exposes its content only when deemed necessary. Similar to RNNs, LSTMs are defined as a chain-like structure, but the repeating module is different: instead of a single neural network layer (a single tanh layer), LSTMs have four neural network layers in each module, and they interact in a very special way. The key to LSTMs is the cell state (C_t), which has the ability to remove or add information by optionally letting information through gates. Each gate is composed of a sigmoid layer and a pointwise multiplication operation.

Figure 1. Example of RNNs architecture.

Recurrent Level Set Networks for Instance Segmentation DOI: http://dx.doi.org/10.5772/intechopen.84675

• Gated recurrent units: to let each recurrent unit adaptively capture dependencies of different time scales, [9] proposed the gated recurrent unit (GRU) as an extension of the RNN. Similar to an LSTM unit, a GRU contains gating units that control the flow of information inside the unit; however, a GRU does not have separate memory cells. By using leaky integration, a GRU fully exposes its memory content at each timestep and balances between the previous and the new memory content; notably, its adaptive time constant is controlled by the update gate. In traditional RNNs, the content of the activation unit is always replaced by a new value calculated from the current input and the previous hidden state. Instead of replacing, a GRU just adds the new content on top and keeps the existing content. This addition provides shortcut paths that bypass multiple temporal steps. Thanks to these shortcuts, the error is back-propagated easily without vanishing too quickly as a result of passing through multiple bounded nonlinearities, which reduces the difficulty due to vanishing gradients. Furthermore, using the addition, each unit is able to remember the existence of a specific feature in the input stream for a long series of steps.
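The GRU update described above can be sketched numerically. The dimensions, the random weights, and the helper names `sigmoid`/`gru_step` below are illustrative assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, P):
    """One GRU step: gates control how much old content is kept.

    z: update gate, r: reset gate, h_tilde: candidate content.
    """
    z = sigmoid(P["Uz"] @ x + P["Wz"] @ h_prev + P["bz"])
    r = sigmoid(P["Ur"] @ x + P["Wr"] @ h_prev + P["br"])
    h_tilde = np.tanh(P["Uh"] @ x + P["Wh"] @ (r * h_prev) + P["bh"])
    # additive blend: add new content while keeping part of the old state
    return z * h_tilde + (1.0 - z) * h_prev

rng = np.random.default_rng(1)
d_in, d_h = 4, 3   # illustrative sizes
P = {k: rng.normal(scale=0.1, size=(d_h, d_in if k[0] == "U" else d_h))
     for k in ("Uz", "Wz", "Ur", "Wr", "Uh", "Wh")}
P.update({b: np.zeros(d_h) for b in ("bz", "br", "bh")})

h = np.zeros(d_h)
for _ in range(5):
    h = gru_step(rng.normal(size=d_in), h, P)
assert h.shape == (d_h,)
```

The last line of `gru_step` is the interpolation between old and new content that gives GRUs their gradient shortcut paths.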

#### 2.3 Variational level set

Variational level set is an implicit implementation of the active contour (AC). The main idea of applying AC to image segmentation is to start with an initial (guessed) contour C represented as a closed curve. The curve is then iteratively adjusted by shrinking or expanding under image-driven forces until it reaches the boundaries of the desired objects. The entire process is called contour evolution or curve evolution, denoted as ∂C/∂t.

There are two main approaches in active contours: snakes and level sets. Snakes explicitly move predefined snake points based on an energy minimization scheme, while level set approaches move contours implicitly as a particular level of a function.

Level set (LS)-based or implicit active contour models have provided more flexibility and convenience for the implementation of active contours; thus, they have been used in a variety of image processing and computer vision tasks. The basic idea of the implicit active contour is to represent the initial curve C implicitly within a higher dimensional function, called the level set function ϕ(x, y): Ω → ℝ, such that C = {(x, y): ϕ(x, y) = 0}, ∀(x, y) ∈ Ω, where Ω denotes the entire image plane.

A zero level set function is used to formulate the contour, i.e., the contour evolution is equivalent to the evolution of the level set function: ∂C/∂t = ∂ϕ(x, y)/∂t. The reason for using the zero level set is that a contour can be defined as the border between a positive area and a negative area. Thus, everything on the zero area belongs to the contour, and the contour is identified by the signed distance function as follows:

$$\phi(\mathbf{x}) = \begin{cases} d(\mathbf{x}, \mathbf{C}) & \text{if } \mathbf{x} \text{ is inside } \mathbf{C} \\ 0 & \text{if } \mathbf{x} \text{ is on } \mathbf{C} \\ -d(\mathbf{x}, \mathbf{C}) & \text{if } \mathbf{x} \text{ is outside } \mathbf{C} \end{cases} \tag{4}$$

where d(x, C) denotes the distance from an arbitrary position x to the curve C.
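As a toy illustration of the signed distance function in Eq. (4), a circular contour admits the closed form below. The grid size and circle parameters are illustrative assumptions:

```python
import numpy as np

def circle_sdf(shape, center, radius):
    """Signed distance function for a circular contour C (Eq. 4):
    positive inside C, zero on C, negative outside C."""
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    dist_to_center = np.hypot(xs - center[1], ys - center[0])
    return radius - dist_to_center  # = d(x, C) inside, -d(x, C) outside

phi = circle_sdf((64, 64), center=(32, 32), radius=10)
assert phi[32, 32] > 0          # center is inside the contour
assert abs(phi[32, 42]) < 1e-9  # a point exactly on the circle: phi = 0
assert phi[0, 0] < 0            # corner is outside
```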

The computation is performed on the same dimension as the image plane Ω, therefore, the computational cost of level set methods is high and the convergence speed is quite slow.

Under the scenario of image segmentation, we consider an image in the 2D space Ω. The level set is to find the boundary C of an open set ω ∈ Ω, which is defined as C = ∂ω. In the LS framework, the boundary C can be represented by the zero level set of ϕ as follows:

$$\forall (x, y) \in \Omega \begin{cases} C &= \{(x, y) : \phi(x, y) = 0 \} \\ \text{inside}(C) &= \{(x, y) : \phi(x, y) > 0 \} \\ \text{outside}(C) &= \{(x, y) : \phi(x, y) < 0 \} \end{cases} \tag{5}$$

In the image segmentation problem, the entire domain of an image I is represented by Ω, which contains three regions corresponding to inside the contour (ϕ > 0), on the contour (ϕ = 0), and outside the contour (ϕ < 0). Clearly, the zero LS function ϕ partitions the region Ω into two regions: the region inside ω (foreground), denoted as inside(C), and the region outside ω (background), denoted as outside(C). Using the zero level set function, the length of the contour C and the area inside the contour C are defined as follows:

$$\begin{aligned} \text{Length}(\mathbf{C}) &= \int\_{\Omega} |\nabla H(\phi(\mathbf{x}, \mathbf{y}))| d\mathbf{x} d\mathbf{y} = \int\_{\Omega} \delta(\phi(\mathbf{x}, \mathbf{y})) |\nabla \phi(\mathbf{x}, \mathbf{y})| d\mathbf{x} d\mathbf{y} \\ \text{Area}(\mathbf{C}) &= \int\_{\Omega} H(\phi(\mathbf{x}, \mathbf{y})) d\mathbf{x} d\mathbf{y} \end{aligned} \tag{6}$$

where H(•) is the Heaviside function and δ(•) is the Dirac delta function.

Typically, LS-based segmentation methods start with an initial level set ϕ_0 and a given image I. The LS updating process is performed via gradient descent by minimizing an energy function defined on the difference of image features, such as color and texture, between foreground and background. The fitting term in the LS model is defined by the inside-contour energy (E_1) and the outside-contour energy (E_2):

$$E = E\_1 + E\_2 = \int\_{inside\,C} \left( I\_{(x,y)} - c\_1 \right)^2 d\mathbf{x} dy + \int\_{outside\,C} \left( I\_{(x,y)} - c\_2 \right)^2 d\mathbf{x} dy \tag{7}$$

where c_1 and c_2 are the average intensities inside and outside the contour C, respectively.

One of the most popular region-based active contour models was proposed by Chan and Vese (CV) [10]. In this model, the boundaries are not defined by gradients, and the curve evolution is based on the general Mumford-Shah (MS) [11] formulation of image segmentation, as shown in Eq. (8).

$$E = \int\_{\Omega} |\mathbf{I} - u|^2 d\mathbf{x} dy + \int\_{\Omega/\mathcal{C}} |\nabla u|^2 d\mathbf{x} dy + \nu Length(\mathcal{C}) \tag{8}$$

CV's model is an alternative form of the MS model that restricts the solution to piecewise constant intensities. It has successfully segmented an image into two regions, each having a distinct mean pixel intensity, by minimizing the following energy functional:

$$E(c\_1, c\_2, \phi) = \mu \text{Area}(\omega\_1) + \nu \text{Length}(\mathbf{C}) + \lambda\_1 \int\_{\omega\_1} \left| \mathbf{I}(\mathbf{x}, y) - c\_1 \right|^2 d\mathbf{x} dy + \lambda\_2 \int\_{\omega\_2} \left| \mathbf{I}(\mathbf{x}, y) - c\_2 \right|^2 d\mathbf{x} dy \tag{9}$$


where c_1 and c_2 are two constants. The parameters μ, ν, λ_1, λ_2 are positive and usually fixed as λ_1 = λ_2 = 1 and μ = 0, so the first term in Eq. (9) can be ignored. The energy functional is then rewritten as follows:

$$\begin{split} E(c\_1, c\_2, \phi) &= \mu \int\_{\Omega} H(\phi(\mathbf{x}, \mathbf{y})) d\mathbf{x} d\mathbf{y} + \nu \int\_{\Omega} \delta(\phi(\mathbf{x}, \mathbf{y})) |\nabla \phi(\mathbf{x}, \mathbf{y})| d\mathbf{x} d\mathbf{y} \\ &+ \lambda\_1 \int\_{\omega\_1} |\mathbf{I}(\mathbf{x}, \mathbf{y}) - c\_1|^2 d\mathbf{x} d\mathbf{y} + \lambda\_2 \int\_{\omega\_2} |\mathbf{I}(\mathbf{x}, \mathbf{y}) - c\_2|^2 d\mathbf{x} d\mathbf{y} \end{split} \tag{10}$$

For numerical approximations, the δ function needs a regularizing term for smoothing. In most cases, the Heaviside function H and Dirac delta function δ are defined as in Eq. (11) and Eq. (12), respectively.

$$H\_{\varepsilon}(\mathbf{x}) = \frac{1}{2} \left( 1 + \frac{2}{\pi} \arctan\left(\frac{\mathbf{x}}{\varepsilon}\right) \right) \tag{11}$$

$$\delta\_{\varepsilon}(\mathbf{x}) = H\_{\varepsilon}'(\mathbf{x}) = \frac{1}{\pi} \frac{\varepsilon}{\varepsilon^2 + \mathbf{x}^2} \tag{12}$$
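The regularized pair H_ε and δ_ε of Eqs. (11) and (12) is straightforward to implement. The sketch below, with illustrative inputs, checks the limiting behavior numerically:

```python
import numpy as np

def heaviside_eps(x, eps=1.0):
    # H_eps(x) = 0.5 * (1 + (2/pi) * arctan(x / eps))   (Eq. 11)
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(x / eps))

def dirac_eps(x, eps=1.0):
    # delta_eps(x) = H'_eps(x) = (1/pi) * eps / (eps**2 + x**2)   (Eq. 12)
    return (1.0 / np.pi) * eps / (eps**2 + x**2)

x = np.linspace(-50, 50, 1001)
assert abs(heaviside_eps(0.0) - 0.5) < 1e-12   # H_eps(0) = 1/2
assert heaviside_eps(50.0) > 0.99              # -> 1 far inside
assert heaviside_eps(-50.0) < 0.01             # -> 0 far outside
# numerical check: delta_eps integrates to roughly 1 on a wide interval
dx = x[1] - x[0]
assert abs(dirac_eps(x).sum() * dx - 1.0) < 0.02
```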

As ε → 0, δ_ε → δ and H_ε → H. Using the Heaviside function H, Eq. (10) becomes Eq. (13).

$$\begin{split} E(\mathbf{c}\_1, \mathbf{c}\_2, \boldsymbol{\phi}) &= \mu \int\_{\Omega} H(\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})) d\mathbf{x} d\mathbf{y} + \nu \int\_{\Omega} \delta(\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})) |\nabla \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})| d\mathbf{x} d\mathbf{y} \\ &+ \lambda\_1 \int\_{\Omega} |\mathbf{I}(\mathbf{x}, \mathbf{y}) - \mathbf{c}\_1|^2 H(\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})) d\mathbf{x} d\mathbf{y} + \lambda\_2 \int\_{\Omega} |\mathbf{I}(\mathbf{x}, \mathbf{y}) - \mathbf{c}\_2|^2 (\mathbf{1} - H(\boldsymbol{\phi}(\mathbf{x}, \mathbf{y}))) d\mathbf{x} d\mathbf{y} \end{split} \tag{13}$$

In the implementation, ε = 1 is chosen. For fixed c_1 and c_2, the gradient descent equation with respect to ϕ is:

$$\frac{\partial \phi(\mathbf{x}, \mathbf{y})}{\partial t} = \delta\_\varepsilon(\phi(\mathbf{x}, \mathbf{y})) \left[ \nu \kappa(\phi(\mathbf{x}, \mathbf{y})) - \mu - \lambda\_1 (\mathbf{I}(\mathbf{x}, \mathbf{y}) - c\_1)^2 + \lambda\_2 (\mathbf{I}(\mathbf{x}, \mathbf{y}) - c\_2)^2 \right] \tag{14}$$

where δ_ε is the regularized form of the Dirac delta function, and c_1, c_2 are the mean intensities inside the contour (ω_in) and outside the contour (ω_out), respectively. The curvature κ is given by:

$$\kappa(\phi(\mathbf{x}, \mathbf{y})) = -\text{div}\left(\frac{\nabla\phi}{|\nabla\phi|}\right) = -\frac{\phi\_{xx}\phi\_{y}^2 - 2\phi\_{x}\phi\_{y}\phi\_{xy} + \phi\_{yy}\phi\_{x}^2}{\left(\phi\_{x}^2 + \phi\_{y}^2\right)^{1.5}}\tag{15}$$

where ϕ_x, ϕ_y and ϕ_xx, ϕ_xy, ϕ_yy are the first- and second-order partial derivatives of ϕ with respect to the x and y directions. For fixed ϕ, minimizing the energy with respect to c_1 and c_2 gives:

$$c\_1 = \frac{\sum\_{\mathbf{x}, \mathbf{y}} \mathbf{I}(\mathbf{x}, \mathbf{y}) H(\phi(\mathbf{x}, \mathbf{y}))}{\sum\_{\mathbf{x}, \mathbf{y}} H(\phi(\mathbf{x}, \mathbf{y}))} \\ \text{and } c\_2 = \frac{\sum\_{\mathbf{x}, \mathbf{y}} \mathbf{I}(\mathbf{x}, \mathbf{y}) (\mathbf{1} - H(\phi(\mathbf{x}, \mathbf{y})))}{\sum\_{\mathbf{x}, \mathbf{y}} (\mathbf{1} - H(\phi(\mathbf{x}, \mathbf{y})))} \tag{16}$$

Herein, we use the notation ϕ_t to indicate ϕ at the t-th iteration in order to distinguish ϕ at different iterations. Under this formulation, the curve evolution appears as a time-series process, which gives a better view for reformulating LS. From this point, we redefine the curve updating in a time-series form for the LS function ϕ_t as in Eq. (17).

$$
\phi\_{t+1} = \phi\_t + \eta \frac{\partial \phi\_t}{\partial t} \tag{17}
$$

The LS at time t + 1 depends on the previous LS at time t and the curve evolution ∂ϕ_t/∂t, with a learning rate η.
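Putting Eqs. (14), (16), and (17) together, one Chan-Vese update step can be sketched as follows. For brevity this sketch assumes μ = ν = 0 (no area and length terms), so it is a simplified illustration rather than the full model; the toy image and initialization are also assumptions:

```python
import numpy as np

def heaviside(phi, eps=1.0):
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def dirac(phi, eps=1.0):
    return (1.0 / np.pi) * eps / (eps**2 + phi**2)

def chan_vese_step(I, phi, lam1=1.0, lam2=1.0, eta=0.5, eps=1.0):
    """One gradient-descent step of Eqs. (14), (16), (17)
    with mu = nu = 0 (area and length terms dropped) for brevity."""
    H = heaviside(phi, eps)
    c1 = (I * H).sum() / (H.sum() + 1e-8)              # mean inside  (Eq. 16)
    c2 = (I * (1 - H)).sum() / ((1 - H).sum() + 1e-8)  # mean outside (Eq. 16)
    dphi_dt = dirac(phi, eps) * (-lam1 * (I - c1)**2 + lam2 * (I - c2)**2)
    return phi + eta * dphi_dt                         # Eq. (17)

# toy image: bright square on a dark background
I = np.zeros((32, 32)); I[8:24, 8:24] = 1.0
ys, xs = np.mgrid[:32, :32]
phi = 12.0 - np.hypot(xs - 16, ys - 16)  # initial circular contour
for _ in range(200):
    phi = chan_vese_step(I, phi)
seg = phi > 0
# the evolved zero level set keeps the bright square as foreground
assert seg[16, 16] and not seg[0, 0]
```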

## 3. Recurrent level set for object instance segmentation

To reformulate the level set under an RNN/GRU deep framework, we need to solve the problems summarized in Table 1.


As shown in Table 1, one of the most challenging problems of reformulating LS as RNNs is data configuration. RNNs work on sequential data, whereas LS uses a single image as input and produces a mask as a single image, too. The first question here is how to generate sequential data from a single image. Moreover, there are two source inputs used in LS, i.e., an input image I and an initial contour, which is treated as the initial LS function ϕ_0 and updated by Eq. (17). That means we need to generate sequential data x_t (t = 1, ⋯, N) from the single image I in our proposed RLS. In order to achieve this goal, we define a function g(I, ϕ_{t-1}) as in Eq. (18).

$$\mathbf{x}\_{t} = \mathbf{g}(\mathbf{I}, \phi\_{t-1}) = \phi\_{t-1} + \eta \left[ \kappa (\phi\_{t-1}) - \mathbf{U}\_{\mathbf{g}} (\mathbf{I} - c\_{1})^{2} + \mathbf{W}\_{\mathbf{g}} (\mathbf{I} - c\_{2})^{2} \right] \tag{18}$$

In Eq. (18), c_1 and c_2 are the average values inside and outside of the contour represented by the LS function ϕ_{t-1}, as defined in Eq. (16). κ denotes the curvature defined in Eq. (15). U_g and W_g are two matrices that control the forces inside and outside of the contour.
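Eq. (18) can be sketched numerically under the simplifying assumption that U_g and W_g act as per-pixel weight maps applied elementwise; the curvature is approximated with standard finite differences, and all sizes are illustrative:

```python
import numpy as np

def curvature(phi, eps=1e-8):
    """kappa = -div(grad(phi)/|grad(phi)|) via finite differences (Eq. 15)."""
    py, px = np.gradient(phi)                # derivatives along y, then x
    norm = np.sqrt(px**2 + py**2) + eps
    nyy, _ = np.gradient(py / norm)          # d/dy of normalized y-component
    _, nxx = np.gradient(px / norm)          # d/dx of normalized x-component
    return -(nxx + nyy)

def region_means(I, phi):
    inside = phi > 0
    c1 = I[inside].mean() if inside.any() else 0.0
    c2 = I[~inside].mean() if (~inside).any() else 0.0
    return c1, c2

def g(I, phi_prev, Ug, Wg, eta=0.1):
    """Sequential input x_t = g(I, phi_{t-1}) of Eq. (18).
    Ug, Wg are assumed here to be per-pixel weight maps."""
    c1, c2 = region_means(I, phi_prev)
    force = curvature(phi_prev) - Ug * (I - c1)**2 + Wg * (I - c2)**2
    return phi_prev + eta * force

I = np.zeros((16, 16)); I[4:12, 4:12] = 1.0   # toy image
ys, xs = np.mgrid[:16, :16]
phi0 = 5.0 - np.hypot(xs - 8, ys - 8)         # toy initial level set
Ug, Wg = np.ones_like(I), np.ones_like(I)
x1 = g(I, phi0, Ug, Wg)
assert x1.shape == phi0.shape
```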

With this equation, the curve evolution in the proposed RLS functions the same way as in the traditional LS defined by Eq. (14): the input at the t-th iteration, x_t, is updated based on the input image I and the previous LS function ϕ_{t-1}. In our proposed RLS, x_t is considered the sequential input, whereas the LS update ϕ_t is treated as the hidden state update. Notably, the initial contour, i.e., the initial LS function, plays the role of the initial hidden state and is randomly generated. The relationships among I, x_t, and ϕ_t are given in Figure 2.

So far, the question of generating an input sequence from the single image I has been answered. In the proposed RLS, we use the same inputs as defined in the LS problem, namely the input image I and the initial ϕ_0, and we generate sequential data x_t from this single-image input. The next challenging problem is how to generate the hidden state ϕ_t from the input data x_t and the previous hidden state ϕ_{t-1}. Following the same intuition as GRUs [9], the hidden state ϕ_t is generated from the update gate z_t, the candidate memory content h̃_t, and the previous activation unit ϕ_{t-1}, following the rule given in Eq. (19).

#### Table 1.

Comparison of input, update rule and output used in traditional LS, GRUs, and our proposed RLS.

Figure 2.

The visualization of generating sequence input data x_t and hidden state ϕ_t from the input image I and the initial LS function ϕ_0.

$$
\phi\_t = \mathbf{z}\_t \mathbf{\tilde{h}}\_t + (\mathbf{1} - \mathbf{z}\_t) \phi\_{t-1} \tag{19}
$$

The update gate z_t, which controls how much of the previous memory content is to be forgotten and how much of the new memory content is to be added, is defined as in Eq. (20).

$$\mathbf{z}\_t = \sigma(\mathbf{U}\_z \mathbf{x}\_t + \mathbf{W}\_z \boldsymbol{\phi}\_{t-1} + \mathbf{b}\_z) \tag{20}$$

where σ is a sigmoid function and b_z is the update bias. However, in such a naive definition, the proposed RLS does not have any mechanism to control the degree to which its state is exposed; it exposes the whole state each time. To address that issue, the new candidate memory content h̃_t is computed as in Eq. (21).

$$\tilde{\mathbf{h}}\_t = \tanh\left(\mathbf{U}\_{\tilde{h}}\mathbf{x}\_t + \mathbf{W}\_{\tilde{h}}(\phi\_{t-1}\odot\mathbf{r}\_t) + \mathbf{b}\_{\tilde{h}}\right) \tag{21}$$

where ⊙ denotes element-wise multiplication and b_h̃ is the hidden bias. The reset gate r_t is computed similarly to the update gate, as in Eq. (22).

Figure 3.

The proposed RLS network for the curve updating process under the sequential evolution, and its forward computation of curve evolution from time t − 1 to time t.

$$\mathbf{r}\_{t} = \sigma(\mathbf{U}\_{r}\mathbf{x}\_{t} + \mathbf{W}\_{r}\boldsymbol{\phi}\_{t-1} + \mathbf{b}\_{r}) \tag{22}$$

where b_r is the reset bias. When r_t is close to 0 (off), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state. The output o is computed from the current hidden state ϕ_t, and then a softmax function is applied to obtain the foreground/background segmentation ŷ given the input image, as follows:

$$\begin{aligned} \hat{\mathbf{y}} &= \text{softmax}(\mathbf{o}) \\ \mathbf{o} &= \mathbf{V}\phi\_t + \mathbf{b}\_V \end{aligned} \tag{23}$$

where V is the weight matrix between the hidden state and the output. Figure 3 (left) illustrates the proposed RLS in folded mode, where the input of the network is defined the same as in the LS model, i.e., a given input image I and an initial LS function ϕ_0. In our proposed RLS, the curve evolution from ϕ_t at the present time step t to ϕ_{t+1} at time t + 1 is designed in the same fashion as the hidden state in GRUs and is illustrated in Figure 3 (right), where ϕ_{t+1} depends on both the previous LS function ϕ_t and the present input x_{t+1}. We summarize the comparison among LS, GRUs, and our proposed RLS in Table 1 and visualize the relationship between LS and GRUs in Figure 4. It is easy to see that RLS uses the same input as traditional LS, whereas the curve evolution (i.e., the update procedure) and the output in the proposed RLS follow a similar fashion as in GRUs.

We summarize the proposed building block RLS in Algorithm 1.

Algorithm 1. Algorithm of the proposed building block RLS

Input: Given an image I, an initial level set function ϕ_0, time steps N, learning rate η, initial parameters θ = (U_z, W_z, b_z, U_r, W_r, b_r, U_h̃, W_h̃, b_h̃, V, b_V).

```
for each epoch do
  Set ϕ = ϕ_0
  for t = 1 : N do
    Generate RLS input x_t:          x_t ← g(I, ϕ_{t-1})
    Compute the update gate z_t, reset gate r_t and candidate hidden unit h̃_t:
        z_t ← σ(U_z x_t + W_z ϕ_{t-1} + b_z)
        r_t ← σ(U_r x_t + W_r ϕ_{t-1} + b_r)
        h̃_t ← tanh(U_h̃ x_t + W_h̃ (ϕ_{t-1} ⊙ r_t) + b_h̃)
    Update the zero LS ϕ_t:          ϕ_t ← z_t h̃_t + (1 − z_t) ϕ_{t-1}
  end for
  Compute the output:                ŷ ← softmax(V ϕ_t + b_V)
  Compute the loss function L:       L ← −∑_n y_n log ŷ_n
  Compute the derivative w.r.t. θ:   ∇_θ ← ∂L/∂θ
  Update θ:                          θ ← θ − η ∇_θ
end for
```

The proposed RLS described in the previous section performs the image segmentation task, i.e., dividing an image into two parts corresponding to foreground and background segments. Given a real-world image, however, RLS performs neither instance segmentation nor image understanding, which are significant tasks in many computer vision applications. To address this requirement, we introduce contextual recurrent level sets (CRLS) for semantic object segmentation, an extension of our proposed RLS model to multi-instance object segmentation in the wild. The proposed CRLS is able to (1) localize objects existing in the given image; (2) segment the objects out of the background; and (3) classify the objects in the image. The output of our CRLS is multiple values (each corresponding to one object class) instead of two values (foreground and background) as in RLS.

Our proposed CRLS inherits the merits of RLS and Faster R-CNN [12] for semantic segmentation, simultaneously performing three tasks addressed by three stages, i.e., detection, segmentation, and classification, in a fully end-to-end trainable framework, as shown in Figure 5. In our CRLS, the network takes an image of arbitrary size as the input and outputs instance-aware semantic segmentation results. The network contains three components corresponding to the three stages of semantic instance segmentation: object detection proposes box-level instances, object segmentation produces mask-level results, and object classification categorizes each instance. These three stages are designed to share convolutional features. Each stage involves a loss term, and the loss of a later stage relies on the output of its predecessor stage, so the three loss terms are not independent. We train the entire network end-to-end with a unified loss function.

Figure 4. Visualization of relationship between LS and GRUs in a RLS unit.

Figure 5.

The flowchart of our proposed CRLS for semantic instance segmentation with three tasks which are addressed by three stages, i.e. object detection by Faster R-CNN [12], object segmentation by RLS and classification.

#### 3.1 Stage 1: object detection

One of the most important approaches to the object detection and classification problems is the family of region-based convolutional neural networks (R-CNN) [12–15].

In order to propose a fully end-to-end trainable network, we adapt the region proposal network (RPN) introduced in Faster R-CNN [12] to predict the object bounding boxes and the objectness scores. To reduce computation time, both the detection and segmentation procedures share the convolutional features of a deep VGG-16 network [16].

For the shared convolutional features, we utilize a VGG-16 network with 13 convolution layers, where each convolution layer is followed by a ReLU layer, but only four pooling layers are placed right after convolution layers to reduce the spatial dimension.

In this stage, object detection, the network proposes object instances in the form of bounding boxes, which are predicted with an objectness score. The network structure and loss function of this stage follow the work on Region Proposal Networks (RPNs) [12]. RPNs take an image of any size as the input and predict bounding box locations and objectness scores in a fully-convolutional form. A 3 × 3 convolutional layer for reducing the dimension is applied on top of the shared convolutional features. The lower dimensional features are fed into two sibling 1 × 1 convolutional layers: one is a box-regression layer and the other is a box-classification layer.

From the shared feature maps and at each sliding-window location, multiple region proposals are generated. Denote by k the maximum number of possible proposals for each location. Each box is represented by four values corresponding to its location (x, y, w, h) and two scores corresponding to the probability of object or not object. Each anchor, which is centered at the sliding window, is associated with an aspect ratio and a scale. In the experiments, each sliding position has k = 9 anchors corresponding to three scales and three aspect ratios. If the shared feature maps are of size W × H, there are k × W × H anchors. For training the object detector, only two types of anchors are considered: positive anchors, which have high overlap with a ground-truth box, and negative anchors, which have low overlap.



There are two sibling output layers at the end of the RPNs, classification probability and bounding box regression; therefore, the RPN loss is based on two losses: the box regression loss and the object classification loss.

$$L\_{\rm RPN}(p^i, b^i) = \frac{1}{N\_{\rm cls}} \sum\_i L\_{\rm cls}(p^i, p^{i\*}) + \lambda \frac{1}{N\_{\rm reg}} \sum\_i p^{i\*} L\_{\rm reg}(b^i, b^{i\*}) \tag{24}$$

In this equation, p^i is the predicted probability of anchor i being an object and p^{i\*} is its ground-truth label. The box parameterizations b^i and b^{i\*} are computed relative to the anchor box (x^a, y^a, w^a, h^a) as follows:


$$\begin{aligned} b\_{\mathbf{x}}^{i\*} &= \left(\mathbf{x}^{i\*} - \mathbf{x}^{a}\right) / w^{a} & b\_{\mathbf{x}}^{i} &= \left(\mathbf{x}^{i} - \mathbf{x}^{a}\right) / w^{a} \\ b\_{\mathbf{y}}^{i\*} &= \left(\mathbf{y}^{i\*} - \mathbf{y}^{a}\right) / h^{a} & b\_{\mathbf{y}}^{i} &= \left(\mathbf{y}^{i} - \mathbf{y}^{a}\right) / h^{a} \\ b\_{w}^{i\*} &= \log\left(w^{i\*} / w^{a}\right) & b\_{w}^{i} &= \log\left(w^{i} / w^{a}\right) \\ b\_{h}^{i\*} &= \log\left(h^{i\*} / h^{a}\right) & b\_{h}^{i} &= \log\left(h^{i} / h^{a}\right) \end{aligned} \tag{25}$$

• The bounding box regression loss is defined via the smooth L1 function as follows:

$$L\_{\text{reg}} = \sum\_{u \in \{x, y, w, h\}} smooth\_{l\_1} (b\_u^i - b\_u^{i\*}) \tag{26}$$

$$smooth\_{l\_1}(\mathbf{x}) = \begin{cases} \frac{1}{2}\mathbf{x}^2 & \text{if } \quad |\mathbf{x}| < \mathbf{1} \\\\ |\mathbf{x}| - \mathbf{0.5} & \text{otherwise} \end{cases} \tag{27}$$
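The smooth L1 function of Eq. (27) in code:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Eq. 27): quadratic near zero, linear in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x**2, np.abs(x) - 0.5)

assert smooth_l1(0.0) == 0.0
assert smooth_l1(0.5) == 0.125   # 0.5 * 0.5**2
assert smooth_l1(2.0) == 1.5     # 2.0 - 0.5
assert smooth_l1(-2.0) == 1.5    # symmetric in x
```

The quadratic region keeps gradients small for near-correct boxes, while the linear tails make the loss robust to outliers.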

• The classification loss (L_cls) is the log loss over two classes (object vs. not object) and is defined as in Eq. (28).

$$L\_{\rm cls}(p^i, p^{i\*}) = -\log\left(p^i\right). \tag{28}$$

For ease of following, we denote the loss of this stage as L_RPN = L_RPN(B(Θ)), where Θ represents all network parameters to be optimized and B is the network output of this stage, representing a list of bounding boxes B = {b^i}. Each bounding box b^i = (b_x, b_y, b_w, b_h, p^i) is represented by four coordinates (box center, width, and height) and the predicted objectness probability p^i.
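As an illustration of the box parameterization in Eq. (25), encoding a box relative to an anchor and then decoding it should round-trip. The helper names and box values are illustrative:

```python
import numpy as np

def encode(box, anchor):
    """(x, y, w, h) box -> (bx, by, bw, bh) offsets w.r.t. anchor (Eq. 25)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(b, anchor):
    """Inverse of encode: recover (x, y, w, h) from offsets."""
    bx, by, bw, bh = b
    xa, ya, wa, ha = anchor
    return np.array([xa + bx * wa, ya + by * ha,
                     wa * np.exp(bw), ha * np.exp(bh)])

anchor = (50.0, 60.0, 32.0, 16.0)   # center x, center y, width, height
box = (58.0, 52.0, 48.0, 24.0)
b = encode(box, anchor)
assert np.allclose(decode(b, anchor), box)   # round-trip recovers the box
```

Normalizing by the anchor size makes the regression targets roughly scale-invariant, and the log parameterization keeps predicted widths and heights positive after decoding.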

#### 3.2 Stage 2: object segmentation

The second stage takes the shared features and the results from the first stage (predicted boxes) as inputs. The output of this stage is a binary segmentation which contains foreground (object) and background.

For each predicted box, we make use of an RoI warping layer to crop and warp a region on the feature map into a fixed target size by interpolation. Each predicted box on the deep feature map (conv5) is resized to m × m, where m = 21 in our experiments. To reduce the size of a region, the RoI warping layer uses a pooling layer to convert the features inside any valid region of interest into a small feature map. Each RoI is defined by its top-left corner (r, c) and its height and width (h, w), i.e., by the four-tuple (r, c, h, w). The feature map of size h × w is partitioned into an m × m grid of sub-windows of approximate size (h/m) × (w/m). Max pooling is then applied to each sub-window. As a result, the RoI warping layer outputs an m × m grid of cells.
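The RoI pooling step described above (partition the h × w region into an m × m grid and max-pool each sub-window) can be sketched as follows; the feature map, the RoI, and the grid size are illustrative assumptions:

```python
import numpy as np

def roi_max_pool(feat, roi, m):
    """Max-pool the RoI (r, c, h, w) of feat into an m x m grid."""
    r, c, h, w = roi
    region = feat[r:r + h, c:c + w]
    out = np.empty((m, m))
    # integer bin edges so the sub-windows cover the region exactly
    re = np.linspace(0, h, m + 1).astype(int)
    ce = np.linspace(0, w, m + 1).astype(int)
    for i in range(m):
        for j in range(m):
            out[i, j] = region[re[i]:re[i + 1], ce[j]:ce[j + 1]].max()
    return out

rng = np.random.default_rng(2)
feat = rng.normal(size=(40, 40))
pooled = roi_max_pool(feat, roi=(5, 7, 21, 14), m=7)
assert pooled.shape == (7, 7)
```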

The fixed-size extracted features are passed through the proposed RLS together with a randomly initialized ϕ_0 to generate sequential input data x_t based on Eq. (18). The curve evolution procedure is performed via the LS updating process given in Eqs. (14) and (17). This task outputs a binary mask, as given in Eq. (23), sized m × m and parameterized by an m²-dimensional vector.

Given a set of predicted bounding boxes from the first stage, the loss term L_SEG of the second stage for foreground segmentation is given by:

$$L\_{\rm SEG} = L\_{\rm SEG}(\mathcal{S}(\Theta) | B(\Theta)) \tag{29}$$

Here, S is the output of the network designed by RLS, representing a list of segmentations S = {S^i}. Each segmentation S^i is an m × m mask.

#### 3.3 Stage 3: object classification via fully-connected network

The third stage takes the shared features, the bounding boxes from the first stage, and the object segmentations from the second stage as inputs. The output of this stage is a class score for each object instance.

For each predicted bounding box from the first stage, we first extract the feature by RoI pooling (f_ROI(Θ)). The feature then goes through the proposed RLS and is represented by a binary mask (f_SEG(Θ)) from the second stage. The input of this stage, f_MSK(Θ), is the masked feature, which depends on the segmentation result f_SEG(Θ) and the RoI feature f_ROI(Θ) and is computed by their element-wise product: f_MSK(Θ) = f_ROI(Θ) ∗ f_SEG(Θ). To predict the category of a region, we first apply two fully-connected layers to the masked feature f_MSK. As a result, we obtain a mask-based feature vector, which is then concatenated with a box-based feature vector to build a joint feature vector. Notably, the box-based feature vector is computed by applying two fully-connected layers to the RoI feature f_ROI. Finally, two fully-connected layers are attached to the joint feature, giving class scores and refined bounding boxes using softmax classification over (K + 1) classes (including background). The entire procedure of Stage 3 is illustrated in Figure 6.

Figure 6. An illustration of Stage 3 for object classification using f_ROI(Θ) ∗ f_SEG(Θ) as inputs.
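The masked feature f_MSK = f_ROI ∗ f_SEG is a plain element-wise product. The toy sketch below assumes a single-channel RoI feature and a binary mask of the same spatial size (both illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 21
f_roi = rng.normal(size=(m, m))                    # RoI-pooled feature (toy)
f_seg = (rng.random((m, m)) > 0.5).astype(float)   # binary mask from Stage 2
f_msk = f_roi * f_seg                              # element-wise product

# background positions are zeroed, foreground positions are unchanged
assert np.all(f_msk[f_seg == 0] == 0)
assert np.allclose(f_msk[f_seg == 1], f_roi[f_seg == 1])
```

Masking before the fully-connected layers forces the classifier to score only the features belonging to the segmented instance.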

Let C be the network output of this stage, representing a list of category predictions for all instances: C = {C^i}. The loss term L_CLS of the third stage is expressed in Eq. (30).

$$L\_{\rm CLS} = L\_{\rm CLS}(\mathbf{C}(\Theta) | \mathbf{B}(\Theta), \mathbf{S}(\Theta)) \tag{30}$$

The loss of the entire proposed CRLS network is defined as in Eq. (31).

$$\begin{split} L(\Theta) &= L\_{\text{RPN}} + L\_{\text{SEG}} + L\_{\text{CLS}} \\ &= L\_{\text{RPN}}(\mathcal{B}(\Theta)) + L\_{\text{SEG}}(\mathcal{S}(\Theta)|\mathcal{B}(\Theta)) + L\_{\text{CLS}}(\mathcal{C}(\Theta)|\mathcal{B}(\Theta), \mathcal{S}(\Theta)) \end{split} \tag{31}$$

### 4. Inference

In our experiments, given an image, the top-scored 300 RoIs are first chosen from the RPN proposed boxes. Non-maximum suppression (NMS) with an IoU threshold of 0.7 is used to filter out highly overlapping and redundant candidates. The proposed RLS is then applied to each RoI to partition it into object (foreground) and non-object (background). From each RoI, we obtain a segmentation mask and a category score prediction via two fully-connected layers followed by a softmax layer.
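A minimal sketch of greedy NMS with an IoU threshold of 0.7; the (x1, y1, x2, y2) box layout and the sample values are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: keep the highest-scored box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        # retain only boxes whose overlap with box i is below the threshold
        order = order[1:][[iou(boxes[i], boxes[j]) <= thresh
                           for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
assert nms(boxes, scores) == [0, 2]   # box 1 overlaps box 0 (IoU 0.81)
```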

Besides the softmax loss, which is optimized when classifying a region as object or non-object in the first stage, there are three "dependent" losses in which each later loss depends on its predecessor. Three losses are assigned for each RoI: (1) the smooth L1 loss, i.e., the bounding box regression loss in the first stage; (2) the cross-entropy loss, i.e., the foreground segmentation loss in the second stage; and (3) the softmax loss, i.e., the instance classification loss in the third stage. Among all the losses, computing the gradient w.r.t. the predicted box positions and addressing the dependency on the bounding box B(Θ) is the most difficult. Given a full-size feature map F(Θ), we crop a bounding box region (predicted box) b_i(Θ) = (b_x, b_y, b_w, b_h) and warp it to a fixed size by interpolation. The warped RoI region is represented as:

$$\mathcal{F}\_{ROI}(\Theta) = \mathcal{T}(b\_i(\Theta))\mathcal{F}(\Theta) \tag{32}$$

where T denotes the cropping and warping operation. As for dimensions, F(Θ) ∈ ℝ^N is a vector reshaped from the image, with N = W × H. The cropping and warping matrix T ∈ ℝ^{M×N}, with M = w × h, maps onto the target region F_ROI(Θ) of size w × h; T(b_i(Θ)) represents transforming the box b_i(Θ) from size b_w × b_h into size w × h. Let (x′, y′) and (x, y) be two points on the target feature map F_ROI(Θ) of size w × h and on the original map F(Θ) of size W × H, respectively. Using bilinear interpolation with kernel G, T(b_i(Θ)) is computed as follows:

$$\begin{aligned} \mathcal{T}(b\_i(\Theta)) &= G\left( b\_x + \frac{x'}{w} - x \right) G\left( b\_y + \frac{y'}{h} - y \right) \\ G(z) &= \max(0, 1 - |z|) \end{aligned} \tag{33}$$
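The crop-and-warp operation with the bilinear kernel G(z) = max(0, 1 − |z|) can be sketched as follows. The function names and sizes are illustrative, and an identity-size warp is used as a sanity check:

```python
import numpy as np

def G(z):
    """Bilinear interpolation kernel of Eq. (33)."""
    return np.maximum(0.0, 1.0 - np.abs(z))

def crop_and_warp(F, box, out_h, out_w):
    """Resample the region box = (x0, y0, bw, bh) of F onto an
    out_h x out_w grid with bilinear interpolation."""
    x0, y0, bw, bh = box
    H, W = F.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # source coordinates of target pixel (i, j)
            sy = y0 + (i + 0.5) / out_h * bh - 0.5
            sx = x0 + (j + 0.5) / out_w * bw - 0.5
            yy = np.arange(max(0, int(sy) - 1), min(H, int(sy) + 3))
            xx = np.arange(max(0, int(sx) - 1), min(W, int(sx) + 3))
            wy = G(yy - sy)[:, None]   # kernel weights along y
            wx = G(xx - sx)[None, :]   # kernel weights along x
            out[i, j] = (F[np.ix_(yy, xx)] * wy * wx).sum()
    return out

F = np.arange(100, dtype=float).reshape(10, 10)
warped = crop_and_warp(F, box=(2, 2, 4, 4), out_h=4, out_w=4)
# identity-size warp of a 4x4 region reproduces the pixel values
assert np.allclose(warped, F[2:6, 2:6])
```

Because G is piecewise linear, the whole operation is differentiable almost everywhere in the box coordinates, which is what allows the gradient w.r.t. the predicted box positions to flow through this layer.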
