5. Implementation details

The proposed RLS approach is implemented in TensorFlow [17], whereas the extended CRLS semantic segmentation is implemented in Caffe [18]. Three functional sub-networks corresponding to the three tasks, i.e., detection, segmentation and classification (shown in Figure 5), are connected to form an end-to-end network. The first and third sub-networks, i.e., object detection and object classification, are adopted from the Faster R-CNN framework [12]. For the second sub-network, we re-implement the layers defined in RLS-TensorFlow as RLS-Caffe layers, writing both the forward and backward operations in Python. Because TensorFlow supports automatic differentiation, RLS-TensorFlow is easier to implement than the new layers in RLS-Caffe, whose backward passes must be written by hand.
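
To illustrate what writing both forward and backward operations by hand involves, the following is a minimal sketch of a layer with an explicit backward pass, checked against a finite-difference gradient. The class `SmoothHeavisideLayer`, the helper `numeric_grad` and the choice of a smoothed Heaviside function are assumptions made for illustration only; they are not the actual RLS-Caffe layers, and the sketch does not use the Caffe API.

```python
import math


class SmoothHeavisideLayer:
    """Hypothetical layer with hand-written forward and backward passes."""

    def __init__(self, eps=1.0):
        self.eps = eps
        self.phi = None  # input cached by forward() for use in backward()

    def forward(self, phi):
        # H(phi) = 0.5 * (1 + (2 / pi) * atan(phi / eps)), applied elementwise
        self.phi = list(phi)
        return [0.5 * (1.0 + (2.0 / math.pi) * math.atan(p / self.eps))
                for p in self.phi]

    def backward(self, grad_out):
        # dH/dphi = eps / (pi * (eps^2 + phi^2)), multiplied by the upstream
        # gradient (chain rule)
        return [g * self.eps / (math.pi * (self.eps ** 2 + p ** 2))
                for g, p in zip(grad_out, self.phi)]


def numeric_grad(layer, phi, index, h=1e-6):
    """Central finite difference of sum(forward(phi)) w.r.t. phi[index]."""
    plus = list(phi)
    plus[index] += h
    minus = list(phi)
    minus[index] -= h
    return (sum(layer.forward(plus)) - sum(layer.forward(minus))) / (2 * h)
```

With automatic differentiation the `backward` method would not be needed; in a hand-written layer, a gradient check of this kind is a common way to validate it.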

In the experiments, the pre-trained VGG-16 model with 13 convolutional layers is used to obtain the shared features. Each convolutional layer is followed by a ReLU layer. Four max-pooling layers are interspersed, each placed right after a convolutional layer.
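
The backbone layout can be summarized in a short sketch. The list below is the standard VGG-16 convolutional configuration with the fifth pooling layer omitted, which is an assumption consistent with the four max-pooling layers stated above; the names `VGG16_CONV_CFG`, `summarize` and `feature_map_size` are hypothetical.

```python
import math

# Numbers are output channels of 3x3 convolutional layers (each followed by a
# ReLU); "M" marks a 2x2 max-pooling layer with stride 2.
# Assumption: the fifth pooling layer of the original VGG-16 is dropped.
VGG16_CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512]


def summarize(cfg):
    """Return (number of conv layers, number of pooling layers, total stride)."""
    n_conv = sum(1 for v in cfg if v != "M")
    n_pool = sum(1 for v in cfg if v == "M")
    return n_conv, n_pool, 2 ** n_pool


def feature_map_size(height, width, stride):
    """Spatial size of the shared feature map.

    Assumes every pooling layer rounds its output size up; the exact value
    depends on the framework's rounding convention.
    """
    return math.ceil(height / stride), math.ceil(width / stride)
```

Under these assumptions, `summarize(VGG16_CONV_CFG)` gives 13 convolutional layers, 4 pooling layers and a total stride of 16, so a 600 × 800 input yields a shared feature map of about 38 × 50.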

In the first task of CRLS, i.e., object detection, we use a 3 × 3 convolutional layer to reduce the feature dimensions and learn the feature representation; two consecutive 1 × 1 convolutional layers then predict the objects' locations and object presence scores. The two normalization terms are chosen as N<sub>cls</sub> = 256 and N<sub>reg</sub> = 2400, and the balancing parameter is set to λ = 10. Non-maximum suppression (NMS) is used to reduce the number of boxes generated by the first stage (~10<sup>4</sup> regressed boxes are produced); the threshold of the Intersection-over-Union (IoU) ratio is chosen as 0.7. After NMS, the top-ranked 300 boxes are kept for the second stage.
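
The box-filtering step can be sketched as a greedy NMS with the IoU threshold of 0.7 and the limit of 300 kept boxes given above. This is a plain-Python illustration of the standard algorithm, not the code used in the experiments; the function names and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, top_k=300):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep
```

The pairwise loop is quadratic in the number of boxes; for the roughly 10<sup>4</sup> boxes mentioned above, practical implementations vectorize this step or run it on the GPU.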

In the second task of CRLS, i.e., object segmentation by the proposed RLS, we first extract a fixed-size (21 × 21) deep feature from an arbitrary box predicted by the object detection task. The proposed RLS takes the extracted feature as input, together with a randomly initialized ϕ<sup>0</sup>, to generate the input sequence x<sub>t</sub> based on Eq. (18). Curve evolution is performed via the LS updating process given in Eq. (14). This task outputs a binary mask, as given in Eq. (23), of size m × m, parameterized by an m<sup>2</sup>-dimensional vector.
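
How the fixed-size feature is obtained from a box of arbitrary size is not detailed here. One common choice, assumed in the sketch below, is ROI max pooling as in Fast/Faster R-CNN: the box is divided into a 21 × 21 grid of bins and the maximum is taken in each bin. The functions `bin_edges` and `roi_max_pool` are hypothetical, operate on a single channel, and take box coordinates on the feature map with inclusive corners.

```python
def bin_edges(length, n_bins):
    """Split `length` cells into n_bins ranges [start, end); bins may overlap
    but are never empty as long as length >= 1."""
    return [((i * length) // n_bins,            # floor(i * length / n_bins)
             -((-(i + 1) * length) // n_bins))  # ceil((i + 1) * length / n_bins)
            for i in range(n_bins)]


def roi_max_pool(feature, box, out_size=21):
    """Max-pool region box=(x1, y1, x2, y2) of a 2-D feature map (list of
    rows) into an out_size x out_size grid."""
    x1, y1, x2, y2 = box
    rows = bin_edges(y2 - y1 + 1, out_size)
    cols = bin_edges(x2 - x1 + 1, out_size)
    return [[max(feature[y1 + r][x1 + c]
                 for r in range(r0, r1)
                 for c in range(c0, c1))
             for (c0, c1) in cols]
            for (r0, r1) in rows]
```

For a multi-channel feature map, the same pooling would be applied to each channel independently.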

In the final task of CRLS, i.e., object classification, we extract a feature representation for each ROI from the shared convolutional features inside its bounding box. The second task of CRLS (object segmentation) provides the predicted segmentation mask for that ROI. The masked feature then passes through two fully-connected layers to produce the classification score for that ROI.
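
The text does not state how the mask is applied to the ROI feature. A minimal sketch, assuming elementwise multiplication of every feature channel by the binary mask followed by flattening for the fully-connected layers, could look as follows; `mask_features` and `flatten` are hypothetical helpers.

```python
def mask_features(features, mask):
    """Apply a binary m x m mask to each channel of a C x m x m feature.

    Assumption: masking is an elementwise product, so positions outside the
    predicted object are set to zero.
    """
    return [[[value * keep for value, keep in zip(feat_row, mask_row)]
             for feat_row, mask_row in zip(channel, mask)]
            for channel in features]


def flatten(features):
    """Flatten a C x m x m feature into one vector for a fully-connected layer."""
    return [value for channel in features for row in channel for value in row]
```

For example, a 2 × 2 × 2 feature masked with `[[1, 0], [0, 1]]` keeps only the diagonal entries of each channel, and `flatten` turns the result into a vector of length 8.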

The proposed CRLS framework is implemented in Caffe [18] and trained with SGD. For each training image, if its shorter side is larger than 600 pixels, the image is down-scaled so that the shorter side is 600 pixels. For the experiments on PASCAL VOC [19], we train the network for 32k and 8k iterations at learning rates of 0.001 and 0.0001, respectively. On a larger dataset such as MSCOCO [20], we train the network for 180k and 20k iterations at learning rates of 0.001 and 0.0001, respectively.
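
The image rescaling rule and the training schedule can be sketched as below. The sketch reads the schedule as two consecutive phases (for PASCAL VOC, 32k iterations at 0.001 followed by 8k iterations at 0.0001), which is an interpretation of the text; the function names, the rounding of the rescaled size and the phase-list format are assumptions.

```python
def rescale_size(height, width, target_short=600):
    """Down-scale so the shorter side equals target_short; never up-scale."""
    short = min(height, width)
    if short <= target_short:
        return height, width
    scale = target_short / short
    return round(height * scale), round(width * scale)


# Each phase is (number of iterations, learning rate).
VOC_SCHEDULE = ((32000, 0.001), (8000, 0.0001))
COCO_SCHEDULE = ((180000, 0.001), (20000, 0.0001))


def learning_rate(iteration, schedule=VOC_SCHEDULE):
    """Learning rate at a zero-based iteration of a piecewise-constant schedule."""
    start = 0
    for n_iter, lr in schedule:
        if iteration < start + n_iter:
            return lr
        start += n_iter
    raise ValueError("iteration is beyond the end of the schedule")
```

Under this reading, training on PASCAL VOC runs for 40k iterations in total and training on MSCOCO for 200k.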
