**5. Our proposal**

Our proposal is focused on the terrain perception stage for off-road environments. Our approach uses an existing public dataset, the Freiburg Forest Dataset (FFD) [18], to train a convolutional neural network that segments five terrain classes in daylight under good weather conditions.

FFD is an open dataset of multi-modal/multi-spectral images, comprising 230 training images and 136 validation images, each with a manually annotated pixel-wise ground-truth segmentation mask. Besides RGB, the modalities include two vegetation indices, the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI), as well as near-infrared (NIR) and depth data. The five classes in this dataset are Obstacle, Trail, Sky, Grass, and Vegetation. All data were captured at 20 Hz with a camera resolution of 1024 × 768 pixels (**Figure 1**).

**Figure 1.** *Sample image from the Freiburg Forest dataset with its ground-truth mask.*
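The two vegetation indices have standard closed-form definitions over the spectral bands; the sketch below shows them in NumPy (using the widely used MODIS coefficients for EVI — FFD's exact preprocessing may differ):

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index, in [-1, 1]."""
    return (nir - red) / (nir + red + eps)

def evi(nir, red, blue, g=2.5, c1=6.0, c2=7.5, l=1.0):
    """Enhanced Vegetation Index (MODIS coefficients g, c1, c2, l)."""
    return g * (nir - red) / (nir + c1 * red - c2 * blue + l)
```

Both functions accept scalars or per-pixel NumPy arrays, so an index image can be computed from whole NIR/RGB frames in one call.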

The semantic segmentation was achieved using convolutional neural networks. For this step, we selected DeepLab [35], a model created by Google for semantic segmentation. This model is commonly used to segment objects such as people, vehicles, and animals. There are different versions of DeepLab; the latest, v3+, implements an encoder-decoder structure together with an atrous spatial pyramid pooling (ASPP) module. DeepLab supports different network backbones, such as Xception [36], MobileNet [37], ResNet [38], and PNASNet [39]. Pretrained checkpoints are also available for retraining on different data.
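The ASPP module is built on atrous (dilated) convolution: spreading the taps of a small kernel apart by a rate *r* enlarges its field of view without adding parameters, and ASPP runs several such branches with different rates in parallel. A minimal single-channel NumPy sketch of the operation (illustrative only, not DeepLab's implementation):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    """Single-channel 2-D atrous convolution with 'valid' padding.

    With rate r, a 3x3 kernel covers a (2r+1) x (2r+1) input window
    while still using only 9 weights.
    """
    kh, kw = kernel.shape
    # Effective kernel extent once taps are spread apart by `rate`.
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input every `rate` pixels under the kernel.
            patch = x[i:i + eh:rate, j:j + ew:rate]
            out[i, j] = np.sum(patch * kernel)
    return out
```

At rate 1 this reduces to an ordinary convolution; at rate 2 the same 3 × 3 kernel sees a 5 × 5 neighborhood, which is why ASPP can capture context at multiple scales cheaply.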

For this work, we selected MobileNet as the backbone due to its fast and lightweight structure. We selected two checkpoints, the first pretrained on ADE20K and the second on MS-COCO. The main reason for selecting these two checkpoints is the content of those datasets: both contain labels related to off-road environments, e.g., tree, sand, and ground. Since DeepLab expects its input in a specific format, all RGB images and PNG ground-truth masks were converted to TFRecord format.

All training and evaluation were run in Python on a laptop with an Intel Core i7-8750H CPU and an NVIDIA GTX 1050 Ti GPU.
