**2. Related work**

As mentioned before, there are different classifications of system architectures in AVs. Based on connectivity, we find ego-only systems and connected systems. In an ego-only system, a single self-sufficient vehicle carries out all the automated driving operations at all times. In contrast, connected systems depend on other vehicles and on infrastructure to make decisions. This last approach is still in an early phase, but with the growth of the Internet of Things (IoT) it is expected to become feasible, with vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and vehicle-to-everything (V2X) communication. A large amount of data will then be available to vehicles, so more informed decisions can be made; however, new challenges will arise, and AVs could become even more complex.

A second classification is based on algorithmic design: modular systems and end-to-end systems. In modular systems, automated driving is decomposed into separate tasks. Each module solves one task independently, and the results are integrated to form the complete system; this approach, however, is prone to error propagation. In end-to-end systems, on the other hand, the whole pipeline is treated as a black box: the system receives raw data from the sensors, and the output is the commands for the vehicle's actuators.

In this work, we discuss different proposals found in the literature based on modular systems. Tasks in this classification include object/pedestrian detection, road detection, obstacle avoidance, terrain perception, mapping of the environment, and path planning.

### **2.1 Object/pedestrian detection**

Object detection identifies and locates instances of objects in an image; each detected object is marked with a rectangular bounding box. The general steps in object detection are preprocessing, Region of Interest (ROI) extraction, object classification, and localization. Preprocessing covers subtasks such as exposure and gain adjustment, camera calibration, and image rectification; ROI extraction can also be implemented as a preprocessing step. Approaches that use ROI extraction usually have a higher computational cost, since the system becomes more complex, but their results are better. Another disadvantage of ROI extraction is processing time: in modular systems, time is an essential consideration because the other modules must also execute for the vehicle to make and implement decisions in real time.
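As an illustration of the preprocessing step, the sketch below applies a simple global gain adjustment and optional gamma correction to a frame. It is a minimal example of exposure/gain adjustment in general, not the procedure of any work cited here; the `target_mean` parameter and the toy frame are assumptions for the demonstration.

```python
import numpy as np

def adjust_exposure(image, target_mean=0.5, gamma=None):
    """Simple exposure/gain adjustment: scale intensities so the mean
    matches a target, then optionally apply gamma correction.
    Illustrative only -- real pipelines also calibrate and rectify."""
    img = image.astype(np.float64) / 255.0
    gain = target_mean / max(img.mean(), 1e-6)   # global gain factor
    img = np.clip(img * gain, 0.0, 1.0)
    if gamma is not None:
        img = img ** (1.0 / gamma)               # gamma correction
    return (img * 255).astype(np.uint8)

# Toy example: a dark synthetic frame brightened toward mid-gray.
frame = np.full((4, 4), 40, dtype=np.uint8)
out = adjust_exposure(frame, target_mean=0.5)
print(out.mean())
```

Camera calibration and image rectification would follow this step in a full preprocessing chain.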

*Automatic Terrain Perception in Off-Road Environments. DOI: http://dx.doi.org/10.5772/intechopen.99973*

The most common approach is Deep Convolutional Neural Networks (DCNN). One of the best-known DCNNs is YOLO (You Only Look Once) [4]. YOLO works with a single neural network that predicts bounding boxes, confidence scores for those boxes, and a class probability map. The network processes images at 45 FPS, and there is a modified version that is faster but less accurate. A different method is proposed by [5]; it consists of a multi-scale CNN with a proposal sub-network followed by a detection sub-network. The proposal network could work as a detector on its own, but it is not strong, since its sliding windows do not cover objects well; the detection network is therefore included to increase detection accuracy.
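To make the single-network idea concrete, the sketch below decodes a simplified YOLO-style output grid into bounding boxes. The `(S, S, 5)` tensor layout with one box per cell and the 448-pixel input size are simplifying assumptions for illustration; the real network also predicts per-class probabilities and multiple boxes per cell.

```python
import numpy as np

def decode_grid(pred, conf_thresh=0.5, img_size=448):
    """Decode a simplified YOLO-style output tensor of shape (S, S, 5):
    each cell predicts (x, y, w, h, confidence), with x, y relative to
    the cell and w, h relative to the whole image."""
    S = pred.shape[0]
    cell = img_size / S
    boxes = []
    for i in range(S):
        for j in range(S):
            x, y, w, h, conf = pred[i, j]
            if conf < conf_thresh:
                continue
            cx = (j + x) * cell          # box centre in pixels
            cy = (i + y) * cell
            boxes.append((cx, cy, w * img_size, h * img_size, conf))
    return boxes

# One confident detection in cell (row 1, column 2) of a 7x7 grid.
pred = np.zeros((7, 7, 5))
pred[1, 2] = [0.5, 0.5, 0.2, 0.3, 0.9]
boxes = decode_grid(pred)
print(boxes)
```

A single forward pass followed by this cheap decoding step is what makes the one-stage approach fast compared with proposal-based pipelines.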

Alternatively, Tabor et al. [6] not only apply their own initial implementation of a CNN but also consider aggregated channel features (ACF) and the deformable parts model (DPM). ACF uses a sliding-window approach in which candidate bounding boxes are considered at regular intervals throughout the image. DPM uses a two-stage classification process that models parts of an object moving relative to each other and to the object centroid. Another approach is Region Proposal Networks (RPN) followed by boosted forests [7], a simpler but effective method: the RPN generates the candidate boxes as well as convolutional feature maps, while the boosted forest classifies the proposals using those convolutional features.
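The sliding-window enumeration underlying ACF-style detection can be sketched as follows. The window size, stride, and single scale are illustrative assumptions; a real detector scans an image pyramid at several scales and scores each window with the learned channel features.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride=16):
    """Enumerate candidate bounding boxes (x, y, w, h) at regular
    intervals across the image, as in sliding-window detection.
    Single scale only, for brevity."""
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((x, y, win_w, win_h))
    return boxes

cands = sliding_windows(640, 480)
print(len(cands))
```

Even at a single scale, a 640x480 image yields hundreds of candidates, which is why cheap per-window features (ACF) or a learned proposal network (RPN) matter for runtime.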

### **2.2 Road detection**

There is no general definition of the road detection problem. Mei et al. [8] define it as "detecting the region in front of the robot that is mechanically traversable by the robot that is apt to be chosen by a human to drive." This definition can be applied to off-road environments where there are no defined roads as in cities. Approaches in the literature usually take the scenario into account, since some methods are more reliable in urban scenarios than off-road.

Lane and road boundary detectors have been proposed despite the lack of boundaries in some unstructured scenarios. Jiménez et al. [9] present a new algorithm along these lines using a laser scanner and, when available, a digital map. They apply two methods in parallel to increase the robustness of the results. The first method studies the variations in the detections of each layer of the laser scanner; boundaries are detected where sidewalks are higher than the road. The second method studies the separation between intersecting sections of consecutive laser-scanner layers: it identifies areas with constant radius differences within a predefined tolerance, which allows the roadway area to be determined.
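The idea of detecting a boundary where the sidewalk is higher than the road can be sketched as a height-discontinuity test along one scanner layer. The 8 cm threshold and the toy height profile are assumptions for illustration, not values from [9].

```python
import numpy as np

def detect_curbs(heights, step_thresh=0.08):
    """Flag indices in one laser-scanner layer where consecutive height
    readings jump by more than step_thresh metres -- a crude stand-in
    for detecting a sidewalk edge above the road surface."""
    diffs = np.abs(np.diff(heights))
    return np.where(diffs > step_thresh)[0]

# Flat road with a 12 cm sidewalk starting after index 4.
layer = np.array([0.0, 0.0, 0.01, 0.0, 0.0, 0.12, 0.12, 0.13])
curbs = detect_curbs(layer)
print(curbs)
```

Running two such complementary tests in parallel, as the cited work does, makes the boundary estimate more robust than either cue alone.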

Cameras are a common form of perception; some works consider the color model of the terrain surfaces and the illumination conditions to extract and segment roads [9], formulating the problem as a joint classification. Moreover, Procházka [10] uses a Monte-Carlo algorithm to segment road regions, estimating the probability density function (PDF) of road pixels from a sequence of observations; the sequential Monte-Carlo estimation is what approximates the PDF. In contrast, Li et al. [11] combine camera information with laser data: a preprocessing step detects roads, texture features are then analyzed in grayscale images, and the laser sensors provide a traversable region near the front of the vehicle.
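The PDF-based idea can be illustrated with a much simpler stand-in: estimate an intensity histogram from a seed region assumed to be road (for example, the area directly in front of the vehicle) and score every pixel against it. This histogram is only a sketch of the concept, not the sequential Monte-Carlo estimator of [10]; the seed region and bin count are assumptions.

```python
import numpy as np

def road_probability(img_gray, seed_mask, bins=16):
    """Estimate a per-intensity PDF of road pixels from a seed region
    and score every pixel by that PDF (higher = more road-like).
    A histogram stand-in for a sequential Monte-Carlo PDF estimate."""
    edges = np.linspace(0, 256, bins + 1)
    hist, _ = np.histogram(img_gray[seed_mask], bins=edges, density=True)
    idx = np.clip(np.digitize(img_gray, edges) - 1, 0, bins - 1)
    return hist[idx]

# Toy grayscale frame: left columns are road-like, right column is not.
img = np.array([[100, 100, 200],
                [100, 110, 210]], dtype=np.uint8)
seed = np.zeros_like(img, dtype=bool)
seed[:, :2] = True                     # assumed road seed region
prob = road_probability(img, seed)
print(prob[0, 0] > prob[0, 2])
```

Updating the estimate over a sequence of frames, rather than one seed region, is what the sequential Monte-Carlo formulation adds.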

### **2.3 Obstacle avoidance**

This task satisfies the objective of non-collision with objects and is closely related to path planning. Obstacle avoidance is a crucial aspect of autonomous driving; however, some researchers place more emphasis on optimizing crash avoidance, while others merely satisfy the task without optimizing it. Important aspects to consider are the vehicle's characteristics, such as the turning radius and the velocity. Similar to some path-planning proposals, some authors use cubic splines [12] to generate several candidate paths around obstacles, and the best path is selected using optimization techniques. Other approaches use fuzzy algorithms [13] to control AVs, considering vehicle dynamics and the geometry of the obstacles.
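The generate-and-select scheme can be sketched as follows: produce several cubic candidate paths with different lateral offsets and keep the one with the largest obstacle clearance. The cubic shape, clearance criterion, and all numeric values are illustrative assumptions, not the formulation of [12].

```python
import numpy as np

def cubic_path(lateral_offset, length=20.0, n=50):
    """Cubic polynomial y(x) from (0, 0) to (length, offset) with zero
    slope at both ends -- a common candidate-path shape."""
    x = np.linspace(0.0, length, n)
    s = x / length
    y = lateral_offset * (3 * s**2 - 2 * s**3)   # smoothstep cubic
    return np.stack([x, y], axis=1)

def best_path(offsets, obstacles, min_clear=0.8):
    """Pick the candidate whose closest approach to any obstacle is
    largest, discarding paths that come within min_clear metres."""
    best, best_clear = None, -np.inf
    for off in offsets:
        path = cubic_path(off)
        d = np.min(np.linalg.norm(
            path[:, None, :] - obstacles[None, :, :], axis=2))
        if d >= min_clear and d > best_clear:
            best, best_clear = off, d
    return best

obstacles = np.array([[10.0, 0.0]])      # one obstacle dead ahead
chosen = best_path([-2.0, 0.0, 2.0], obstacles)
print(chosen)
```

A real planner would also score candidates on curvature (respecting the turning radius) and speed, not clearance alone.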

### **2.4 Terrain perception**

This task is a vital component of AVs in off-road environments. In contrast with cities, off-road scenes are more unstructured, and surfaces are not expected to be flat. AVs must be able to decide whether the terrain ahead is easily passable, passable with caution, or better avoided. Usually, the information processed in this task comes from images. Another widely used sensor is the laser; information obtained with this kind of sensor helps build a 3D map of the scene to understand terrain at different altitudes.

Cameras are usually mounted on top of AVs, but in some cases, such as small robots, the camera views only the ground, so only one type of terrain is perceived at a time. In automobiles, the perspective is different: wider views are obtained that also contain the sky. Some researchers use a more classical approach, commonly a feature extractor followed by a classifier. There are different forms of feature extraction; for example, Filitchkin and Byl [14] use a bag of visual words built from speeded-up robust features (SURF). Other works use local binary patterns (LBP) and local ternary patterns (LTP) [15]. Besides that, some approaches combine features, for instance, color and edge directivity (CEDD) and the fuzzy color and texture histogram (FCTH) [16].
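As an example of such texture features, the sketch below computes the basic 8-neighbour LBP code for each interior pixel. This is the textbook formulation; the cited works use refinements (uniform patterns, LTP's ternary thresholding) not shown here.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour local binary pattern: each interior pixel gets
    an 8-bit code from thresholding its neighbours against the centre.
    Histograms of these codes serve as texture features."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        n = g[1 + dy: g.shape[0] - 1 + dy, 1 + dx: g.shape[1] - 1 + dx]
        code |= ((n >= c).astype(np.int32) << bit)
    return code

# A perfectly uniform patch: every neighbour >= centre, so all bits set.
patch = np.full((3, 3), 5, dtype=np.uint8)
codes = lbp_image(patch)
print(codes)
```

For terrain classification, the image is typically divided into cells and the per-cell LBP histograms are concatenated into the feature vector fed to the classifier.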

A classifier used not only in computer vision tasks but in other areas as well is the Support Vector Machine (SVM), one of the most robust prediction methods in the literature; note, however, that it relies on supervised learning. Random forests have been found useful for classifying asphalt, tiles, and grass from camera and laser information [15]. A different approach is the use of CNNs [17, 18]; usually, no preprocessing steps are performed, and raw RGB images are the only inputs to the network. One disadvantage is the large amount of data needed to train this kind of network.
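To show the SVM idea without external dependencies, the sketch below trains a minimal linear SVM by sub-gradient descent on the hinge loss, on made-up 2-D "texture features" for two terrain classes. Real systems use library solvers (and often kernels) on much higher-dimensional features; the data and hyperparameters here are assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, reg=0.01, epochs=200):
    """Minimal linear SVM trained with sub-gradient descent on the
    hinge loss; labels must be +1 / -1. Illustrative, not a
    production solver."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:          # margin violated
                w += lr * (yi * xi - reg * w)
                b += lr * yi
            else:                              # only regularise
                w -= lr * reg * w
    return w, b

# Hypothetical 2-D texture features: grass (+1) vs asphalt (-1).
X = np.array([[2.0, 2.5], [2.2, 2.0], [-1.5, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print(pred)
```

The supervised nature of the method is visible here: every training feature vector must carry a terrain label, which is exactly the annotation cost the text points out.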

### **2.5 Mapping of the environment**

Mapping produces a digital representation of the environment, which helps in deciding a safer path to follow. Usually, two- and three-dimensional (2D and 3D) information is used. Creating 3D maps can be computationally expensive and can increase processing time. Some approaches use a priori maps, comparing real-time readings with previous data. The main disadvantage of a priori maps is that the environment changes; off-road scenes in particular rarely keep the same characteristics over time, owing, for example, to vegetation growth.

Some representations used in this task are superpixels, stixels, and 3D primitives [19]. In a pixel-based representation, each pixel is a separate entity, so in high-resolution images the complexity is higher. Superpixels are groups of pixels that reduce this complexity; they are obtained by segmenting the image into small regions that are similar in color and texture. Stixels are a medium-level representation of 3D traffic scenes intended to bridge the gap between pixels and objects; they consist of a set of rectangular sticks standing vertically on the ground to approximate surfaces. 3D primitives are blocks of basic 3D geometric forms such as cubes, pyramids, and cones, among others.
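The superpixel idea (grouping pixels that are close in both appearance and position) can be sketched as k-means over combined intensity-and-coordinate features. This is a naive stand-in for algorithms such as SLIC, with deterministic corner initialisation and a grayscale toy image as assumptions.

```python
import numpy as np

def simple_superpixels(gray, k=2, spatial_weight=0.5, iters=10):
    """Naive SLIC-like superpixels: k-means on (intensity, x, y)
    features, so pixels group by similar colour AND proximity.
    Centres are initialised from the first and last pixels."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([gray.ravel().astype(float),
                      spatial_weight * xs.ravel().astype(float),
                      spatial_weight * ys.ravel().astype(float)], axis=1)
    centers = feats[[0, len(feats) - 1]].copy()      # deterministic init
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():                  # recompute centres
                centers[c] = feats[labels == c].mean(axis=0)
    return labels.reshape(h, w)

# Left half dark, right half bright -> two coherent superpixels.
img = np.zeros((4, 8), dtype=np.uint8)
img[:, 4:] = 200
labels = simple_superpixels(img)
print(labels)
```

Each resulting group can then be treated as one entity in later modules, which is precisely how superpixels cut the complexity of per-pixel reasoning.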

With the 2D and 3D information obtained from LIDAR and other sensors, the system can build a sense of the geometric structure around the vehicle. One way to map the environment is by using semantic segmentation combined with CNNs [20]. Combining neural networks with other approaches yields more robust methods than approaches that use a single algorithm. Nevertheless, estimating the pose of the lasers, which is required for the proper registration of the range measurements, is sometimes difficult. For this reason, Parra-Tsunekawa et al. [21] proposed using an extended Kalman filter to estimate in real time the instantaneous pose of the vehicle and the laser rangefinders, considering measurements acquired by different sensors.
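A minimal sketch of one extended Kalman filter cycle for a planar vehicle pose is given below. The unicycle motion model, the direct position measurement, and all noise values are illustrative assumptions; the filter in [21] fuses more sensors and also estimates the rangefinder poses.

```python
import numpy as np

def ekf_step(x, P, u, z, Q, R, dt=0.1):
    """One predict/update cycle of an extended Kalman filter for a
    planar pose x = (px, py, theta), odometry u = (v, omega), and a
    direct (px, py) position measurement z."""
    px, py, th = x
    v, om = u
    # Predict: unicycle motion model, with its Jacobian F.
    x_pred = np.array([px + v * np.cos(th) * dt,
                       py + v * np.sin(th) * dt,
                       th + om * dt])
    F = np.array([[1, 0, -v * np.sin(th) * dt],
                  [0, 1,  v * np.cos(th) * dt],
                  [0, 0, 1]])
    P_pred = F @ P @ F.T + Q
    # Update: linear measurement model H picks out the position.
    H = np.array([[1.0, 0, 0], [0, 1.0, 0]])
    y = z - H @ x_pred                     # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.zeros(3), np.eye(3)
Q, R = 0.01 * np.eye(3), 0.05 * np.eye(2)
x, P = ekf_step(x, P, u=(1.0, 0.0), z=np.array([0.12, 0.0]), Q=Q, R=R)
print(np.round(x, 3))
```

Because prediction and update run per time step, the pose estimate is available at sensor rate, which is what allows the range measurements to be registered correctly in real time.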
