**4. Datasets and evaluation metrics**

Datasets are essential for training and evaluating algorithms. The literature offers several well-known datasets for autonomous driving projects; some of the most popular are PASCAL VOC [26], the KITTI Vision Benchmark [27], MS-COCO [28], ImageNet [29], Berkeley DeepDrive [30], nuScenes [31], Oxford RobotCar [32], Waymo Open [33], and Cityscapes [34]. There are also smaller datasets such as the Freiburg Forest Dataset (FFD) [18], the Hand-Labeled DARPA LAGR Datasets [35], and the NREC Agricultural Person-Detection Dataset [36]. Every dataset has a different structure, but in general all provide one split for training and others for evaluation. In autonomous driving, the majority of datasets contain images, but some also provide LiDAR data, and only a few include other modalities such as depth, near-infrared, radar, GPS, or vegetation indices.

One of the most common metrics for evaluating classification algorithms is accuracy [Eq. (1)], defined as the number of correct predictions over the total number of predictions made. In binary classification, it can also be calculated in terms of positives and negatives [Eq. (2)]. With imbalanced data, however, accuracy can be misleading; in those cases, precision [Eq. (3)] and recall [Eq. (4)] are better metrics. The first answers the question: what proportion of positive identifications was actually correct? The second answers: what proportion of actual positives was identified correctly?

$$Accuracy = \frac{Number\ of\ correct\ predictions}{Total\ number\ of\ predictions\ made} \tag{1}$$

$$Accuracy = \frac{True\ Positives + True\ Negatives}{True\ Positives + True\ Negatives + False\ Positives + False\ Negatives} \tag{2}$$

$$Precision = \frac{True\ Positives}{True\ Positives + False\ Positives} \tag{3}$$

$$Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives} \tag{4}$$
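
As an illustration, the following minimal Python sketch computes these three metrics from raw confusion-matrix counts (the function and variable names are ours, chosen for this example). The toy numbers show why accuracy is misleading on imbalanced data while precision and recall are not:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (2): correct predictions over all predictions made."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Eq. (3): what proportion of positive identifications was correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (4): what proportion of actual positives was identified."""
    return tp / (tp + fn)

# Hypothetical imbalanced test set: 10 positives, 990 negatives.
tp, fn = 5, 5      # only half of the positives are found
tn, fp = 985, 5    # plus a few false alarms
print(accuracy(tp, tn, fp, fn))  # 0.99 -> misleadingly high
print(precision(tp, fp))         # 0.5
print(recall(tp, fn))            # 0.5
```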

A common metric used in object detection, road detection, and terrain perception is the Jaccard index, also known as Intersection over Union (IoU) [Eq. (5)]. This metric measures the similarity between two finite sets, in this case, the ground truth and the prediction. Depending on the task, the ground truth can be a bounding box or a mask. IoU is defined as the area of overlap between the ground truth and the prediction divided by the area of the union of both. The metric range goes from 0 to 1 (0–100%), where 0 means no overlap at all and 1 is a perfect overlap of masks.

$$IoU = \frac{Area\ of\ overlap}{Area\ of\ union} \tag{5}$$
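
For axis-aligned bounding boxes, IoU can be computed directly from the box coordinates. The sketch below is one possible implementation, assuming boxes are given in `(x1, y1, x2, y2)` corner format (the function name is ours):

```python
def iou(box_a, box_b):
    """Eq. (5): area of overlap / area of union for two axis-aligned
    boxes in (x1, y1, x2, y2) corner format."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ground truth vs. prediction with partial overlap:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~= 0.143
```

For segmentation masks, the same ratio is computed per pixel: the intersection is the count of pixels labeled positive in both masks, and the union is the count of pixels labeled positive in either.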

There are other forms of evaluation, proposed by various researchers, that are not standardized metrics. Some papers on obstacle avoidance evaluate not only whether the vehicle hits an obstacle but also the distance the AV keeps from objects. For path planning, evaluation can be quite subjective, since there is no exact and unique path to follow. An important consideration is the mechanical characteristics of the AV: not all vehicles can traverse the same roads, e.g., an all-terrain vehicle compared to a commercial automobile or a military vehicle. Mei et al. [8] proposed a metric called mechanical traversability, defined as the percentage of extracted road pixels that are mechanically traversable. Another form of evaluating an AV is proposed by Bojarski et al. [23], which measures the percentage of autonomy of the vehicle [Eq. (6)]:

$$autonomy = \left(1 - \frac{(number\ of\ interventions) \cdot 6\ seconds}{elapsed\ time}\right) \cdot 100 \tag{6}$$

They assumed that a human intervention would require 6 seconds for the driver to take control of the vehicle, re-center it, and restart the autonomous mode. The elapsed time is the total time, in seconds, of the simulated test.
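
As a sketch, Eq. (6) translates directly into a short function (the names and the example values below are ours, not from [23]):

```python
def autonomy(num_interventions, elapsed_seconds, penalty_seconds=6):
    """Eq. (6): percentage of time the vehicle drove itself, where each
    human intervention is charged a fixed penalty (6 s in [23])."""
    return (1 - num_interventions * penalty_seconds / elapsed_seconds) * 100

# Example: 2 interventions during a 600-second test drive
print(autonomy(2, 600))  # 98.0 -> the vehicle was autonomous 98% of the time
```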
