Practical Insights on Automotive SLAM in Urban Environments

*Piotr Skrzypczyński*

#### **Abstract**

This chapter tackles the issues of simultaneous localization and mapping (SLAM) using laser scanners or vision as a viable alternative to the accurate modes of satellitebased localization, which are popular and easy to implement with modern technology but might fail in many urban scenarios. This chapter considers two state-of-the-art localization algorithms, LOAM and ORB-SLAM3 that use the optimization-based formulation of SLAM and utilize laser and vision sensing, respectively. The focus is on the practical aspects of localization and the accuracy of the obtained trajectories. It contributes to a series of experiments conducted using an electric car equipped with a carefully calibrated multisensory setup with a 3D laser scanner, camera, and a smartphone for collecting the exteroceptive measurements. Results of applying the two different SLAM algorithms to the data sequences collected with the vehicle-based multisensory setup clearly demonstrate that not only the expensive laser sensors but also monocular vision, including the commodity smartphone camera, can be used to obtain off-line reasonably accurate vehicle trajectories in an urban environment.

**Keywords:** SLAM, LiDAR, vision, GNSS, factor graph, evaluation

#### **1. Introduction**

In the last decade, we have witnessed a rapid development of methods, algorithms, and technologies that make vehicles more autonomous. This trend focuses on selfdriving cars but includes also public transportation vehicles and advanced driver assistance systems. In all cases, the knowledge of the vehicle's pose and the availability of an internal world model are enabling conditions for efficient task and motion planning, reasoning about the environment's semantics, and man–machine interaction.

Although the most used localization solution in automotive applications is the global navigation satellite system (GNSS), the availability of GNSS signal is limited in many practical scenarios, such as driving through a tunnel, maneuvering in an underground parking lot, or navigating among tall buildings in the downtown. Therefore, simultaneous localization and mapping [1] and visual odometry (VO) [2] are considered a complementary means of yielding vehicle pose estimates that enable safe navigation of autonomous or semi-autonomous vehicles. While there is a large body of research on SLAM and VO, there are still many open issues when it comes to more

complicated yet more practical cases, including 3D perception and lifelong map learning [3].

The two sensing modalities that dominate automotive SLAM are passive cameras and 3D laser scanners, known as LiDARs (light detection and ranging). Both classes of sensors have important advantages, but also some limitations. Active 3D laser sensors are well-suited to work under any lighting conditions, while passive cameras struggle with poorly illuminated scenes and sudden lighting changes. LiDAR sensors provide reasonably dense depth images within the range of tens, or even hundreds of meters. These data make it possible to build dense maps of the environment, while passive vision SLAM systems usually rely on sparse maps of point features. However, passive cameras are certainly the most affordable and easy-to-use sensors that can be deployed at minimal cost, while visual SLAM systems can achieve satisfying accuracy of trajectory estimation [4].

This chapter presents an experimental study on trajectory estimation accuracy by two popular open-source SLAM systems in the context of navigation for autonomous vehicles in urban environment. It contributes a unique set of experiments conducted using an autonomous electric vehicle equipped with a rich set of exteroceptive sensors. An important part of these experiments is the ground truth trajectories collected using an accurate RTK-GPS (real-time kinematics global positioning system). These data allow us to test two state-of-the-art SLAM solutions: LiDAR odometry and mapping (LOAM) [5] and ORB-SLAM3 [4] using two sensing modalities: LiDAR and passive vision, respectively.

Additionally, while performing these experiments we collected also images from a camera of a commodity smartphone attached to the windshield of the vehicle. This low-cost camera serves as a test bed for minimal-cost localization in urban environments, as it can be attached to any vehicle.

The aim of the presented research is to answer the question of how accurate are the trajectories obtained using different SLAM solutions and from different sensing modalities in scenarios imitating urban driving but under full control of an accurate GNSS solution providing precise ground truth trajectories.

The collected data are processed off-line, as often SLAM is used to obtain reference positions for data collected by a vehicle, which is the case of the CityBrands project, the presented research is part of, where trajectories are collected to identify locations of billboards and other advertisement installations. One of the purposes of this project is to verify the thesis that it is possible to obtain vehicle pose estimates of a local translational error not exceeding 1 m using only a smartphone's camera as a sensor.

### **2. Concept and implementation of experiments**

The experiments were carried out using a set of integrated sensors attached rigidly to a common frame and mounted on a vehicle. A Melex electric vehicle (work car) was used during the experiments. The rigid frame with the sensors was mounted on its roof, and data was collected by a computer mounted in the luggage compartment, except for data from a smartphone, stored directly in its memory. The basic component is an aluminum frame with a GNSS receiver and an Ouster OS1-128 Gen2 3D laser scanner with a range of 150 m and a 360° horizontal field of view. Other components of the system: a Basler acA1440-220um USB3 camera with a Basler C125-0418-5M F1.8 f4mm lens and an AHRS (attitude and heading reference system) Xsens MTi-30-2A8G4 sensor (**Figure 1A**).

*Practical Insights on Automotive SLAM in Urban Environments DOI: http://dx.doi.org/10.5772/intechopen.108262*

**Figure 1.**

*Melex electric vehicle with a sensory system configured for the experiments: 1 – Camera, 2 – LiDAR, 3 – IMU, 4 – GNSS, 5 – Smartphone (A), a smartphone visible behind the vehicle's windshield (B), and a close-up of the main sensors on the roof (C).*

The whole sensor set was calibrated both in terms of internal camera parameters (intrinsic) and in terms of external parameters (extrinsic) defining transformations between sensor coordinate systems. All these onboard exteroceptive sensors were integrated under Linux using robot operating system (ROS) [6] with ROS nodes deployed for each of the sensors, writing to defined ROS topics, and finally, saving the data using the flexible "rosbag" format. The ROS system provided also time synchronization to all these nodes by proper time stamping of the messages written to ROS topics and then to rosbag files.

Including in the multisensory system, a Samsung Galaxy A20 smartphone (**Figure 1B**) used as a second camera requires mutual calibration of the external parameters (translation and rotation) of the smartphone's camera relative to the Basler camera (**Figure 1C**). Since the smartphone runs Android, it is not connected to the system as a ROS node and is not subject to system time synchronization. The images from the smartphone's camera were collected using a dedicated Android application, as the default Samsung app that supports the camera on the A20 phone does not allow setting a specific focus value (locking autofocus). With automatic settings of the camera's focus and white balance, it was not possible to calibrate it accurately.

#### **2.1 Calibration**

Calibration is essential in a multi-sensor system that is dedicated to provide data for comparison of different methods and sensing modalities. In our case, calibration was performed offline, on data sequences that were collected before starting the tests, applying a number of different tools.

The main software package used to calibrate the sensors was Kalibr [7], a system for calibration of multi-camera and camera-IMU (inertial measurements unit) systems. Kalibr is integrated with ROS, which made many operations on data stored in rosbag files much easier. The entire calibration process was accomplished in several steps:

1.Calibration of the Basler camera parameters using Kalibr.

2.Calibration of the Basler camera with the Xsens AHRS (which is a type of IMU), implemented with Kalibr and the frame with sensors removed from the vehicle,

as the procedure required to move the camera/IMU system in order to produce appropriate data.


Surprisingly, the last two tasks turned out to be most difficult. An attempt to use Kalibr for these calibrations failed, probably because the smartphone's camera has a rolling-shutter frame, and the Kalibr procedure was unable to find some of the corners of the checkerboard pattern used for calibration (**Figure 2A**). The resulting camera parameters, although looking correct, were inappropriate for the ORB-SLAM3 system, causing problems with the initialization of the system **(Figure 2B** and **C**). Finally, the Galaxy A20 camera was calibrated using the classic approach [9] and software that allowed us to manually point the calibration corners on the checkerboard. These parameters allow ORB-SLAM3 to initialize correctly (**Figure 2D**). The resulting calibration parameters of the relative translations and rotations of the sensors were used to build a TF tree in the ROS environment (**Figure 2E**).

### **2.2 Ground truth acquisition**

The solution for obtaining the reference (ground truth) trajectory is a satellite navigation system. In order to achieve the centimeter-level accuracy required in the

#### **Figure 2.**

*Calibration of the multisensory system on the vehicle: Calibration checkerboard shown to the galaxy A20 smartphone's camera (A), initialization of the ORB-SLAM3 system on an image from galaxy camera (B), ORB-SLAM3 incorrectly initialized because of improper calibration of the camera (C), point features tracked correctly by ORB-SLAM3 when the camera is correctly initialized (D), at the ROS RViz view of the transformations tree in the fully calibrated multisensory system (E).*

application under consideration, it is necessary to transmit to the RTK-GPS system corrections coming from a ground station with a known geographic position [10]. The setup used in the experiments applied an Ublox C099-F9P sensor with a ZED-F9P module for RTK-GPS, which can provide a localization accuracy of 2 cm with a position frame rate of 10 Hz when operating in L1/L2 modes. With access to commercial RTK corrections transmitted via the Internet and LTE modem, it was not necessary to establish a separate reference station at a known position.

## **3. SLAM problem**

Over the past few years, traditional SLAM algorithms based on filtering (extended Kalman filter, particle filter) [11] have been replaced by methods based on optimizing graph representations of the SLAM problem [12], most often in the form of a sensor poses graph [13] or factor graph [14] binding multiple variables subject to optimization, describing both the state of the agent (vehicle/sensor) and a map of the environment in the form of features.

As shown in [12], the SLAM problem can be viewed in terms of factor graph optimization. The factor graph formulation can be applied to many forms of SLAM and localization, regardless of the used sensing modality (e.g., to WiFi fingerprints [15]), and regardless of the used features/landmarks, for example, visual point features [4, 16] or LiDAR-based segment features [17]. The most general form, a method known in the computer vision field as structure from motion (SfM), assumes that all historical sensor positions are related to observed features through measurements. These relationships can be thought of as a random Markov field with variables *x* describing the sensor poses and variables *f* describing the features (**Figure 3A**). However, if SLAM is used not for a limited scene, but a large area, the factor graph grows when new positions are added when the vehicle moves, while new features appear when the vehicle/sensor explores new parts of the environment.

Processing a very large factor graph should be avoided due to computational performance limitations. For this reason, pose-based graph SLAM systems marginalize all historical features (and thus do not have an explicit map) and keep only the pose graph with constraints between them (**Figure 3B**). These constraints are derived from the estimated movement of the vehicle and are formed locally, within a certain window along the trajectory, solving a problem analogous to visual odometry. Constraints (factor graph arcs) can also represent loop closures, that is, relationships between locations that are distant in the position chain but lie close enough spatially to be observed simultaneously. Constraints on loop closures are essential to the

#### **Figure 3.**

*Markov random field for the SfM problem (A). Factor graph for position-only SLAM (B) and factor graph for bundle adjustment SLAM (C). Larger blue circles are vehicle/sensor positions, green circles are features considered in optimization, empty (white) circles are those features and poses that have been marginalized, graph arcs are measurements, and rectangles denote factors from pose-to-pose constraints (red) or pose-to-feature constraints (black).*

successful implementation of the SLAM task, as they allow the system to reduce trajectory drift compared to a comparable visual odometry system. The constraints are represented on the graph as factors, shown in **Figure 3B** by small rectangles. In the pose graph formulation of SLAM [13] direct observations of the features are not used, as these measurements are utilized only to compute the constraints between agent poses in the graph. In pose graph optimization new pose values are estimated that minimize the following error:

$$\underset{\mathbf{T}\_{\mathbf{T}\_{i},\ldots,\mathbf{T}\_{s}}}{\operatorname{argmin}} \left( \sum\_{i} h(\mathbf{T}\_{i}, \ \mathbf{T}\_{i+1}) \boldsymbol{\Omega}\_{i,i+1} h(\mathbf{T}\_{i}, \ \mathbf{T}\_{i+1}) + \sum\_{j} h(\mathbf{T}\_{l(j),1}, \ \mathbf{T}\_{l(j),2}) \boldsymbol{\Omega}\_{l(j)} h(\mathbf{T}\_{l(j),1}, \ \mathbf{T}\_{l(j),2}) \right), \tag{1}$$

where **T**1, … ,**T***<sup>n</sup>* are the optimized sensor poses, *h* **T***<sup>i</sup>* ð Þ , **T***<sup>i</sup>*þ<sup>1</sup> is the pose error between the *i*-th and ð Þ *i* þ 1 -th poses from the measurement of the visual odometry process, and *<sup>h</sup>* **<sup>T</sup>***l j*ð Þ , 1 , **<sup>T</sup>***l j*ð Þ , 2 � � is the pose error of the *<sup>j</sup>*-th loop closure measurement defined as the error between *l j*ð Þ , 1 -th and *l j*ð Þ , 2 -th poses related by this loop closing event.

In contrast, a full SLAM system uses a nonlinear estimation technique [18] to optimize the constraint graph, which, however, has a much larger number of factors directly linking feature positions to sensor poses (**Figure 3C**). Also not all poses are used directly—only those considered as keyframes are stored in the map. The choice of keyframes strongly depends on the used sensing modality and the processing of features but is crucial to both the efficiency and robustness of SLAM [19]. An edge represents the resulting constraint from a sensor measurement between the *i*-th position and *j*-th feature. The uncertainty of each constraint is represented by its information matrix Ω, which can be determined by inverting the covariance matrix of a particular measurement [19]. This approach is similar to the bundle adjustment (BA) method used to efficiently solve the SfM problem [20] but can be applied in real time and used for large optimization problems (thus large maps) due to its computational efficiency [21]. The optimization step can be described mathematically as:

$$\underset{\mathbf{T}\_{\mathbf{T}},\mathcal{F}}{\operatorname\*{argmin}} \left( \sum\_{i=1}^{n} \sum\_{j \in \mathcal{F}\_{i}} f(\mathbf{f}\_{i}, \ \mathbf{T}\_{\mathbf{j}}) \boldsymbol{\Omega}\_{i,\mathbf{j}} f(\mathbf{f}\_{i}, \ \ \ \ \mathbf{T}\_{\mathbf{j}}) + \sum\_{k} h(\mathbf{T}\_{l(k,\ 1)}, \ \ \ \ \ \ \mathbf{T}\_{l(k,\ 2)}) \boldsymbol{\Omega}\_{l(k)} h\left(\mathbf{T}\_{l(k,\ 1)}, \ \ \ \ \ \ \ \mathbf{T}\_{l(k,\ 2)}\right) \right),\tag{2}$$

where T is a set of all optimized sensor poses, F is the set of all optimized features, F*<sup>i</sup>* defines the sets of indices of all features visible from *i*-th sensor pose, and *f* **f***i*, **Tj** � � is the error function of the measurement between the *i*-th feature and *j*-th pose. Note that the part of the optimization problem related to loop closing is described the same way as in (1), with *<sup>h</sup>* **<sup>T</sup>***l k*ð Þ , 1 , **<sup>T</sup>***l k*ð Þ , 2 � � being the pose error of the *<sup>k</sup>*-th loop closure measurement. The SLAM formulations (1) and (2) take the notion of topological loop closing, as defined in [19], which requires detecting the event of revisiting an already mapped area and constructing an additional pose-to-pose constraint by matching the reobserved features to the known part of the map. However, [19] points out that there is also geometric loop closing possible, where the vehicle/sensor does not need to detect the revisit event but matches the observed features to the known map using local criteria. These local criteria differ depending on the type of features. In LiDARbased SLAM with planar or linear features, geometric distances are applied to

discriminate incorrect matches [22], while in visual SLAM with point features [16], robust matching of descriptors can be employed [23]. Anyway, this geometric loop closing is only local, because if a very large loop has to be closed the accumulated pose drift may prevent our agent from executing the correct matching of the local features to the map. This is not the case for the loop closing formulation shown in (1) and (2), because in topological loop closing the respective locations (poses) are matched using global, appearance-based methods [24]. Note that the conceptual diagram in **Figure 3C** shows both ways of loop closing on a single graph.

#### **3.1 LOAM system**

The open-source LiDAR odometry and mapping is a state-of-the-art solution that ranked first in car localization based on the KITTI benchmark [25]. The processing of laser scans in LOAM is divided into several stages for feature extraction, odometry, mapping, and integration. The system operates on detected planar and edge features, which make it possible to reduce the extensive 3D laser scanner data stream. These features are then matched between the current and previous scans. In this step, the LiDAR sensor pose (and indirectly the vehicle pose) is estimated using constraints set by the associated features. The LOAM system assumes that scanning takes place continuously while the scanner is in motion. Therefore, after the odometry stage, the measured points are projected onto the coordinate system associated with the starting point of the scanning process, and a geometrically corrected point cloud is obtained. This corrected cloud is the input to the mapping procedure, which optimizes the match between the new cloud from the sensor and the global map (**Figure 4A**), giving a more accurate pose estimate than that obtained in the odometry step. Map optimization is performed less frequently (1 Hz) than odometry (10 Hz) because this step is more computationally intensive. The next localization step combines the pose from the map optimization with the latest available pose from laser odometry to provide the best available sensor pose estimate at the time. Note that LOAM does not use explicit (i.e., topological) loop closing. Although it is possible to implement this type of loop closing in a LiDAR-based SLAM with planar and line features [17], LOAM uses an approach that can be rather seen as instantaneous, geometric loop closing by matching the local features to the already known part of the map.

The LOAM system is set up to work with the Velodyne VLP-16 scanner; however, the scans from the Ouster OS1-128 sensors were used in the described study, modifying the program code accordingly. The modifications made included parameters

#### **Figure 4.**

*Visualization of the LOAM global point cloud map during one of the experiments (A), the point features extracted by ORB-SLAM3 in its internal map representation (B), and detected directly on an image (C).*

related to different horizontal and vertical observation angles, as well as necessary changes in assumptions about how points are acquired in each scan to correctly represent measurements at the beginning of the scan based on the current motion estimate.

#### **3.2 ORB-SLAM3 system**

The visual SLAM system ORB-SLAM3 is a variant of the ORB-SLAM algorithm presented in Ref. [16]. It was developed as a monocular SLAM using the concept of point feature map and bundle adjustment optimization. The third version, available as open source software [4], can use data from a single camera (monocular SLAM), stereo camera, or an RGB-D camera providing depth measurements. Constraint graph optimization is implemented in ORB-SLAM3 using the g<sup>2</sup> o library [18]. The system architecture is divided into a front-end and a back-end storing the map (**Figure 4B**). ORB-SLAM3 retains all the features of monocular SLAM, also taking into account features devoid of depth information and optimizing the map only on the basis of feature reprojection errors in the images. In addition, ORB-SLAM3 fully implements loop closure based on scene view recognition, using DBoW2 bag-of-visual-words algorithm [24]. This makes it possible to efficiently close loops of any size, not just local loops. Note that although the detection of loop closures in ORB-SLAM3 is appearance-based, the optimization in factor graph after closing a loop is accomplished on the level of features rather than pose-to-pose constraints. In ORB-SLAM3, the fully multi-scale ORB features [26] are used throughout the system: for tracking sensor motion (**Figure 4C**), matching the current perception to the map (creating constraints), and closing loops. ORB-SLAM3 uses optimization with the factor graph concept at several stages of processing, not only for global optimization, as shown in (2), but frequently performs local bundle adjustment over several keyframes and their associated features to obtain local map reconstruction. A robust Huber kernel [27] can be used for optimization with reprojection error to reduce the influence of spurious associations.

During tracking, a simple model of camera motion is used, assuming motion at a constant speed. This model can be considered sufficient in light of the scenarios of the experiments conducted, in which the Melex vehicle moved at speeds from 10 to 30 km/h. Although ORB-SLAM3 can work also as visual-inertial SLAM, which makes it much more robust, for the sake of comparison between results from the two cameras used in the presented setup the application of ORB-SLAM3 was restricted to monocular SLAM only.

A more challenging aspect of applying ORB-SLAM3 in the conducted experiments is the use of a smartphone as a camera. The developers of ORB-SLAM3 do not provide a separate implementation for mobile devices and do not support Android. However, as offline processing of the collected data is assumed, only the quality of the images and the availability of correct calibration data are important, because the images from Samsung A20 are further processed on an x86 computer under Linux.

#### **4. Methodology for assessing the accuracy of SLAM**

With regard to the VO and SLAM algorithms, it is possible to talk about the accuracy of estimation of two components of the created model: the sensor trajectory and the map itself. Both of these components are of practical importance in the tasks

*Practical Insights on Automotive SLAM in Urban Environments DOI: http://dx.doi.org/10.5772/intechopen.108262*

of navigation and recognition of the environment, and they are interrelated since the accuracy of the trajectory estimation determines the accuracy of the registration of the position of the features that form a map of the environment.

A standard sensor for providing precise reference trajectories in outdoor SLAM is GNSS, which was used in the form of integrating the RTK-GPS in the sensor test bed on the Melex vehicle. In order to assess the accuracy of a trajectory estimated to be either visual or LiDAR SLAM the available ground truth trajectory needs to be compared to the estimated one using a proper method. Such methods have been proposed in Ref. [28] and now are widely used in robotics localization research. These are methods for evaluating absolute trajectory error (ATE) and relative pose error (RPE) of sensor positions. The ATE value is the Euclidean distance between the corresponding points of the estimated trajectory and the ground truth trajectory. Thus, the ATE metric allows us to determine how far the estimated position is from the reference position at a certain moment in time on the estimated trajectory. The ATE metric is calculated for the entire trajectory as root mean squared (RMS) error, which allows convenient comparison of results for different SLAM algorithms for the same test data set. Assuming that two trajectories are given: a ground truth and an estimated one with the same number of positions *n*, and the positions of the sensor are given as homogeneous matrices, the ATE metric for the *i*-th frame on the trajectory can be determined as:

$$\mathbf{E}\_i^{\rm ATE} = \left(\mathbf{T}\_i^{\rm gt}\right)^{-1} \mathbf{T}\_i,\tag{3}$$

and then the ATE value for the entire trajectory is obtained by calculating the RMS error value for all frames. Note that in order to obtain a correct ATE, the trajectories must be converted to a common global coordinate system before the calculation is performed. The need for geometric matching occurs in the case of a trajectory estimated by visual odometry or a monocular visual SLAM algorithm because for algorithms of this kind the trajectory scale *s* is an unobservable variable [19]. For this reason, the estimated trajectory may exactly reproduce the shape of the reference trajectory, but have a different scale. It is possible to address the scale problem in monocular systems if data from additional sensors are available, in particular inertial measurements from an IMU or AHRS [29]. However, this possibility was not exploited in this project because the stream of images from the smartphone could not be synchronized accurately in time with the AHRS measurements, while the aim was also to compare the trajectories obtained using visual SLAM from the two different types of cameras mounted on the Melex. On the other hand, the correct scale can be easily retrieved by processing the trajectory offline and using the Umeyama algorithm [30] with only several GNSS global pose measurements as reference points.

In the case of LiDAR SLAM, the scale drift problem does not occur. Trajectory matching is performed offline by finding the transformation that minimizes the distance between two groups of rigidly connected points representing these trajectories (vehicle poses), the most commonly used method to accomplish this task is the Umeyama algorithm [30].

On the other hand, the RPE metric determines the difference in transformation (separately the rotational and translational parts) that would exist after following the estimated trajectory and the reference trajectory independently for a given number of frames/images and then calculating the roto-translation between the estimated trajectory and its ground truth trajectory counterpart. The RPE metric for the *i*-th image frame is given by the formula:

$$\mathbf{E}\_{i}^{\rm RPA} = \left( \left( \mathbf{T}\_{i}^{\rm gt} \right)^{-1} \mathbf{T}\_{i+1}^{\rm gt} \right)^{-1} \left( \mathbf{T}\_{i}^{-1} \mathbf{T}\_{i+1} \right). \tag{4}$$

Determining the translational or rotational part, one gets the local translational or rotational error measure. The values of the RPE metrics for the entire trajectory are calculated from the corresponding part (translational or rotational) as the RMS error value for all nodes.

The main measure of sensor position estimation error used in the process of evaluating the accuracy of the tested algorithms will be ATE for the *i*-th frame. The aggregated measure of ATE (RMS) is of lesser importance in this context, since it is not assumed that the resulting maps will be globally accurate, but that the user can get reasonably small errors, for example, translational error smaller than 1 meter, in the coordinate system associated with the local map. This ensures correct recognition of the road structure by the vehicle, which in turn makes it possible to keep on the proper lane and to localize some objects in the vicinity of the road.

In this context, the translational part of the RPE metric (RPE*t*) is also useful, as, from the point of view of evaluating the results of the experiments performed, the main advantage of the RPE metric is that it can be directly applied to the evaluation of the trajectory estimated from a monocular camera, in the presented case both the Basler camera and the smartphone's camera. When using the ATE metric, such a possibility does not exist, because, for monocular SLAM, the estimated trajectory and the reference trajectory differ in scale.

Roll and pitch angles of the sensor are of lesser importance, as it is assumed that the vehicle moves on generally flat ground in the urban environment. However, the roll and pitch angles for the reference trajectory can be read from the AHRS relative to the gravity vector. This is used to ensure that the assumption of running on a reasonably flat surface is met for the collected trajectories.

#### **5. Experimental results**

Three fully documented experiments were conducted using a pre-calibrated set of sensors on an electric car. The vehicle traveled at speeds ranging from 10 to 30 km/h along various routes in the area of the Warta campus of Poznan University of Technology and the streets adjacent to this area. The routes were documented by plotting ground truth trajectories (obtained from RTK-GPS) on maps from the OpenStreetMaps service (**Figure 5**).

#### **Figure 5.**

*Ground truth trajectories of three experiments obtained from RTK-GPS and plotted on OpenStreetMaps: Exp. 1 – A single loop inside the campus (A), exp. 2 – Multiple loops inside the campus (B), and exp. 3 – A large loop partially outside the campus (C).*

*Practical Insights on Automotive SLAM in Urban Environments DOI: http://dx.doi.org/10.5772/intechopen.108262*


**Table 1.**

*Relative translational errors of the trajectory estimated by LOAM (LiDAR SLAM) and ORB-SLAM3 (visual SLAM) for all experiments.*

**Figure 6.**

*Plots of the ground truth and estimated trajectories for LOAM/ouster with visualization of the ATE values (red lines) for exp. 1 (A), exp. 2 (B), and exp. 3 (C).*

In each run, an estimated vehicle trajectory was obtained from the OS1-128 laser scanner data, the relative position error of which did not exceed 1 m with respect to the RTK-GPS data (**Table 1**). At the same time, all the estimated trajectories are topologically consistent and not distorted (**Figure 6**), also characterized by small absolute ATE errors (the red error lines are faint).

At the same time, the presented experiments confirmed the veracity of the main these—it is possible to obtain the localization of a vehicle equipped with a single passive camera with a relative accuracy of no worse than 1 m (**Table 1**) under conditions similar to the real operating conditions of urban navigation. The difference between the conditions of the experiment and the real conditions came down to the use of a vehicle moving slower than a typical car and the absence of heavy traffic of other vehicles (with the presence of single vehicles and pedestrians).

There are clearly significant differences in the accuracy of the trajectory estimated by ORB-SLAM3 depending on the scenario of a given experiment **Figure 7**. The decisive influence on the quality of localization is the number of point features observed and tracked over a sufficient number of frames. The absence of such features, for example, because of driving at a great distance from objects with distinct

#### **Figure 7.**

*Plots of the ground truth and estimated trajectories for ORB-SLAM3/Basler with visualization of the ATE values (red lines) for exp. 1 (A), exp. 2 (B), and exp. 3 (C).*

surface texture and/or the presence of unstable (weakly repeatable) features generated on trees, causes a sharp increase in local (relative) errors and, consequently, distortion of the estimated trajectory. In the case of the second experiment, in which the vehicle made several passes over the same terrain (loop), it can also be seen that the data from a single camera (monocular visual SLAM) does not allow in some cases to accurately close the loop and completely reduce the drift of the trajectory. In the consequence, the ATE values are much bigger for the vision-based ORB-SLAM3 than for the LiDAR-based LOAM, as shown in **Table 2**.

The differences in the accuracy of the trajectory estimated by ORB-SLAM3 from the low-cost camera of the smartphone measured by the relative measure (**Table 1**) are small, and the obtained values of relative translation errors (both RMS and maximum values) do not differ significantly from analogous values obtained when estimating the trajectory based on images from a professional camera (**Figure 8**).

In the case of experiments 2 and 3 for the Basler camera, the maximum value of the error estimated from the smartphone images is even significantly smaller (0.49 m vs. 1.71 m for the professional camera). This allows us to conclude that the smartphone camera under proper calibration is a sufficient sensor for feature-based monocular SLAM, despite having a rolling shutter and simplified lens. It should be noted that an advantage of the Galaxy A20 smartphone camera (and many others) over the Basler camera with an f = 4 mm lens is a wider angular field of view in the horizontal plane. This can increase the number of matching point features during more rapid changes in vehicle orientation. At the same time, however, the depth of field of the smartphone camera is noticeably smaller than for the classic camera configuration compared here, which in turn leads to the detection of fewer point features overall, especially at greater distances from the vehicle. The number of point features observed and tracked by a sufficient number of frames has a decisive impact on the quality of pose tracking (local localization). Lack of a sufficient number of features results in a sharp increase in local (relative) errors and, consequently, distortion of the estimated trajectory.


**Table 2.**

*Absolute errors of the trajectory estimated by LOAM (LiDAR SLAM) and ORB-SLAM3 (visual SLAM) for all experiments.*

#### **Figure 8.**

*Plots of the ground truth and estimated trajectories for ORB-SLAM3/galaxy with visualization of the ATE values (red lines) for exp. 1 (A), exp. 2 (B), and exp. 3 (C).*

In the case of experiments 2 and 3, during which the vehicle made several loops returning to the same areas, it can also be seen that the data from the Galaxy A20 smartphone camera does not allow, in some cases, for accurate loop closure and trajectory drift reduction. This is particularly evident in **Figure 8C**—ineffective loop closing, probably due to too few features, led to a very large ATE error for the long trajectory (15,288 images were processed), despite the small relative errors.
