**5. Experimentation and results**

The approach described in this work was fully implemented and tested with real data. The DI-D SLAM with pseudo-stereo feature initialization was programmed in MATLAB® to test and evaluate it. Several sequences were captured in semi-structured environments using a robotic platform and wearable headgear.

#### **5.1. Experimental system implementation**

The sequences were reduced to a resolution of 720 × 480 pixels and grayscale color, shortening the computational effort for the image processing step. Each sequence corre‐ sponds to a collaborative exploration of the environment at low speed, including a human and a robotic platform, each one equipped with the monocular sensors assumed earlier, Ch and Cr, respectively. The data collected include monocular sequences, odometry from the robot, estimation of the human pose with respect to the robot, and the orientation of the camera Ch. During the sequences, the camera Cr whose sequence would be used for the SLAM process was deployed looking forward, towards the advance direction. This kind of movements produces singularities in estimation, as the visual axis of the cameras is aligned with the movement, producing 'tunnel blindness', where the elements near the centre of the captured images produce negligible parallax, and thus, only variations in scale are percep‐ tible in short intervals.

The robot functions were performed by a robotic platform based on the Pioneer 3 AT (see **Figure 6**). The platform runs a ROS distribution over an Ubuntu 14.04 OS. The platform is equipped with a pair of laser range finders Leuze RS4-4 and a Logitech C920 webcam, able to work up to 30 frames per second (fps) at a resolution of 1080p. The sensors worn by the human are deployed on a helmet, including another Logitech webcam camera and an Xsens AHRS. All the data have been captured and synchronized through ROS in the robotic platform hardware. The ROS middleware provides the necessary tools to record and time-stamp the data from the sensors connected to the platform.

**Figure 6.** The robotic platform (based on the Pioneer 3 AT) used to capture the data sequences to test the described approach.

To estimate the pose of Ch, orientation data from the IMU are combined with the approximate pose of the human, estimated with the range finders [17]. The final position of the camera is computed geometrically as a translation from the estimated position of the Atlas and Axis vertebrae (which allow most of the freedom of movement of the head). These vertebrae are considered to be at a vertical axis over the person position estimated with the range finders, with height modeled individually for each person. In this work, it is assumed that the envi‐ ronment is a flat terrain, easing the estimation process.

The pose of the camera Ch with respect to the Cr is not assumed to be perfectly known. Instead, it is considered that a 'noisy' observation of the pose of Ch with respect to Cr is available by means of the methodology described above. The inherent error to the observation process is modeled, assuming that the observation is corrupted by Gaussian noise. The value of the parameters used to model the inaccuracies for computing the pose of Ch was obtained statistically by comparing actual and estimated values. It is also important to note that an alternate method could be used for computing the relative pose of Ch, for instance, using different sensors.

### **5.2. Experiments and results**

movements produces singularities in estimation, as the visual axis of the cameras is aligned with the movement, producing 'tunnel blindness', where the elements near the centre of the captured images produce negligible parallax, and thus, only variations in scale are percep‐

The robot functions were performed by a robotic platform based on the Pioneer 3 AT (see **Figure 6**). The platform runs a ROS distribution over an Ubuntu 14.04 OS. The platform is equipped with a pair of laser range finders Leuze RS4-4 and a Logitech C920 webcam, able to work up to 30 frames per second (fps) at a resolution of 1080p. The sensors worn by the human are deployed on a helmet, including another Logitech webcam camera and an Xsens AHRS. All the data have been captured and synchronized through ROS in the robotic platform hardware. The ROS middleware provides the necessary tools to record and time-stamp the

**Figure 6.** The robotic platform (based on the Pioneer 3 AT) used to capture the data sequences to test the described

To estimate the pose of Ch, orientation data from the IMU are combined with the approximate pose of the human, estimated with the range finders [17]. The final position of the camera is computed geometrically as a translation from the estimated position of the Atlas and Axis vertebrae (which allow most of the freedom of movement of the head). These vertebrae are considered to be at a vertical axis over the person position estimated with the range finders, with height modeled individually for each person. In this work, it is assumed that the envi‐

The pose of the camera Ch with respect to the Cr is not assumed to be perfectly known. Instead, it is considered that a 'noisy' observation of the pose of Ch with respect to Cr is available by means of the methodology described above. The inherent error to the observation process is

tible in short intervals.

98 Recent Advances in Robotic Systems

approach.

data from the sensors connected to the platform.

ronment is a flat terrain, easing the estimation process.

The introduction of an auxiliary monocular sensor which can provide non-constant stereo information was proven useful. One of the weaknesses discussed earlier of the DI-D was the need to set an initial metric scale through synthetic feature, which has been removed. This grants more autonomy to the system, exploiting the implicit human–robot interaction without enforcing utilization of artificial landmarks. Besides, as the metric scale initialization can introduce more features into the initial state because it is not limited to the artificial landmark, the scale propagates in a smoother way with reduced drift on the local scale.

### *5.2.1. Visual odometry accuracy and scaling*

**Figure 7** shows results for two sample trajectories, with and without the utilization of the proposed non-constant stereo DI-D feature initialization approach, in blue and orange lines, respectively. The trajectory on **Figure 7** (left) was captured in an inner courtyard, with several seats and trees. This trajectory ran for 19 m, with two 90° turns, on a semi-structured environ‐ ment with plenty of objects within view that could be mapped. On the other side, the trajectory shown in **Figure 7** (right) was capture as a straight 10 m movement in a blind alley with painted walls with homogenously textured surfaces, reducing the chances to obtain robust features. These two sequences contained plenty the most disadvantageous characteristics for monocular SLAM: singular movements where parallax cannot be observed for the central region of the camera, quick changes in orientation and turning velocities, surfaces/environments showing low count of robust visual features, natural lightning and shadows, etc.

**Figure 7.** Trajectories estimated with classical DI-D monocular SLAM (orange plots) and with the new non-constant stereo DI-D approach for feature initialization (blue plots). Green line denotes robot ground truth; gray line denotes Ch ground truth.

The introduction of the pseudo-stereo initialization of features enables initialization of features with actual depth estimation instantly, without relying on heuristic or having a delay where data are being processed but not used in the estimation. Each of these situations has strong deterrents; for example, the heuristics used for depth initialization can vary between sequen‐ ces, or even different SLAM 'runs' of the same video sequence, accounting for the uncertainty in the prediction model and the feature selection process. When a feature is initialized with a delayed method after it has been seen, computational power is spent on estimating a landmark that likely will not be used and never introduced into the EKF map.

At the end, the proposed approach made the system more resilient, especially to quick view changes, such as turning, and long singular movements—front advance. These movements can be seen in **Figure 7**, left and right, respectively. During close turns, delayed monocular SLAM approaches have very little time to initialize features because the environment changes quickly and the features are observed for short periods. This produces a decrease in the number of initialized features that decreases the odometry estimation accuracy. At the end of the run, the uncertainty becomes so big that errors cannot be corrected, the EKF loses convergences, and the estimation process results become useless. **Figure 7** (left) shows how the two turns can greatly degrade the orientation estimation for a classic delayed SLAM method, while the proposed approach can track the turns much more closely, with less than half the error. On the other side, **Figure 7** (right) illustrates the issues of singular movements: The odometry scale is very hard to estimate for pure monocular methods, because features present reduced parallax. Not only the length of the trajectory is affected by this phenomenon, but the accuracy of the orientation estimation also becomes compromised due to the inability of the EKF to reduce uncertainty quickly enough.

### *5.2.2. Computational costs analysis*

The apparent increase in the computational effort that would suppose the utilization of the presented approach could be hard to justify within the field of filtering-based SLAM, which generally try to keep computational costs as low as possible.

In the considered sequence set, there were a total of 9527 frames for Cr. Although Ch had a paired frame for each one on the Cr sequences, the overlap only was found in 3380 of the pairs. This means increasing the visual processing and feature capturing costs on a 35.47% of the frames. Increasing the computational cost of the most demanding step in a third of iterations may look daunting, but there are few considerations. The technique rarely implies processing an additional full frame: The region where the overlap is interesting is predicted and modeled as a ROI into the Ch image, limiting the area to explore. Besides, the cost increase is bounded by the number of frames where it is applied, so, if there are enough features visible in the map, there is no need to execute the pseudo-stereo depth estimation.

Moreover, it is worth noting that newly proposed approach made less effort per feature to initialize it, as it can 'instantly' estimate landmarks on the process of being initially measured through parallax accumulation. This trades off with the fact that the pseudo-stereo initializa‐ tion can initialize with more frequency weak features which the delayed initialization would not been able to handle, and must be rejected during the data association validation step.

**Table 1** shows the features initialized by each approach and the tracking effort required until the initialization of the features is done. Note how the non-constant stereo DI-D feature initialization approach uses about 8% more features, but the effort used to initialize them is much lower, as seen by the number of frames which the feature is tracked prior to being introduced into the map. This is because many features that are being tracked are instantly initialized through stereo once they lay in the overlapped field of view. This is advantageous because it allows to introduce features known to be strong (enough to be tracked) directly without more tracking effort, compensating the effort used for the Ch processing and stereobased initialization.


**Table 1.** Statistics of initialized features and frames from first observation until initialization for DI-D SLAM and pseudo-stereo initialization.

Furthermore, in real-time applications employing this technique, the Cr sensor could be upgraded to an 'intelligent' sensor, with processing capabilities, using off-the-shelf technolo‐ gies—low-cost microcomputers, FPGA, etc. This approach would integrate image processing in the Ch sensor, allowing parallel processing of features, and sending only extracted features, reducing required bandwidth and transmission time. This processing step could be done while the robotic camera Cr makes the general EKF-SLAM process, and thus, it would be possible to have the SURF landmarks' information after the EKF update, in time for the possible inclusion of new features.
