**4. Introducing the human component into monocular SLAM**

Landmarks or points of interest in the map are assumed to be part of the environment; the hypothesis used is that they will remain static, so their positions do not change.

After the prediction step, a conventional EKF-SLAM would produce actual measurements from the sensors and compare them with the predicted measurements obtained through the direct observation model. This step requires solving the data association problem, which consists of matching the predicted measurements with the actual measurements from the sensors. Given the computational cost of extracting all the possible points of interest at each frame and matching them with the predicted points, and the issues produced by the uncertainty of the prediction (which is based on random movements), an active search strategy is used to deal with these problems [32]. Under this strategy, the features in the map are projected into pixel space using the direct observation model, and around each predicted pixel a search is performed for the most similar point (according to the stored patch) using zero-normalized cross-correlation (ZNCC). Ideally, each predicted feature will be matched to a new pixel corresponding to the same feature in the latest frame. This process can fail due to visual artifacts, the geometry of the environment, the presence of dynamic objects, and several other causes. So these pairs of points (namely, the predicted feature and the match found in the latest frame) are checked through a data association validation methodology, in our case the HOHCT [13].

Once the predicted landmarks and their associated pairings are found in image space, in pixel coordinates, the innovation (or residual) of the Kalman filter can be computed following the usual EKF methodology.

Although the iterative estimation of the map as an EKF is fairly straightforward, there is still a critical process which defines many characteristics of any given monocular SLAM approach: the feature initialization process. When using conventional point detectors and descriptors, it is frequent that several dozens or even hundreds of points appear in an image, and most of them will be ignored for the SLAM process (based on spatial distribution, position in the image, etc.); still, no depth information is available in an instant way. Thus, two main strategies exist to deal with this issue: undelayed approaches try to 'guess' a value to initialize the depth, normally relying on heuristics, while delayed approaches track a feature over time until they have a good estimation of its depth, and only then proceed to initialize it.

These two types of strategies define many characteristics of the SLAM procedures. As undelayed approaches try to use point features as landmarks just after they have been seen, the points are quickly introduced into the filter, accepting many outliers that have to be validated later or rejected at the data association validation step [28, 31]. On the other hand, delayed approaches track and estimate the points before using them, so the used landmarks are generally more stable and reliable with delayed initialization [33].

The delayed inverse-depth (DI-D) monocular SLAM is a delayed feature initialization technique [12, 13]. The delay between a landmark being observed for the first time and being initialized allows estimating the parallax achieved through the estimated odometry. This in turn enables obtaining depth estimations for the landmarks through triangulation.
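The ZNCC comparison used by the active search can be illustrated with a minimal sketch in NumPy. The function and variable names here are our own for illustration, not the chapter's implementation:

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-normalized cross-correlation between two equally sized patches.

    Returns a score in [-1, 1]; 1 means a perfect match up to an affine
    change of intensity (gain and offset), which is why ZNCC is robust
    to illumination changes between frames.
    """
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:  # a flat patch carries no texture information
        return 0.0
    return float(np.dot(a, b) / denom)

def active_search(image, stored_patch, center, radius):
    """Scan a window around the predicted pixel for the best ZNCC match.

    center is the (u, v) pixel predicted by the direct observation model;
    radius bounds the search region derived from the prediction uncertainty.
    """
    h, w = stored_patch.shape
    best_score, best_pos = -1.0, None
    cu, cv = center
    for v in range(max(cv - radius, 0), min(cv + radius + 1, image.shape[0] - h)):
        for u in range(max(cu - radius, 0), min(cu + radius + 1, image.shape[1] - w)):
            score = zncc(stored_patch, image[v:v + h, u:u + w])
            if score > best_score:
                best_score, best_pos = score, (u, v)
    return best_pos, best_score
```

In a real filter the search radius would be derived from the innovation covariance of each predicted feature, so well-localized features are searched over only a few pixels.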

92 Recent Advances in Robotic Systems

The DI-D procedure, although it was shown to be a strong monocular EKF-SLAM methodology, still presents several features that reduce its usability and scalability, mainly the need for an initialization process using synthetic or a priori known landmarks. These known landmarks initially help to produce the odometry estimation and are thus critical to solve the scale problem of the map.

Shifting the monocular SLAM problem from an isolated sensor fusion point of view to a component of a bigger human-robot collaborative effort allows considering new options. Given the features of current exploratory robots, it is worth noting that an exploratory team composed of robots and humans will outperform any robotic device alone. If the desired tasks increase in complexity (e.g., emergency situations requiring management and decisions under high uncertainty), the advantage of a human-robot collaborative team increases dramatically. Assuming that the human wears a headwear device with several sensors, the SLAM capabilities of the robot can be improved (**Figure 1**). Thus, the camera deployed in the helmet will be used to obtain 'instantaneous parallax', achieving complete measurements when the human is looking in the same direction as the robot, in a stereo-like situation, as initially proposed and described in the authors' previous work [14].

**Figure 1.** DI-D monocular EKF-SLAM components within a human-robot collaborative framework.

To achieve this, in addition to the new camera on the human (Ch), which will perform the depth estimation together with the robot camera (Cr), a combination of sensors and mechanisms able to estimate the pose between the cameras should be deployed. From a software point of view, the modules required to process these data and estimate the pose, as well as the new ones dealing with the depth estimation process, must be implemented. Within the EKF-SLAM methodology, the initialization of features remains a local process that introduces the features into the EKF using the inverse-depth parametrization, which stays the same, but it will also require a new inverse observation model that treats the whole system estimating features as a single complex sensory process.
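The inverse-depth parametrization mentioned here encodes each landmark by the camera's optical center at first observation, the azimuth and elevation of the viewing ray, and the inverse of the depth along that ray. A minimal sketch of the standard conversion back to a Euclidean point (in the form commonly attributed to Civera et al.; symbol names are ours):

```python
import numpy as np

def inverse_depth_to_euclidean(x, y, z, theta, phi, rho):
    """Convert an inverse-depth landmark (x, y, z, theta, phi, rho)
    to a 3D Euclidean point.

    (x, y, z):  optical center of the camera when the feature was first seen
    theta, phi: azimuth and elevation of the observation ray (world frame)
    rho:        inverse of the depth along that ray (rho = 1 / d)
    """
    # Unit direction vector of the observation ray from the two angles.
    m = np.array([np.cos(phi) * np.sin(theta),
                  -np.sin(phi),
                  np.cos(phi) * np.cos(theta)])
    return np.array([x, y, z]) + (1.0 / rho) * m
```

The appeal of this representation for EKF-SLAM is that low-parallax (distant) points map to small, near-Gaussian values of rho, which keeps the filter consistent before depth is well observed.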

### **4.1. Multiple monocular vision sensor array: pseudo-stereo vision**

The weak points discussed before can be solved within a cooperative context by exploiting data from another monocular camera. Assuming that Ch, with known pose, is near the robotic camera performing SLAM (Cr), joining observations from both cameras allows performing stereo-like estimation when their fields of view overlap. This way, a new non-constant stereo inverse-depth feature initialization approach will be used to address these issues.

Classical stereo approaches [34, 35] rely on epipolar geometry to create a calibrated camera rig with multiple constraints. These constraints typically include that both cameras' projection planes lie in the same plane in world coordinates; this simplifies the correspondence problem, as the match of a pixel from one image will lie on the corresponding epipolar line in the other, and rectification can turn these lines into straight lines parallel to the horizontal axis. Several works have dealt with the rectification of stereo images for cameras with unrestricted pose, both calibrated [35] and uncalibrated [36].

**Figure 2.** Image pair sample captured at one experimental sequence.

**Figure 3.** Rectification of the left and right images of **Figure 2**. Scale distortions are produced due to the multiple reprojection operations.

Fusiello et al. [35] detailed the first method to rectify stereo pairs from any given pair of calibrated cameras. The method is based on rotating the cameras until one of their axes is aligned with the baseline and forcing their projective planes to lie within the same plane, achieving horizontal epipolar lines. Other works have proposed similar approaches to rectifying stereo pairs assuming calibrated, uncalibrated, or even multiple-view [37, 38] stereo rigs. These approaches need to warp both images according to the rectification found (see **Figure 2** left and right and **Figure 3**) and, in some cases, produce great variations in terms of orientation and scale (**Figure 3**), thus rendering them less attractive for our approach.
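Even without rectification, the epipolar constraint still reduces the stereo search to a line. As a sketch of this standard construction (not the chapter's implementation; names are ours), the fundamental matrix can be built from the calibrated relative pose between the two cameras:

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix for two calibrated cameras with relative pose (R, t),
    where a point P in camera-1 coordinates maps to R @ P + t in camera 2.

    For a pixel x1 (homogeneous) in camera 1, its match in camera 2 lies on
    the epipolar line l2 = F @ x1, so the correspondence search is 1D even
    on unrectified images.
    """
    E = skew(t) @ R                              # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
```

This is why sparse stereo matching between Cr and Ch does not strictly require warping the images: each candidate match only needs to be checked against (or searched along) its epipolar line.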

In any case, dealing with stereo features without rectified images is not a big problem in the proposed approach. Stereo feature search and matching is done sparsely, only to introduce new features: during the initialization, or when the filter needs new features. In both cases, only part of the image will be explored, and when adding new features to an already initialized system, additional data from the monocular phase can be used to simplify the process.

### **4.2. Scaled feature initialization with collaborative sensing**

The requirement of metric scale initialization of the DI-D method can be avoided under the assumption of a cooperative framework. Classical DI-D required the presence of a set of known, easily identifiable features, estimated initially through the PnP problem, to initiate the EKF with scale. Assuming that at the start of the exploration a cooperating, freely moving camera is near, the data from this camera can produce the features needed through pseudo-stereo estimation. This process is shown in **Figure 4**: after the pose between the robot camera and the human camera is known, the maximum distance from a camera at which a point with a given minimum parallax (*plmin*) could lie is found. This distance is employed to build a model of the field of view of each camera as a pair of pyramids, each with its apex at the optical center of a pinhole camera and its base centered along the view axis. Then, it can be guaranteed that any point with parallax (between cameras) equal to or greater than *plmin* will lie in the space intersected by the two fields of view modeled as pyramids, as seen in **Figure 5**. So the intersection between the different polygons composing the pyramids is computed as a set of segments (two-point tuples), as described by **Algorithm 1**. Once all the segments are known, they are projected into the 2D projective space of each camera, and a search region is adjusted around them, determining the regions of interest where the stereo correspondence may be useful and significant.
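The maximum distance bounding the pyramids can be approximated with a simple geometric sketch. Assuming the simplified symmetric case of a point on the perpendicular bisector of the baseline (an illustration of ours, not the chapter's exact derivation):

```python
import math

def max_depth_for_parallax(baseline, pl_min_rad):
    """Largest depth at which a point midway between the two cameras still
    subtends at least the minimum parallax angle pl_min_rad.

    Symmetric geometry: the point, at depth d, sees the two optical centers
    separated by the baseline b, so tan(pl_min / 2) = (b / 2) / d.
    """
    return (baseline / 2.0) / math.tan(pl_min_rad / 2.0)
```

As expected, a longer baseline or a smaller minimum-parallax requirement pushes the usable depth (and hence the pyramid bases) farther from the cameras.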

In the interest regions found, SURF-based feature descriptors [39] are matched to produce new stereo features to initialize in the EKF state vector when needed. SURF is chosen over SIFT and FAST [39] due to the more convenient trade-off it offers in terms of matching accuracy and efficiency, and could be replaced by any other feature descriptor. Each pair of points matched between cameras allows estimating the world coordinates of the landmark seen, through triangulation, by back-tracing the points on the images from the robot camera and the human camera. Then, the landmarks found and fully measured (with real depth estimation) are introduced into the monocular EKF according to the unified inverse-depth parametrization. To take advantage of the computational effort made during the non-overlapping frames, the landmarks that were already being tracked for initialization prior to the pseudo-stereo measurement are given priority to be introduced; these landmarks are robust because they were tracked over several previous frames.
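The triangulation by back-tracing can be illustrated with a minimal midpoint method: each matched pair defines two rays in the world frame, and the landmark is taken halfway between their closest points. A sketch under the assumption of known optical centers and ray directions (all names are ours):

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint triangulation of two viewing rays.

    c1, c2: camera optical centers (world frame).
    d1, d2: ray directions through the matched pixels (world frame).
    Returns the point halfway between the closest points of the two rays,
    or None when the rays are near-parallel (no reliable depth).
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = c2 - c1
    k = d1 @ d2
    denom = 1.0 - k ** 2
    if denom < 1e-12:                     # near-parallel rays: parallax too low
        return None
    # Ray parameters minimizing |c1 + s*d1 - (c2 + t*d2)|.
    s = (b @ d1 - (b @ d2) * k) / denom
    t = ((b @ d1) * k - b @ d2) / denom
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))
```

The near-parallel rejection mirrors the role of *plmin* above: matches whose rays barely diverge are exactly the ones the parallax bound is designed to exclude.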

**Figure 4.** Block diagram of the final implementation, with sensors on gray boxes and software processes on clear blocks. The updated landmark initialization is one of the cornerstone processes in any feature-based EKF-SLAM technique.

**Figure 5.** Graphical representation of the intersection of both cameras' fields of view.

#### **ALGORITHM 1:**
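The listing of **Algorithm 1** is not reproduced in this excerpt. As a purely illustrative building block (our own hypothetical sketch, not the authors' algorithm), the two-point segments it produces can come from clipping a triangular pyramid face against the plane of a face of the other pyramid:

```python
import numpy as np

def triangle_plane_segment(tri, plane_point, plane_normal, eps=1e-9):
    """Intersect a triangle with a plane; return the crossing segment.

    tri: (3, 3) array of triangle vertices. Returns a (2, 3) array, i.e.
    one two-point tuple of the kind used to represent an intersection
    segment, or None when the triangle does not cross the plane.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (tri - plane_point) @ n          # signed distance of each vertex
    points = []
    for i in range(3):
        j = (i + 1) % 3
        if d[i] * d[j] < -eps ** 2:      # edge crosses the plane
            t = d[i] / (d[i] - d[j])
            points.append(tri[i] + t * (tri[j] - tri[i]))
        elif abs(d[i]) <= eps:           # vertex lies on the plane
            points.append(tri[i])
    if len(points) < 2:
        return None
    return np.array(points[:2])
```

A full implementation would additionally clip each resulting segment against the boundary of the other face, collecting all surviving segments as the intersection of the two field-of-view pyramids.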


