#### **4.1 Hybrid functionalities for human collaboration**

Hybrid functionalities (cf. **Table 1**) combine real and virtual components when rendering and displaying telepresence views. There are two main options, which are briefly described in the following, namely:

1. rendering a participant into a view as an animated virtual avatar, and

2. rendering a real-time captured (real) participant into a virtual space.

There are multiple implementations and services using the first approach, e.g. https://remoteface.ai (+ YouTube https://www.youtube.com/watch?v=prpPqwV5Weo). This approach requires either capturing a participant's facial features in order to animate the avatar or, at a minimum, capturing a participant's speech to estimate the underlying facial muscle movements and the corresponding animation parameters.
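As a rough illustration of the speech-driven variant, the sketch below maps recognized phonemes to viseme blend-shape weights; the phoneme source, the weight table, and the avatar API are our own placeholder assumptions, whereas production services typically use learned audio-to-animation models.

```python
# Toy sketch of speech-driven avatar animation (illustrative only):
# recognized phonemes are mapped to viseme blend-shape weights.
# The weight table and the avatar API below are hypothetical placeholders.
VISEME_WEIGHTS = {
    "AA": {"jawOpen": 0.7},                        # open vowel
    "OW": {"jawOpen": 0.4, "mouthPucker": 0.8},    # rounded vowel
    "M":  {"mouthClose": 1.0},                     # bilabial consonant
}

def animate_from_speech(phoneme_stream, avatar):
    """Drive avatar lip motion from a stream of recognized phonemes."""
    for phoneme in phoneme_stream:                 # e.g. from a speech recognizer
        weights = VISEME_WEIGHTS.get(phoneme, {})  # unknown phonemes -> neutral
        avatar.set_blendshapes(weights)            # hypothetical avatar call
```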

The second approach was already illustrated in Chapter 2.5, where a real-time captured human is rendered into a virtual space that is typically modeled in advance (cf. e.g. **Figure 4**). Note that a virtual space may even enable a remote participant to make virtual visits to that space (i.e. to view it from widely varying viewpoints), in particular if the viewing is supported by a glasses-type display.

Note that using glasses for viewing easily makes the interaction non-symmetrical or even one-way, as the glasses prevent either capturing the facial movements of their wearer or seeing his/her face and eyes. Partially working solutions to this problem have, however, been described in the literature, based e.g. on real-time manipulation of facial areas [7].

#### **4.2 Remote XR support functionalities**

When developing 3D telepresence solutions, we are particularly interested in supporting remote XR functionalities. There are two main approaches for doing this. The first approach requires only delivering images or video to the remote site(s); coding and streaming are supported straightforwardly by existing video coding methods. However, better support for remote interaction is provided by coding data from a depth sensor and, after streaming the data, making a 3D reconstruction at the remote site. Algorithms used for local reconstructions are also applicable to remote reconstructions.

Thus, in addition to better 3D perception, video-plus-depth data also supports forming (or copying) 3D reconstructions at remote sites. These reconstructions, which we have denoted as Visual Twins, can support various 3D remote support functionalities, e.g. 3D monitoring, control, and analysis, as well as remote augmentation with visualizations and instructions. As described in Chapter 3.4, this is feasible by applying existing coding methods.

We have tested both the video-based approach (so-called Ad-hoc AR) and the video-plus-depth based approach (so-called Visual Twin); both are described in more detail in the following:

1. Local 3D reconstruction, pointed to remotely for positioning augmentations (Ad-hoc AR)

In this option (**Figure 8**), a 3D reconstruction is made in a local space using e.g. an RGB-D sensor carried by a moving person or a robot. The orientation of each RGB image is derived in the normal way during the reconstruction process (e.g. using SLAM [34] and TSDF [35]) and stored locally with the image ID (e.g. a simple timestamp). The images are coded and streamed to a remote space either separately (e.g. following a manual selection) or as a sequence. In the remote space, a person selects a point (pixel) in an image where an augmentation should be shown, and messages back the image ID, the target pixel coordinates, and the data (or the ID, if stored on a common server) of the AR object, as sketched below.
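A minimal sketch of that return message is shown below; the field names are our own illustrative assumptions, not a published protocol:

```python
from dataclasses import dataclass

@dataclass
class ArAnnotationMessage:
    """Annotation sent from the remote viewer back to the local site.

    Field names are illustrative assumptions, not a published protocol.
    """
    image_id: int      # e.g. the timestamp of the selected RGB image
    pixel_u: int       # selected pixel column in that image
    pixel_v: int       # selected pixel row in that image
    ar_object_id: str  # ID of the AR object on a common server
                       # (or an inline payload, if no server is used)
```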

At the local site, the image's orientation w.r.t. the 3D reconstruction is fetched from the local memory. The point at which to show the augmentation is obtained by ray-tracing from the known orientation through the selected pixel. Ray-tracing defines a 3D surface point on the 3D reconstruction (and thus in the space), and enables local participant(s) to see the augmentation from various directions. **Figure 8** illustrates the process.
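For concreteness, the following is a minimal sketch of this ray-tracing step, assuming the local reconstruction is available as an Open3D triangle mesh and that a camera-to-world pose and intrinsics were stored per image ID; the chapter itself does not prescribe a particular library.

```python
import numpy as np
import open3d as o3d

def anchor_point(mesh, pose_c2w, K, u, v):
    """Intersect the ray through pixel (u, v) with the reconstruction.

    mesh:     the local 3D reconstruction (o3d.geometry.TriangleMesh)
    pose_c2w: 4x4 camera-to-world pose stored with the image ID
    K:        3x3 camera intrinsics of the RGB image
    """
    # Pixel -> ray direction in camera coordinates (pinhole model).
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate into world coordinates; the camera centre is the ray origin.
    d_world = pose_c2w[:3, :3] @ d_cam
    d_world /= np.linalg.norm(d_world)
    origin = pose_c2w[:3, 3]

    # Cast the ray against the reconstruction surface.
    scene = o3d.t.geometry.RaycastingScene()
    scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh))
    rays = o3d.core.Tensor([np.concatenate([origin, d_world])],
                           dtype=o3d.core.Dtype.Float32)
    hit = scene.cast_rays(rays)
    t = hit['t_hit'][0].item()       # distance to the first surface hit
    if not np.isfinite(t):
        return None                  # the ray missed the reconstruction
    return origin + t * d_world      # world-space anchor for the AR object
```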

2. Remote 3D reconstruction using streamed depth sensor data (Visual Twin)

**Figure 8.** *Ad-hoc AR, enabling remote augmentation of a locally reconstructed space.*

**Figure 9.** *Principle of Visual Twin, here used for remote monitoring & control.*

In this approach (**Figure 9**), the data for 3D reconstruction (e.g. RGB-D data) is coded and streamed over a network, and the reconstruction is made using the data decoded at the remote site. The solution described in Chapter 3.4 for the coding and streaming of video-plus-depth data directly supports this approach when performed in real time (a real-time implementation is soon to be completed by VTT).
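A minimal sketch of the receiving end is given below, assuming the decoded stream yields per-frame color and depth images with camera poses and using Open3D's TSDF integration; the decoder function and the parameter values are our own assumptions, not the VTT implementation.

```python
import numpy as np
import open3d as o3d

# Receiving end of a Visual Twin (sketch): decoded RGB-D frames are
# fused into a TSDF volume, from which the remote copy of the space
# is extracted. decoded_stream() is a hypothetical placeholder for
# the video-plus-depth decoder of Chapter 3.4.
intrinsics = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,    # 1 cm voxels (assumed resolution)
    sdf_trunc=0.04,       # TSDF truncation distance in metres
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

for color, depth, pose_c2w in decoded_stream():   # hypothetical decoder
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_trunc=4.0, convert_rgb_to_intensity=False)
    # integrate() expects a world-to-camera extrinsic matrix.
    volume.integrate(rgbd, intrinsics, np.linalg.inv(pose_c2w))

mesh = volume.extract_triangle_mesh()  # the remote Visual Twin surface
```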

Both of the approaches have been implemented and demonstrated by VTT, and they are suitable for enhancing our 3D telepresence solution with XR functionalities. The first, ray-tracing based option is simpler but, being based on images/video only, does not allow viewing the data in 3D at the remote site. Note that receiving video-plus-depth data also enables the ray-tracing approach, meaning that a combination of the two approaches is also possible.
