### **2.6 About using glasses and screen displays**

AR/VR glasses or head-mounted displays (HMDs) – together referred to as near-eye displays (NEDs) – can in principle support full six-degrees-of-freedom (6DoF) freedom in choosing one's viewpoint on the displayed 3D content. Naturally, this also requires enough physical space, and precise tracking of the user's motion and orientation (i.e. pose).

HMDs (VR glasses) are well accepted for playing immersive and interactive computer games, and are commonly used with VW-based telepresence and online platforms. However, they are still challenged by resolution, weight, and the lack of support for natural focus (accommodation), causing discomfort and nausea when viewing stereoscopic content [6, 21, 22].

Optical see-through (OST) AR glasses (cf. MS HoloLens) have succeeded best in XR applications, but in addition to sharing the above challenges of HMDs, they lack natural occlusions, i.e. the ability to block the real background with augmentations, which makes AR objects appear translucent. When augmenting natural views with 3D objects or other visual elements, closer objects should in general occlude those further away. Real and virtual objects may be in any depth order, meaning that foreground objects need to occlude the background whether they are real-world captures or virtual renderings [23]. These so-called mutual occlusions are especially difficult to support with OST glasses like MS HoloLens.
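To make the occlusion logic concrete, the sketch below composites a rendered augmentation over a camera image with a per-pixel depth test. It is a minimal illustration, assuming registered, same-resolution depth maps for both layers; the function and parameter names are ours, not from the cited works:

```python
import numpy as np

def composite_mutual_occlusion(real_rgb, real_depth, virt_rgb, virt_depth, virt_alpha):
    """Per-pixel depth test: whichever layer is closer wins.

    real_rgb   : (H, W, 3) camera image
    real_depth : (H, W)    metric depth of the real scene (e.g. from an RGB-D sensor)
    virt_rgb   : (H, W, 3) rendered augmentation
    virt_depth : (H, W)    depth buffer of the rendering, same camera model
    virt_alpha : (H, W)    1 where the augmentation covers a pixel, else 0
    """
    # The augmentation occludes the real view only where it is both
    # present (alpha) and closer than the real surface.
    virt_in_front = (virt_alpha > 0) & (virt_depth < real_depth)
    out = real_rgb.copy()
    out[virt_in_front] = virt_rgb[virt_in_front]
    return out
```

OST glasses cannot perform this test optically, since they can only add light on top of the real background; video see-through or screen displays can, because both layers pass through the same compositing step.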

In telepresence use, a serious drawback of both HMDs and OST AR glasses is that they cover their user's face, which makes it difficult to see a participant's facial features and gaze direction, whether animating an avatar in virtual approaches or viewing photorealistic captures. Correspondingly, although they have advanced considerably in recent years, NEDs are not yet good enough to be generally accepted and applied in 3D telepresence.

On the other hand, screen displays have developed in size, accuracy, and affordability. Being usually viewed from some distance, they are e.g. less prone to the vergence-accommodation conflict (VAC) in stereoscopic viewing. Naturally, they restrict the choice of viewpoints, and in XR they make it nearly impossible to mix remote and local content naturally in the depth dimension (except when mirroring a local environment with augmentations). In short, screen displays are less immersive, but easier to view than NEDs.

As the size and accuracy of screen displays grow and viewing distances decrease, supporting accommodation may become necessary also with screen displays. However, existing external accommodative displays either support viewing only from fixed viewpoints or suffer from other severe rendering limitations (cf. the lack of colors, brightness, and occlusions when using e.g. holographic volume displays).
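As background to the accommodation issue, the mismatch between where the eyes converge and where they focus can be quantified with a standard geometric relation. The sketch below assumes a symmetric, on-axis stereoscopic viewing geometry (an illustrative simplification, not taken from the cited works):

```latex
% Vergence-accommodation conflict for a flat stereoscopic screen:
% the eyes accommodate to the screen at distance d_s, but converge on a
% virtual point whose distance d_v is set by the on-screen disparity p
% (e = interpupillary distance; p > 0 for points in front of the screen).
d_v = \frac{e\, d_s}{e + p},
\qquad
\mathrm{VAC} = \left| \frac{1}{d_s} - \frac{1}{d_v} \right|
\quad \text{(in diopters)}
```

Because the conflict is measured in diopters (inverse meters), the same relative depth offset produces a larger conflict at short viewing distances, which is why accommodation support matters less for screens viewed from afar and may become necessary as viewing distances shrink.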

### **2.7 About XR and its role in 3D telepresence**

Spatial faithfulness is an inherent requirement in XR visualization, which aims at seamlessly replacing parts of a physical view with virtual elements. XR visualization can equally be used for rendering 3D models (cf. avatars) or visual reconstructions of human participants into a participant's view. Correspondingly, for more than two decades, developing enablers for XR has also advanced 3D telepresence solutions. VTT has conducted extensive research on these topics; examples of results can be found on the web (http://virtual.vtt.fi/virtual/proj2/multimedia/) and on YouTube (www.youtube.com/user/VTTAugmentedReality).

VTT also made an early implementation of MR telepresence (MR Conferencing) in 2008–2009 [24]. The implementation supported participation in telepresence sessions using normal videoconferencing terminals and screens, as well as a VR space (Second Life) with avatars. Registration of avatars to a real space used visual markers, which at that time was the main approach in AR visualization. Note that today's MR telepresence implementations do not necessarily differ much from the VTT example, apart from using feature-based tracking instead of markers (**Figure 5**).
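For illustration, marker-based registration of the kind described can be sketched with today's tooling as follows. OpenCV's ArUco module is used here purely as an example (it postdates the 2008–2009 implementation, and the detector API shown assumes OpenCV 4.7 or later):

```python
import cv2
import numpy as np

# Camera intrinsics from a prior calibration (placeholder values).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)  # assume negligible lens distortion

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary)

frame = cv2.imread("camera_frame.png")  # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
corners, ids, _ = detector.detectMarkers(gray)

if ids is not None:
    # 3D marker corners in the marker's own frame (side length in meters),
    # ordered top-left, top-right, bottom-right, bottom-left (y up).
    side = 0.10
    obj = np.array([[-side / 2,  side / 2, 0.0],
                    [ side / 2,  side / 2, 0.0],
                    [ side / 2, -side / 2, 0.0],
                    [-side / 2, -side / 2, 0.0]])
    # solvePnP recovers the camera pose relative to the marker; the
    # resulting rvec/tvec anchor virtual content (e.g. an avatar) to it.
    ok, rvec, tvec = cv2.solvePnP(obj, corners[0].reshape(4, 2), K, dist)
```

Feature-based tracking replaces the printed marker with natural image features, but the downstream pose estimation and anchoring of augmentations remain conceptually the same.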

Traditionally, making 3D captures at the target location ("on the spot") has been the only way to *precisely support* either positioning or viewing AR content in that location. By precisely, we mean that AR content can be bound to both shapes and textures (i.e. to a precise visual context). This has meant making the 3D capture and reconstruction locally, as an offline, in-advance process. Correspondingly, remote production and positioning of XR objects – without knowledge of the local visual context – has been based on: 1) assumed textures (e.g. known/assumed markers/pictures/objects/color patterns), 2) locally scanned shapes at display time (e.g. physical delimiters like floor and wall planes), or 3) actions by the viewer (e.g. positioning of avatars and talking heads for communication).

However, tele-interaction and XR applications can be supported better if 3D capture for making augmentations is enabled also remotely and more in real time. This is possible by using efficient 3D capture, coding, and streaming methods. The importance of efficient and high-quality 3D streaming and interaction is growing fast due to the transformation towards distributed industrial processes, combined with the simultaneous need to reduce physical travel. The authors of this paper have obtained successful results in applying standard coding methods to real-time streaming of video-plus-depth data from RGB-D sensors (including means of supporting high enough pixel dynamics for the depth sensor data). These are presented later in Chapter 3.4.

**Figure 5.** *MR conferencing between virtual and real spaces: a) Second Life view (screenshot), b) real life (augmented video).*
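As an illustration of the pixel-dynamics issue mentioned above, one common approach (not necessarily the authors' scheme, which is detailed in Chapter 3.4) is to carry 16-bit depth through an 8-bit video codec by splitting it into two planes:

```python
import numpy as np

def pack_depth16(depth_mm: np.ndarray):
    """Split a 16-bit depth map (e.g. millimeters from an RGB-D sensor)
    into two 8-bit planes so it can be fed through a standard 8-bit
    video encoder. Lossy codecs will corrupt the MSB/LSB boundary;
    practical schemes add redundancy or remapping, omitted here."""
    assert depth_mm.dtype == np.uint16
    msb = (depth_mm >> 8).astype(np.uint8)    # coarse range
    lsb = (depth_mm & 0xFF).astype(np.uint8)  # fine detail
    return msb, lsb

def unpack_depth16(msb: np.ndarray, lsb: np.ndarray) -> np.ndarray:
    """Reassemble the 16-bit depth map after decoding."""
    return (msb.astype(np.uint16) << 8) | lsb.astype(np.uint16)
```

The attraction of such packing is that the heavy lifting (motion compensation, entropy coding, transport) is delegated to mature, hardware-accelerated video codecs rather than bespoke depth compressors.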

As a natural trend in 3D telepresence implementations, there is a need for increasing accuracy (pixel dynamics) and resolution in 3D reconstruction. Following the progress in industrial applications, lidar sensors are likely to come into use also in telepresence solutions. In addition to the now common point cloud coding and transmission (e.g. using octrees [18]), this will likely require new coding methods which – in addition to or instead of point clouds – efficiently support real-time transmission and visualization of high-quality surfaces and color textures (cf. the approaches used with RGB-D sensors).
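For reference, the core idea of octree-based point cloud coding [18] can be sketched in a few lines: each node emits one occupancy byte and point positions are implied by the traversal, so no coordinates are transmitted. Entropy coding of the occupancy stream, as real codecs use, is omitted:

```python
import numpy as np

def encode_octree(points, center, half, depth, out):
    """Depth-first octree occupancy coding. Each internal node emits one
    byte whose bits flag its occupied children; the decoder replays the
    same traversal, recovering positions at 'half / 2**depth' resolution."""
    if depth == 0 or len(points) == 0:
        return
    # Classify each point into one of the 8 octants of the current cube.
    octant = ((points[:, 0] > center[0]).astype(int)
              | ((points[:, 1] > center[1]).astype(int) << 1)
              | ((points[:, 2] > center[2]).astype(int) << 2))
    mask = 0
    children = []
    for i in range(8):
        sub = points[octant == i]
        if len(sub):
            mask |= 1 << i
            off = np.array([i & 1, (i >> 1) & 1, (i >> 2) & 1]) - 0.5
            children.append((sub, center + off * half, half / 2))
    out.append(mask)  # one occupancy byte per internal node
    for sub, c, h in children:
        encode_octree(sub, c, h, depth - 1, out)
```

Calling `encode_octree(points, np.zeros(3), 1.0, 8, stream)` on points normalized to the cube [-1, 1]³ yields the occupancy byte stream. The coarse-to-fine structure also enables progressive transmission, but it codes points rather than the continuous surfaces and textures called for above.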

### **2.8 Focus of the research and the rest of our paper**

Most existing telepresence solutions are either photorealistic or virtual, i.e. fall into the first and fourth quadrants of **Table 1**. Hybrid approaches mix real and virtual components (for either participants or spaces), meaning that they are XR approaches (cf. the discussion in Chapter 2.5). Note that in telepresence, augmenting remote participants, spaces, or objects occurs over a network, meaning that it is a case of remote XR (cf. Chapter 2.7), which requires delivering more position and 3D data than traditional AR, both for augmentation and viewing.

Further, although augmented content can also be viewed on fixed or mobile screens, the best and most immersive way of viewing 3D augmentations is with AR glasses. Correspondingly, accurate tracking is needed both for positioning augmentations (note that in telepresence this needs to happen over a network/distance) and for seeing them from a correct viewpoint in the target space. Supporting the same for multiple remote sites and participants adds further complexity, especially if the goal is to support a shared understanding of participant positions (cf. face-to-face meetings).

**Table 2** summarizes our exemplary focus on videoconferencing-type photorealistic telepresence approaches (cf. the quadrant real human – real space in **Table 1**), with hybrid enhancements based on 3D streaming and XR visualization.


#### **Table 2.**

*Defining the focus to screen-based 3D telepresence solutions.*

A simplified hypothesis for our study is that much of the complexity of 3D telepresence solutions can be avoided by aiming at a screen-based solution without a (fully) realistic meeting geometry. An important cue is motion parallax, supported by tracking small user motions and serving new viewpoints accordingly. Because of this choice, tracking and exchanging user positions for maintaining a unified meeting geometry is omitted, simplifying the solution considerably. Correspondingly, although beneficial in some geometry-supporting solutions, the earlier described video-on-demand approach is not needed either.
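A minimal sketch of how tracked head motion can drive motion parallax on a fixed screen follows, using the classic "fish-tank VR" off-axis projection. The function name and the screen-centered coordinate convention are illustrative assumptions, not the paper's implementation:

```python
def off_axis_frustum(head, screen_w, screen_h, near):
    """Asymmetric view frustum for a head-tracked screen.

    head     : (x, y, z) head position in screen-centered coordinates,
               x right, y up, z the distance from the screen plane (m)
    screen_w, screen_h : physical screen size (m)
    near     : near clip distance (m)

    Returns (left, right, bottom, top) at the near plane, usable as
    glFrustum-style parameters; the view matrix is then a pure
    translation by -head. Small head motions shift the frustum and
    produce motion parallax without any shared meeting geometry.
    """
    x, y, z = head
    scale = near / z  # project the screen edges onto the near plane
    left = (-screen_w / 2 - x) * scale
    right = (screen_w / 2 - x) * scale
    bottom = (-screen_h / 2 - y) * scale
    top = (screen_h / 2 - y) * scale
    return left, right, bottom, top
```

Note that only the local viewer's small motions are tracked; no positions are exchanged between sites, which is exactly the simplification hypothesized above.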

Despite the demarcation to screen displays, the solution can be enhanced with remote XR functionalities, i.e. by bringing the benefits of hybrid approaches to a photorealistic screen-based solution. With screen displays, it is also easier to support natural occlusions when compiling remote views (e.g. there is no need to use XR approaches for displaying remote views around a local participant). Using external (flat) screens naturally also avoids covering faces with glasses displays, i.e. better supports photorealistic capture of participants.

Correspondingly, the next chapter describes the above photorealistic approach to 3D telepresence, giving more details on its main challenges and on the status of related technical enablers. The most important of these enablers is support for coding and streaming RGB-D data, for which an exemplary implementation is described with some numerical results.
