## **3. 3D telepresence solution using screen displays and supporting XR**

#### **3.1 Introduction**

In the following chapters, the main choices, enablers, and components are described for photorealistic telepresence with screen displays. Features from hybrid approaches are included, e.g. the possibility to replace the visual capture of a remote participant with an animated avatar. Further, in addition to screen-based communication, XR interactions can be supported separately by streaming 3D scanning results between meeting sites and by viewing either locally or remotely produced augmentations, e.g. with AR glasses.

#### **3.2 Serving viewpoints by screen displays**

Generally, serving moving participants requires views from arbitrary viewpoints. This in turn requires tracking of participant positions and the virtual meeting geometry in real time. Further, although it may be enough to model a meeting environment in advance, photorealistic 3D capture of participants needs to be made in real time. This in turn requires a setup of multiple 3D sensors and an efficient reconstruction algorithm.

**Figure 6.** *Supporting motion parallax to a mosaic of 2D or 3D renderings on screen.*

It is, however, possible to simplify the implementation considerably by relaxing the requirement for a natural geometry. At the minimum, small motion parallax and even natural focus (e.g. using MFPs [16, 17, 25]) can be supported without forming and maintaining a virtual meeting geometry between participants. Although with more limitations than with NEDs, flat screens can also support user mobility and consistency of meeting geometries.

By relaxing geometry constraints, more freedom for display arrangement and mobility can be achieved. For example, motion parallax can be supported also when remote participants are compiled into a video mosaic on a display (a typical situation during video conferencing as such), and thus 3D cues are better supported (**Figure 6**). Note that it may not be very harmful even if all remote participants have their faces oriented towards a local viewer (cf. a "positive Mona Lisa effect", i.e. getting eye contact even when not being looked at).

In this simplified approach, accurate tracking and delivery of user positions is not needed, and neither is the definition of a unified meeting geometry. Instead, tracking reduces to a local and rather approximate process of detecting the direction of participant motion, indicating rather a viewer's qualitative desire to perceive motion parallax. Further, by settling for frontal 3D captures, only one capture sensor is required.
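As an illustration, the following sketch (not the authors' implementation; the detector choice, smoothing factor, and motion threshold are assumptions) detects such a qualitative direction of viewer motion from a single webcam using a standard face detector.

```python
# Minimal sketch: qualitative detection of a viewer's sideways motion from a
# webcam, as needed for local motion parallax. Thresholds are illustrative.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def head_motion_direction(capture, alpha=0.3, threshold=4.0):
    """Yield 'left', 'right' or 'still' for each frame of an opened cv2.VideoCapture."""
    smoothed_x = None
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, 1.2, 5)
        if len(faces) == 0:
            yield "still"
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # pick the largest face
        center_x = x + w / 2.0
        if smoothed_x is None:
            smoothed_x = center_x
        delta = center_x - smoothed_x                        # motion vs. smoothed history
        smoothed_x = (1 - alpha) * smoothed_x + alpha * center_x  # EMA smoothing
        if delta > threshold:
            yield "right"
        elif delta < -threshold:
            yield "left"
        else:
            yield "still"
```

The coarse "left/right/still" output is enough to drive locally synthesized parallax without delivering any positions to remote sites.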

As a result, the 3D cues supported by the suggested system are limited to synthesized motion parallax, true eye contact (cf. avoiding the effect of a camera-display offset), and perception of depth. All of these are important improvements over existing videoconferencing solutions. As described in [16, 17, 25], supporting natural focus/accommodation is also possible, provided that practical solutions for MFP displays or the like come to market.

#### **3.3 Simplified user tracking and geometry formation**

Generally, user tracking and positioning is an important functionality of 3D telepresence solutions. User positioning is required for 1) forming a consistent virtual geometry between participants, and 2) serving a participant with viewpoints complying with his/her movements in the defined meeting geometry. A tracking device can be carried by a participant or can measure the person from outside. Visual tracking is commonly assisted by other electronic sensors (e.g. IMUs) and by fusing the results for better accuracy.

For a telepresence session, a common server most favorably performs the formation of the virtual meeting geometry. For this purpose, the server needs participant positions from each telepresence terminal (cf. varying participant positions in **Figure 3c**). Bitrates for delivering 3D positions may be reduced by a suitable coding method, e.g. differential, run-length (RL), or variable-length coding (VLC), or a combination of these, as sketched below.
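The following minimal sketch illustrates such position coding; it combines quantization, differential coding, and a general-purpose entropy coder (zlib, standing in here for the RL/VLC stage). The quantization step and packet format are assumptions.

```python
# Illustrative sketch of reducing the bitrate of tracked 3D positions:
# quantize, delta-encode, then entropy-code the byte stream.
import struct
import zlib

def encode_positions(positions, step_mm=1.0):
    """positions: list of (x, y, z) in millimetres; returns compressed bytes."""
    deltas, prev = [], (0, 0, 0)
    for p in positions:
        q = tuple(int(round(c / step_mm)) for c in p)            # quantize
        deltas.append(tuple(q[i] - prev[i] for i in range(3)))   # differential coding
        prev = q
    raw = b"".join(struct.pack("<3i", *d) for d in deltas)
    return zlib.compress(raw)                                    # stand-in for RL/VLC

def decode_positions(blob, step_mm=1.0):
    raw = zlib.decompress(blob)
    out, prev = [], (0, 0, 0)
    for i in range(0, len(raw), 12):
        d = struct.unpack("<3i", raw[i:i + 12])
        prev = tuple(prev[j] + d[j] for j in range(3))           # undo differences
        out.append(tuple(c * step_mm for c in prev))
    return out
```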

In general, user tracking, either from outside or by wearable sensors, has evolved considerably in recent years. A good solution for providing 6DoF head motion tracking is visual-inertial odometry (VIO), which estimates the relative position and orientation of a moving device in an unknown environment using a camera and motion sensors (https://en.wikipedia.org/wiki/Visual_odometry). A big advantage of VIO is that it can be processed on glasses or an HMD without external setups, i.e. without sensors, markers, cameras, or lasers set up throughout the room. A comparison of several VIO approaches is presented in [26].

Note that in our simplified telepresence approach using screen displays, the requirement of perceiving a consistent geometry between participants is relaxed to ease the implementation. For screen-based communication, there is no need to derive user positions accurately, nor to deliver them to remote sites. Correspondingly, there is no need to track a camera or cameras for 3D reconstruction either. For supporting motion parallax, a rather qualitative detection of user motion is enough, i.e. simply detecting whether a viewer is moving slightly (e.g. leaning left or right) to perceive a slightly altered view. These small viewpoint changes can be supported locally, e.g. by synthesizing the viewpoints, so that there is no need to deliver captured motions to other participants.
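The following is a minimal sketch of such local viewpoint synthesis from a single video-plus-depth frame, using a simple horizontal disparity shift; the scale factor and the handling of dis-occlusions are simplified assumptions rather than the renderer used by the authors.

```python
# Minimal sketch of local viewpoint synthesis from one video-plus-depth frame:
# pixels are shifted horizontally in proportion to their disparity (inverse depth),
# giving a slightly altered view for small head motions.
import numpy as np

def synthesize_view(rgb, depth_m, baseline_px=8.0):
    """rgb: HxWx3 uint8, depth_m: HxW depth in metres (0 = invalid).
    baseline_px scales the virtual camera offset (in pixels at 1 m depth)."""
    h, w = depth_m.shape
    out = np.zeros_like(rgb)
    disparity = np.where(depth_m > 0, baseline_px / depth_m, 0.0)
    xs = np.arange(w)
    for y in range(h):
        new_x = np.clip((xs + disparity[y]).astype(int), 0, w - 1)
        order = np.argsort(depth_m[y])[::-1]       # splat far-to-near so near pixels win
        out[y, new_x[order]] = rgb[y, order]
    return out  # dis-occluded holes remain black in this simplified sketch
```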

In case the solution is enhanced by support for seeing augmented objects in a participant's space, the tracking needs to cover a wider area and be more accurate. However, if the support is only for seeing XR objects locally, there is no need to deliver viewer motions to other sites.

#### **3.4 Coding and streaming 3D data**

Efficient 3D capture, coding, and streaming are important for future 3D telepresence solutions [27–29]. As we introduced in Chapter 2.4, coding and delivery of 3D volume data is neither reasonable nor necessary for supporting spatial faithfulness, as a viewer is able to see a 3D environment or content only from one (binocular) viewpoint at a time. This suggests a viewpoint-on-demand approach, which, instead of delivering complete 3D views, serves remote viewers with video-plus-depth (V + D) perspectives from desired viewpoints. A prerequisite of this approach is that user positions are tracked and set into a unified geometry defining (virtual) lines-of-sight between participants.
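To make the viewpoint-on-demand idea concrete, the hypothetical message structures below show what a receiver could report (its tracked viewpoint in the unified geometry) and what the sender could answer with (a V + D perspective rendered for that viewpoint). The field names are illustrative and not a defined protocol.

```python
# Hypothetical viewpoint-on-demand messages (illustrative field names only).
from dataclasses import dataclass

@dataclass
class ViewpointRequest:
    participant_id: str
    position_m: tuple        # (x, y, z) of the viewer in the unified meeting geometry
    orientation_quat: tuple  # viewer orientation as a quaternion (w, x, y, z)
    timestamp_ms: int

@dataclass
class VideoPlusDepthFrame:
    viewpoint: ViewpointRequest  # the viewpoint this perspective was rendered for
    color_packet: bytes          # e.g. one HEVC-coded colour frame
    depth_packet: bytes          # coded depth (e.g. via depth blending, see below)
    intrinsics: tuple            # (fx, fy, cx, cy) of the virtual camera
```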

Conveniently, the V + D format suggested above also serves well in enhancing screen-based telepresence solutions, both with additional 3D cues (motion parallax and depth, the latter for stereoscopy or for supporting natural focus/accommodation) as well as with XR visualizations and functionalities. Although communication is based on viewing remote participants on screens, a system can also support producing and delivering XR objects, viewed by a local participant with glasses or by looking through a mobile device.

Using video-plus-depth captures simplifies and eases the implementation and reduces bitrates and complexity in data coding and streaming. We applied existing video coding methods supported by FFMPEG for encoding the RGB-D data (e.g. HEVC/X265). A basic challenge is that a Kinect type of sensor produces 16-bit depth values, which are not supported by these video coding methods. For that reason, we rounded the 16-bit depth values to the closest 12-bit integers before coding.
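The exact rounding is not detailed above, but a straightforward variant, assuming simple scaling from the 16-bit to the 12-bit range, could look as follows.

```python
# Sketch of 16-bit to 12-bit depth rounding; simple scaling by 1/16 with
# rounding to the nearest integer is assumed here.
import numpy as np

def depth16_to_depth12(depth16):
    """depth16: HxW uint16 depth image; returns values rounded to the 12-bit range."""
    depth12 = np.round(depth16.astype(np.float32) / 16.0)   # 65535 -> ~4096
    return np.clip(depth12, 0, 4095).astype(np.uint16)      # keep within 12 bits
```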

**Figure 7** illustrates the pipeline in our experiments. The quality of our video-plus-depth coding and streaming was evaluated by comparing the direct reconstruction result from a moving RGB-D sensor to the reconstruction made after coding and streaming the data with a HEVC/X265 (FFMPEG) codec. The reconstruction algorithm was the one provided by Open3D. The test sequence was sequence 016 (here denoted as 'Bedroom') from the SceneNN dataset at http://www.scenenn.net, obtained using an Asus Xtion PRO, a Kinect 1 type of depth sensor. The sequence consists of 1364 color and depth frames (captured in about 45 seconds), both in PNG format with 480 × 640 pixels per frame.
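For reference, encoding the per-frame PNG sequences with FFMPEG's HEVC/x265 encoder can be scripted along the following lines; the file naming pattern, preset, and CRF value are assumptions rather than the settings used in the experiment.

```python
# Example of encoding a PNG frame sequence with FFMPEG's HEVC/x265 encoder,
# similar in spirit to the pipeline of Figure 7 (parameters are illustrative).
import subprocess

def encode_png_sequence(pattern="color/frame%05d.png", output="color_hevc.mp4",
                        fps=30, crf=28):
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", pattern,          # e.g. color/frame00001.png, color/frame00002.png, ...
        "-c:v", "libx265",      # HEVC / x265 encoder
        "-preset", "medium",
        "-crf", str(crf),
        output,
    ]
    subprocess.run(cmd, check=True)
```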

In **Figure 7**, the video and depth sequences were transformed into two video-type sequences using so-called depth blending, i.e. modulating the original input video by a linearly weighted depth map and its inverse [25]. This results in two video-like sequences with the partition-of-unity property, meaning that the output sequence is obtained by summing up the modulated (and coded and streamed) video components in the receiver. The coded depth map sequence is obtained from the ratio of the luminances of the corresponding pixels. Note that the same approach is typical when forming MFPs for accommodation-supportive displays. Here, we omit further details of the coding process and suggest that an interested reader study e.g. the above references.
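A minimal sketch of the depth-blending idea is given below: the color frame is modulated by a linear depth weight and its complement, and the depth is recovered at the receiver from the luminance ratio of the two components. The direction of the weighting and the depth-range normalization are assumptions; see [25] for the actual formulation.

```python
# Minimal sketch of depth blending with the partition-of-unity property.
import numpy as np

def depth_blend(rgb, depth, d_min=0.0, d_max=4095.0):
    """Split an RGB frame into two video-like components weighted by depth."""
    w = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)   # linear weight in [0, 1]
    near = rgb.astype(np.float32) * (1.0 - w)[..., None]
    far = rgb.astype(np.float32) * w[..., None]
    return near, far                    # the two components sum back to the original

def depth_unblend(near, far, d_min=0.0, d_max=4095.0, eps=1e-6):
    """Recover the colour frame (sum) and the depth map (luminance ratio)."""
    rgb = near + far                                            # partition of unity
    luma = lambda x: 0.299 * x[..., 0] + 0.587 * x[..., 1] + 0.114 * x[..., 2]
    w = luma(far) / np.maximum(luma(rgb), eps)                  # ratio of luminances
    depth = d_min + w * (d_max - d_min)
    return np.clip(rgb, 0, 255).astype(np.uint8), depth
```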

In our experiment with the above Bedroom sequence, the average bitrate for the original video-plus-depth data from the sensor was 103 Mbit/s (RGB-D frames in PNG format, 30 fps), and the average bitrate for the coded and streamed data was 567 kbit/s, corresponding to a compression ratio of about 180:1. The standard (RMS) deviation of the output voxels was 4.2 mm compared to the input ('original') surface, as derived from the reconstructions by the CloudCompare software (see https://en.wikipedia.org/wiki/CloudCompare). PSNR was calculated from the differences between corresponding YCbCr pixels of the input and output sequences. The average PSNR was 50.3 dB for the luminance (Y), and 48.6 dB and 55.2 dB for the Cb and Cr components, respectively. The YCbCr format was chosen for being traditionally used in compression research and for better specifying the obtained PSNR values. Calculations were made using Matlab (r2018b) functions for format conversions and PSNR. These numerical results are very good, and when viewed by eye, both the videos and the reconstructions appear identical.
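For reference, a numpy equivalent of the PSNR computation described above could look as follows (the original calculations were made with Matlab; the BT.601 RGB-to-YCbCr conversion is assumed here).

```python
# Per-component PSNR between an original and a coded frame in YCbCr.
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 full-range RGB-to-YCbCr conversion (assumed here)."""
    rgb = rgb.astype(np.float64)
    y  = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    cb = 128.0 - 0.168736 * rgb[..., 0] - 0.331264 * rgb[..., 1] + 0.5 * rgb[..., 2]
    cr = 128.0 + 0.5 * rgb[..., 0] - 0.418688 * rgb[..., 1] - 0.081312 * rgb[..., 2]
    return y, cb, cr

def psnr(reference, test, peak=255.0):
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Usage:
# y0, cb0, cr0 = rgb_to_ycbcr(original_frame)
# y1, cb1, cr1 = rgb_to_ycbcr(coded_frame)
# print(psnr(y0, y1), psnr(cb0, cb1), psnr(cr0, cr1))
```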

**Figure 7.** *3D pipeline showing data (color-plus-depth data from a moving RGB-D sensor) being captured, coded, streamed, and reconstructed in our experiment. The quality comparison (PSNR) is made between the original and coded videos, and the reconstruction accuracy (RMS distance to the nearest voxel) is measured from 3D reconstructions using the original and coded depth images.*


Note that the above pipeline is still an offline implementation, using stored files for both input and output data. Correspondingly, there were no real-time limitations in the above simulations. The authors expect to complete a real-time version of the pipeline by autumn 2021.

The 3D streaming solution described above can obviously also support higher resolutions. For example, with a fourfold resolution compared to the experiments, the bitrate would remain in the order of 2 Mbit/s. As long as no better coding methods are available, supporting higher pixel dynamics, i.e. more bits per pixel (e.g. by depth blending as in our solution), will be more challenging.

Generally, the bitrate for video-plus-depth is much lower than when streaming multiple-view or volume videos. As a comparison, the approaches used in [7] resulted in an average transfer rate of 1–2 Gbps for a 30 fps stream. According to Qualcomm, 6DoF video demands bitrates in the range of 200 Mbps to 1000 Mbps depending on the end-to-end latency. These figures are just indicative, as the bitrates depend heavily on the used coding scheme and on many factors affecting quality (notably the resolution used in 3D data capture). Interested readers may find more information in the references given in the beginning of this chapter.

Our simulations on 3D streaming indicate that reconstructing 3D models from coded and streamed video-plus-depth data succeeds with adequate quality for 3D viewing and reconstruction. Note that the same simple data pipeline also enables various remote support functionalities, including remote 3D analysis based on the coded information.

The principle of using compressed data for visualization and analysis is denoted as the compress-then-analyze (CTA) approach [30, 31]. According to [31], the opposite analyze-then-compress (ATC) approach may outperform it at low bitrates. However, ATC limits system flexibility, as for example normal viewing of the stream is not possible using only the received visual features. Further, ATC fixes the feature selection method at the capture side, limiting the applicable approaches for remote analysis. In fact, CTA provides superior flexibility in multipoint settings, enabling for example any analysis approach by multiple remote receivers. According to [30], CTA may also outperform ATC at high bitrates. It is worth noticing that the referred studies on CTA used JPEG compression for the visual features, which wastes bitrate and lowers quality compared to our efficient spatiotemporal CTA approach.
