**2.1 Spatial faithfulness supports naturalness in perception**

3D telepresence solutions aim to support natural perception in 3D – i.e. spatial faithfulness – better than video conferencing systems do [1]. A basic problem in videoconferencing systems is their lack of support for eye contact [2]. In flat-screen-based solutions, this stems, for example, from the displacement (offset) between the display (showing a counterparty's face) and the camera (capturing the local participant's eyes) of a videoconferencing terminal. Note that although eye contact is one of the early goals of 3D telepresence solutions, it is still not supported in most existing telepresence systems.
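The geometric cause of the eye-contact problem can be illustrated with a small sketch. The function below (an illustrative helper, not from the cited works) computes the angular gaze error that the camera-display offset produces for a given viewing distance:

```python
import math

def gaze_offset_deg(camera_display_offset_m: float, viewing_distance_m: float) -> float:
    """Angular error between looking at the counterpart's on-screen eyes
    and looking into the camera lens (simple flat-screen geometry)."""
    return math.degrees(math.atan2(camera_display_offset_m, viewing_distance_m))

# A webcam mounted 10 cm above the on-screen eyes, viewed from 60 cm,
# yields roughly a 9.5 degree downward gaze error - easily noticeable.
angle = gaze_offset_deg(0.10, 0.60)
```

Reducing either the offset or increasing the viewing distance shrinks this angle, which is why small displays with closely mounted cameras (as in Hydra, below) can approximate eye contact.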

The Hydra system (**Figure 1**) is an early approach to supporting spatial faithfulness in telepresence [3, 4]. With a mesh of connections and a separate (proxy) terminal for each remote counterpart, it aims to support virtual lines-of-sight between participants. With small displays and small camera-display offset(s), participants are able to get approximate eye contact with each remote participant and, under certain conditions, even have a shared understanding of each other's relative positions.

#### **Figure 1.**

*a) VTT idea for a Hydra system using tablets for a communication space and a computer display for a collaboration space, b) corresponding connections between cameras and displays, c) practical implementation showing all contents on one display (cameras on top corners are indicated by red circles) [3].*

#### *Advances in Spatially Faithful (3D) Telepresence DOI: http://dx.doi.org/10.5772/intechopen.99271*

Note that perceiving eye contact does not imply full gaze awareness, i.e. the ability to also perceive other, intermediate gaze directions. In a traditional Hydra system, the terminals representing each remote party can be placed independently by each local participant, which easily results in inconsistent meeting geometries across meeting sites, and thus also in inconsistent positions and eye directions of the parties.

Note that, in principle, any display positions can support virtual lines-of-sight between participants. The situation can be compared to residents of private houses seeing their neighbors through windows, which, however, becomes increasingly unlikely the more neighbors wish to communicate with each other.

The easiest way to unify the virtual meeting geometry between participants is to position the proxy terminals, in the same relative order, at the vertices of a regular polygon (e.g. a triangle, square, pentagon, or hexagon). This naturally restricts the seating of participants more than in a face-to-face meeting.
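Such a unified polygon layout is straightforward to compute. The following sketch (our own illustration, not part of any cited system) places n participants at the vertices of a regular polygon in a shared coordinate frame, each facing the centre:

```python
import math

def polygon_seats(n: int, radius: float = 1.0):
    """Place n participants at the vertices of a regular polygon,
    all facing the centre, in a shared (unified) coordinate frame.
    Returns a list of (x, y, facing_angle) tuples."""
    seats = []
    for k in range(n):
        theta = 2 * math.pi * k / n
        x, y = radius * math.cos(theta), radius * math.sin(theta)
        facing = math.atan2(-y, -x)  # orientation towards the centre
        seats.append((x, y, facing))
    return seats

# Four sites -> a square layout; each site renders the other three
# proxy terminals at the remaining vertices, in the same relative order.
layout = polygon_seats(4)
```

Because every site uses the same vertex assignment, eye directions and relative positions stay consistent across all meeting sites.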

A more recent screen-based solution is Viewport by Zhang et al. [5]. In the Viewport system, high-quality 3D models are formed for each user in real time, then extracted and embedded into a common virtual geometry. Using 3D models enables correcting the camera-display offset and supporting depth perception through stereoscopy. The system supports eye contact between three sites, with one user at each site. Limiting the number of sites to only a few is, in particular, a factor restricting the usability of such solutions.

Natural perception of depth and distances is one of the factors of spatial faithfulness. Note that this is not possible using 2D displays, which lack depth, and strictly speaking not even with stereoscopic 3D (S3D) displays, whether multiplexed, polarized, or autostereoscopic, due to their inability to support natural focus/accommodation (they suffer from the so-called vergence-accommodation conflict, VAC [6]).
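The VAC can be made concrete with the basic stereoscopic geometry (a textbook relation, sketched here with an assumed typical interpupillary distance): the eyes accommodate to the screen, but on-screen disparity makes them converge to a different distance.

```python
def vergence_distance(screen_m: float, disparity_m: float, ipd: float = 0.063) -> float:
    """Distance at which the eyes converge for a given crossed on-screen
    disparity, by similar triangles: d = ipd * D / (ipd + disparity).
    ipd = interpupillary distance (0.063 m is a typical adult value)."""
    return ipd * screen_m / (ipd + disparity_m)

# The eyes always accommodate to the screen (say 0.6 m away), but a crossed
# disparity of 2 cm makes them converge to about 0.46 m; the mismatch
# between these two distances is the vergence-accommodation conflict.
d = vergence_distance(0.6, 0.02)
```

With zero disparity the two distances coincide and there is no conflict; the larger the depicted depth offset, the larger the mismatch.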

#### *2.1.1 Advances by XR technologies and computer games*

Note that spatial faithfulness is an inherent requirement in XR visualization, which aims to replace, in a seamless way, parts of a physical view with virtual elements (or vice versa). XR visualization can likewise be used for rendering models (cf. avatars) or visual reconstructions of human participants into a participant's view. Correspondingly, for more than two decades, the development of enablers for XR has also advanced 3D telepresence solutions. These enablers include, e.g., sensors for 3D capture, coding and streaming methods, low-latency networks, tracking and detection for XR, camera and user positioning, motion capture and tracking methods, new display technologies, and general advances in algorithms and processing power.

In the same way, the development of game technologies has advanced 3D telepresence solutions based on virtual modeling and rendering, here denoted as Virtual World (VW) approaches. Traditionally, in VW approaches, visual content has been modeled/produced in advance, and rendering of the content is based on real-time transfer of parameters for viewpoint and object positions, dynamic 3D shapes and poses (animation), etc. Although VW approaches are rather common and have their specific benefits, we omit their description here and focus on the rendering of photorealistic real-time captures. This focus is justified in more detail in Chapter 2.5.

A recent example of a photorealistic telepresence solution based on advanced 3D displays is Google Project Starline (https://blog.google/technology/research/project-starline/). A good example of a 3D telepresence system based on AR/VR (XR) visualization is MS Holoportation [7] (cf. https://www.youtube.com/watch?v=7d59O6cfaM0). Both of them are quite impressive but obviously also complicated and costly. In this article, we aim to define a more economical solution combining the good sides of both photorealism and XR visualization. Note that even a lone talking head on a screen - whether camera-captured or 3D-modeled - may well be a value-adding functionality. Namely, communication and engagement may be supported by using, e.g., a speech-controlled, look-alike, or anonymous virtual head (cf. https://remoteface.ai/ and a video at: https://www.youtube.com/watch?v=prpPqwV5Weo).

## **2.2 Remarks on mobility and serving with viewpoints**

Above, the Hydra system was described as an early attempt towards spatially faithful 3D telepresence. Using such a full-mesh approach, and by making restrictive assumptions about participant positions ("seating order"), all participants may perceive eye directions and participant positions consistently; the disturbing camera-display offset, however, remains unsolved. Furthermore, a regular setup with fixed camera and display positions naturally limits the mobility of participants within their meeting sites.

Note that although each participant's position is fixed, a whole meeting room with its occupant may move virtually. In [8], a solution is described for compiling the captures of regular sensor and display setups into a landscape, enabling participants to mingle, together with their meeting spaces, with each other, like people at a cocktail party (**Figure 2**). For example, a capture setup in a square or hexagonal formation can be used. The writers of this paper are, however, not aware of such an arrangement having been implemented and tested.

Let us consider a Hydra setup a little further. If a Hydra system with all its terminals were placed in one large hall or open space, a participant would be able to switch (walk) between the different sites and perceive spatial faithfulness in each of them separately, i.e. participant mobility would be supported at discrete locations.

User mobility may, in principle, be supported at any viewpoint when near-eye displays (NEDs) are used for viewing. Limitations set by fixed camera positions can be relieved using a setup of multiple 3D sensors, or multiple cameras in an array. For the latter, solutions based on wall-mounted camera arrays and moving cameras are described in [9, 10], respectively. Arbitrary viewpoints into remote spaces can be supported, provided that sufficiently complete and real-time 3D reconstructions of those spaces are available, or that one of the multiple cameras provides the viewpoint along a desired line-of-sight (**Figure 3**).

For serving viewpoints from varying positions, i.e. receiving viewpoints on-demand, the system needs to deliver participant positions between sites in unified coordinates (a unified geometry). Renderings of remote participants need to be compiled into each participant's view correspondingly. A viewer's head orientation needs to be detected and tracked to serve him/her with the correct part (frustum) of the compiled scene.

#### **Figure 2.**

*Fixed capture setups can support collaboration in dynamic 2D landscapes: a) capture setup in a hexagonal grid, b) captures arranged into a tessellated landscape, enabling c) moving with user spaces like people at a cocktail party [8] (image c) is a Creative Commons image by Lucas Maystre from Renens, Switzerland - 053/365: Apéro au forum).*

#### **Figure 3.**

*Idea of bringing real-world meeting sites into a common geometry: a) three sites with users (dots) captured by RGB-D cameras in local coordinates, b) lines-of-sight between users in a unified coordinate system, and c) supporting lines-of-sight (new viewpoints) for a moving participant.*
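The coordinate unification behind Figure 3 can be sketched in a few lines. Each site's pose in the unified frame (a 2D rigid transform, assumed here for illustration; the cited inventions do not prescribe this exact formulation) maps local user positions into common coordinates, from which lines-of-sight between any two participants follow directly:

```python
import math

def to_unified(site_pose, local_xy):
    """Map a local (x, y) position into unified coordinates, given the
    site's pose (tx, ty, heading) in the unified frame."""
    tx, ty, th = site_pose
    x, y = local_xy
    return (tx + x * math.cos(th) - y * math.sin(th),
            ty + x * math.sin(th) + y * math.cos(th))

def line_of_sight(viewer_xy, target_xy):
    """Bearing (radians) and distance from viewer to target in the unified
    frame - used to select a camera/viewpoint along this direction."""
    dx, dy = target_xy[0] - viewer_xy[0], target_xy[1] - viewer_xy[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)

# Two sites placed side by side in the unified frame (assumed poses):
a = to_unified((0.0, 0.0, 0.0), (1.0, 1.0))          # user at site A
b = to_unified((4.0, 0.0, math.pi / 2), (1.0, 0.0))  # user at site B
bearing, dist = line_of_sight(a, b)
```

When a participant moves, only his/her updated position needs to be redistributed; each receiving site re-derives the lines-of-sight and requests the matching viewpoints.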

The writers have disclosed several inventions using the above-described approach [8–10]. Note that instead of delivering visual information as large and bitrate-consuming 3D volumes, a visual stream may be a video, a stereoscopic video, or a video-plus-depth stream (V + D, [11]) that allows forming a stereoscopic view. When using the viewpoint-on-demand approach in this way, the bitrate requirement may be much lower than when streaming 3D volume data. Chapter 2.4 will describe the viewpoint-on-demand approach in more detail.
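A back-of-envelope comparison illustrates the bitrate argument. The figures below are purely illustrative assumptions (uncompressed rates, an arbitrary eight-view capture rig); real codecs reduce both sides by orders of magnitude, but the ratio between a single on-demand V + D stream and full multi-view volume capture is the point:

```python
def raw_bitrate_mbps(width, height, bits_per_pixel, fps):
    """Uncompressed bitrate of one stream in Mbit/s."""
    return width * height * bits_per_pixel * fps / 1e6

# One 1080p30 colour view (24 bpp) plus a 16-bit depth map per pixel:
vd = raw_bitrate_mbps(1920, 1080, 24 + 16, 30)

# Versus, say, eight such colour views needed to cover a capture volume
# for free viewpoint selection at the receiver:
volume = 8 * raw_bitrate_mbps(1920, 1080, 24, 30)
```

Serving only the currently needed viewpoint thus shifts viewpoint selection to the capture side and keeps the transmitted payload close to that of ordinary (stereoscopic) video.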

In summary, spatially faithful telepresence solutions aim to support real-world-like geometries among participants. However, as will be explained in Chapter 3.2, both virtual and photorealistic approaches may be feasible even without such support, and may even be much easier to implement. For example, perception of depth and motion parallax can be supported without forming and maintaining perfectly unified virtual meeting geometries.

#### **2.3 Transmitting and displaying volume videos (3D streams)**

Ideally, 3D reconstructions are coded, delivered, and displayed as real-time 3D streams. However, this is very challenging, e.g. due to the high computation power and very high bitrate required. The benefits of volume videos include more freedom in choosing one's viewpoint (cf. motion parallax and alternative viewpoints) and support for multiple local viewers. However, both capturing participant spaces and supporting viewing with glasses bring considerable complexity to this approach.

Viewing from various viewpoints may also be supported using multi-view streaming and display methods. However, without the capture and delivery of user positions (cf. knowledge of a mutual meeting geometry), users need to position themselves accurately at a priori specified locations, e.g. in order to perceive correct eye contact(s). Several systems use this approach, albeit simplified by reducing the supported viewing volume. Multi-view video coding methods and standards are available and applicable for this [12], but more advanced (real-time) 3D coding methods are still under development, e.g. by MPEG. Special 3D displays supporting different 3D viewpoints for multiple local viewers are already available.

In the future, advances in transmission and display (e.g. using light fields) will make real-time streaming and display of 3D volumes more feasible. An apparent benefit of such solutions is that simultaneous viewers can see the 3D content from their individual viewpoints, like in the real world.
