View Synthesis Tool for VR Immersive Video

*Sarah Fachada, Daniele Bonatto, Mehrdad Teratani and Gauthier Lafruit*

### **Abstract**

This chapter addresses the view synthesis of natural scenes in virtual reality (VR) using depth image-based rendering (DIBR). This method reaches photorealistic results as it directly warps photos to obtain the output, avoiding the need to photograph every possible viewpoint or to make a 3D reconstruction of the scene followed by ray-traced rendering. An overview of the DIBR approach is given, and frequently encountered challenges (disocclusion and ghosting artifacts, multi-view blending, handling of non-Lambertian objects) are described. Such technology finds applications in VR immersive displays and holography. Finally, a comprehensive manual of the Reference View Synthesis software (RVS), an open-source tool tested on open datasets and recognized by the MPEG-I standardization activities (where "I" refers to "immersive"), is provided for hands-on practice.

**Keywords:** DIBR, RVS, view synthesis, depth map, virtual reality, rendering, 3D geometry, light field, non-Lambertian

### **1. Introduction**

Photography has a vast history, as it is used to preserve our lives' most important memories. As such, it tries to conserve a scene as realistically as possible. Over the years, it evolved from the camera obscura [1, 2], where scenes were captured only for a brief moment, to black-and-white photography, which required subjects to stay still in front of the camera for long exposure times, to today's imaging devices, where the picture is captured instantaneously and digitized, with colors close to what our eyes perceive [3].

Though a photograph preserves the content of the scene, the immersion is lost, as well as the depth information, since the camera projects the scene from 3D to 2D.

To increase the immersion, the next step is to recreate the parallax of the scene, giving the viewer the opportunity to move freely and see different perspectives, exactly as if the subject were miniaturized in front of our eyes or the environment virtually rendered around us. Despite this desire, no device capable of directly acquiring a scene in its entirety in 3D has been designed so far.

Creating the parallax effect would require capturing the scene from all possible viewpoints and selecting the viewpoint to display on demand for the user's viewing position. This is physically impossible; instead, we may synthesize any viewpoint from only a couple of captured viewpoints, generating all missing information under some basic assumptions [4, 5].

#### **Figure 1.**

*DIBR brings photographs to 3D by using depth information to create new viewpoints. It preserves photorealism and allows the user to experience motion parallax.*

There exist many approaches to generate novel viewpoints from input views. Early methods were based on 3D reconstruction [6, 7], rendering the obtained 3D model. More recently, neural radiance fields (NeRF) [8] used machine learning methods to recreate a volumetric representation of the scene. Other methods avoid explicit 3D reconstruction, such as depth image-based rendering (DIBR) [9], which will be described in this chapter, or multiplane images [10, 11]. Finally, novel viewpoints can be synthesized by an intelligent interpolation using physical invariants (the epipolar plane image), rather than directly interpolating the images' colors. Representatives of this last category are the shearlet transform [12] and techniques using deep learning [13].

This chapter provides comprehensive elements to bring photographs of a natural scene to the third dimension, that is, making the captured scene immersive, through holography or virtual reality (VR). The presented 3D rendering technique differs from traditional computer graphics by its input: instead of modeling 3D objects with their geometry and materials that interact with light sources, we use photographs that are warped to follow the viewer's gaze direction, using the reference view synthesis (RVS) [14–16] software that follows the view synthesis process of **Figure 1**.

RVS has been developed during the exploration and standardization activities of MPEG-I – where "I" refers to Immersive – focusing on developing new compression and file formats for immersive video.

The chapter is structured as follows: the first part explains the principles of depth image-based rendering, gives an overview of the possible artifacts that can be encountered when creating a DIBR implementation, and describes the implementation details of RVS. The second part provides practical advice for using RVS on some example datasets.

### **2. Principles of depth image-based rendering**

To recreate the parallax effect, we use the depth image-based rendering [9] method. It warps or distorts the input color image as a function of its associated depth map, which stores, for every pixel, the distance between the camera and the projected point along the camera's optical axis. This method is based on the observation that a stereoscopic pair of images, taken for example with a shift of a few centimeters between them, carries the depth information of the photographed subject. As shown in **Figure 2**, the relative shift *d*, also known as the disparity, of foreground objects is larger than that of background objects.

#### **2.1 Projection equation and disparity**

Let us consider two pinhole cameras facing an object at distance *D* (see **Figure 2**). The projection of this object on each image will have a disparity *d*. Using the similar-triangle ratios $\frac{f}{D}$ and $\frac{d_1 + d_2}{B}$, we obtain:

$$d = \frac{B \times f}{D} \tag{1}$$

where *B* is the baseline, i.e., the distance between the two camera centers, and *f* their focal length.

This implies that, given two images and their depth maps, we can create a virtual view in the middle, between the inputs, by shifting the pixels by half their disparity.
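As a purely illustrative numerical example (values assumed, not taken from a specific dataset), consider a baseline $B = 0.1$ m, a focal length $f = 1000$ px, and an object at $D = 2$ m:

$$d = \frac{0.1 \times 1000}{2} = 50 \text{ px},$$

so synthesizing the middle view amounts to shifting that object by $d/2 = 25$ px.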

Eq. (1) can be generalized to any camera setting using the camera pose (translation and rotation: the extrinsic parameters, Eq. (2)) and the internal camera parameters (focal lengths $f_x$ and $f_y$ expressed in pixels in the *x* and *y* directions, and principal point $(pp_x, pp_y)$: the intrinsic parameters, Eq. (3)).

We call an input image and its camera parameters an *input* viewpoint. We aim to recreate a new virtual view with given new parameters, called the *target* viewpoint. For this, we deproject the pixels of the input image from 2D to 3D, and reproject them from 3D to 2D into the target image using the projection equation.

Let $P = (R|t)$ be the inverse (i.e., world-to-camera) $3 \times 4$ pose matrix of a camera, with $R$ the rotation matrix and $t$ the translation:

$$P = (R|t) = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix}, \tag{2}$$

and $K$ its $3 \times 3$ intrinsic matrix:

$$K = \begin{pmatrix} f\_x & 0 & pp\_x \\ 0 & f\_y & pp\_y \\ 0 & 0 & 1 \end{pmatrix}. \tag{3}$$

**Figure 2.** *Disparity $d = d_1 + d_2$ between two pixels representing the same projected point.*

In homogeneous coordinates, a point $X = (x, y, z, 1)^T$ at depth *D* from the input camera is projected to a pixel $p_{in} = (u, v, 1)^T$ following the projection equation [6]:

$$D p\_{in} = K\_{in} P\_{in} X \tag{4}$$

Hence, given the input image and the depth value of the pixel, we can deproject *X*:

$$X = \begin{pmatrix} R_{in}^{-1} & -R_{in}^{-1} t_{in} \\ \mathbf{0}^{T} & 1 \end{pmatrix} \begin{pmatrix} D \, K_{in}^{-1} p_{in} \\ 1 \end{pmatrix} \tag{5}$$

This finally allows us to reproject *X* into the new camera, using Eq. (4) with $P_{out}$ and $K_{out}$:

$$p_{out} \propto K_{out} P_{out} \begin{pmatrix} R_{in}^{-1} & -R_{in}^{-1} t_{in} \\ \mathbf{0}^{T} & 1 \end{pmatrix} \begin{pmatrix} D \, K_{in}^{-1} p_{in} \\ 1 \end{pmatrix} \tag{6}$$

To obtain the pixel position in the target image, we divide the resulting vector by its third coordinate (i.e., the depth of the point in the new camera).

Applying this operation to every pixel of the input image creates a novel view.

The core principle of DIBR is to apply this deprojection and reprojection to all the pixels of the input images, using a depth map (i.e., a single-channel image encoding the depth value of each pixel). RVS uses this basic principle, but of course, there are many pitfalls one should handle correctly. This is further explained in the following sections.
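As a minimal sketch of this per-pixel warp (Eqs. (4)–(6)), assuming NumPy and world-to-camera poses as in Eq. (2), the deprojection and reprojection of a single pixel can be written as follows. This illustrates the principle only; it is not the actual RVS implementation.

```python
import numpy as np

def warp_pixel(u, v, depth, K_in, R_in, t_in, K_out, R_out, t_out):
    """Deproject a pixel of the input view and reproject it into the target view.
    All poses are world-to-camera: x_cam = R @ x_world + t (Eq. 2)."""
    p_in = np.array([u, v, 1.0])                     # homogeneous pixel coordinates
    x_cam_in = depth * np.linalg.inv(K_in) @ p_in    # pixel -> input camera space
    x_world = R_in.T @ (x_cam_in - t_in)             # camera -> world (R^-1 = R^T)
    x_cam_out = R_out @ x_world + t_out              # world -> target camera space
    p_out = K_out @ x_cam_out                        # project with target intrinsics
    return p_out[:2] / p_out[2], p_out[2]            # pixel position and new depth
```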

#### **2.2 Frequent artifacts**

We now know the basic principles of DIBR. Unfortunately, simply shifting the pixels of an input image as a function of their depth does not create a photorealistic result.

The first problem is occlusion handling. When an object is visible in the input image but hidden by an object lying more in the foreground in the target view, it is occluded and its pixels should not appear in the rendered image. This can be solved by choosing, among all the pixels from various objects ending up at the same position on the screen, the pixel with the minimal depth. A more critical problem is that of disocclusions, that is, when an object should be visible in the target image but does not appear in the input image because it is hidden there (**Figure 3a**).

#### **Figure 3.**

*(a) Disocclusion artifact (Classroom dataset), (b) crack artifacts (Toystable dataset), (c) artifacts due to color inconsistency among the input images (Fencing dataset, courtesy of Poznan University of Technology), (d) ghosting artifacts (Toystable dataset).*

In that case, a hole is created in the rendered image. One solution is to add more input images in the hope of obtaining the missing information [15, 17]. Another approach is to inpaint the empty pixels [18, 19]. In RVS, it is possible to choose any number of viewpoints, and a basic inpainting fills the remaining disocclusions.

Cracks and dilation are other frequent DIBR artifacts. We can observe them in **Figure 3b**. They appear when the user moves forward (step-in), increases the resolution (zoom), or observes slanted objects. These cracks correspond to pixels in the target view that have no preimage in the input view (i.e., no input pixel is mapped to them). However, as their neighboring pixels have a preimage, their color can be interpolated. In other words, the input pixels should be mapped to more than one pixel to compensate for this effect. This can be done using superpixels [20], adapting the pixel size to the camera movement [21], or linking neighboring pixels for rasterization [15, 16] (the solution chosen in RVS: adjacent pixels are grouped into triangles that are colorized).

Even though increasing the number of input images can reduce the number of disocclusions, it brings new challenges, as those views need to be consistent in color, in estimated geometry, and in estimated pose. Notably, the depth estimation and the blending of multiple views rely on consistent colors between the images. As not all camera sensors are equal, small differences in color rapidly generate incoherent depth estimations or nonhomogeneous color patches during view blending (**Figure 3c**). Color correction is therefore usually needed prior to view synthesis [22, 23] or during the blending step [24, 25].

Moreover, as DIBR relies on the depth information, errors in the depth estimation, a misalignment between the color image and the depth map, or errors in the camera pose estimation lead to ghosting artifacts. When several views are blended together, these artifacts make the objects or their borders appear doubled (**Figure 3d**). A depth map refinement [26, 27] is one way to solve this problem. Another is to choose weighted blending coefficients based on the reliability of each input image [11, 16] (the solution chosen in RVS).

Finally, DIBR is structurally limited to the rendering of diffuse objects. Indeed, shifting the pixels as a function of their depth assumes that they do not present view-dependent effects, such as transparency or specularity. When such objects, so-called non-Lambertian objects, are present in the scene, the hypothesis of pixel displacement varying linearly with camera displacement is no longer valid. Adapting the DIBR principles to non-Lambertian objects is nevertheless possible by exploiting additional information, such as structure, normals, and indices of refraction [28], or a more accurate approximation of the pixel displacement [29–31] (the solution chosen in RVS).

#### **2.3 RVS in practice**

The DIBR software RVS is designed to render novel viewpoints from any number of input images and depth maps, together with their camera parameters, without suffering from the above limitations. In order to create a novel view, the input images are warped sequentially. Each warped result is then blended into an image accumulating the outputs of the reprojected input images. This pipeline is shown in **Figure 4**. The warping and blending operations are performed alternately for each input image, using OpenGL [32] or on the CPU [15].

To obtain high-quality results, it is recommended to select the candidate input views properly. Therefore, the first step in RVS is an optional view selection.

**Figure 4.**

*Overview of the processing pipeline. (1) view selection (optional), (2) warping, (3) blending, (4) Inpainting (optional).*

The *n* views closest to the target image are selected in order to reduce the computation time. Otherwise, all the input images are used to create the new viewpoint.

The second step in RVS is the warping phase. Each input image is divided into a grid of triangles whose vertices are adjacent pixels (**Figure 5a**). Each of these vertices is reprojected to fit the new camera pose and parameters (**Figure 5b**), and the triangles are rasterized to avoid the crack artifacts of **Figure 3b**. Each new triangle is then given a score that will be used in the blending phase. This score describes the quality of a warped triangle: if its pixels lie on a disocclusion area, the triangle will be stretched and should hence be discarded from the final result (black areas in **Figure 4**, warping, and **Figure 5c**). The remaining triangles are rasterized according to their vertex colors in the input image (**Figure 5c**). A depth test prioritizes the pixels with the lowest depth.
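The grid triangulation and the stretch-based rejection of triangles can be sketched as follows (a simplified illustration assuming NumPy; the exact quality metric used by RVS differs and is described in [16]):

```python
import numpy as np

def grid_triangles(h, w):
    """Group adjacent pixels into two triangles per grid cell (Figure 5a).
    Vertex k corresponds to pixel (k // w, k % w)."""
    idx = np.arange(h * w).reshape(h, w)
    tl, tr, bl, br = idx[:-1, :-1], idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]
    upper = np.stack([tl, tr, bl], axis=-1).reshape(-1, 3)
    lower = np.stack([tr, br, bl], axis=-1).reshape(-1, 3)
    return np.concatenate([upper, lower])

def triangle_scores(warped_xy, triangles):
    """Toy quality score: triangles strongly stretched by the warp (large area)
    most likely cover a disocclusion and receive a score close to zero."""
    a, b, c = (warped_xy[triangles[:, k]] for k in range(3))
    area = 0.5 * np.abs(np.cross(b - a, c - a))   # area of each warped triangle
    return 1.0 / (1.0 + area)                     # unstretched grid triangles: area 0.5
```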

When the input images are warped to the target viewpoint, the results need to be blended into one single image. For a given pixel, the final output color *c* is the weighted mean of the color *ci* of each warped input:

$$c = \frac{1}{\sum\_{i} w\_{i}} \sum\_{i} w\_{i} c\_{i} \tag{7}$$

**Figure 5.**

*Adjacent pixels of an input image (a) are grouped into triangles independently of their depth before being reprojected to their new image location (b). Triangles detected as lying on a disocclusion are discarded, resulting in a new warped image (c).*


where $w_i$ is a weight representing the quality of a triangle [16], prioritizing foreground objects and the highest-quality triangles.
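A compact sketch of this blending step (Eq. (7)), assuming NumPy arrays with the warped views stacked along the first axis and weights set to zero where a view has no valid pixel:

```python
import numpy as np

def blend_views(colors, weights, eps=1e-8):
    """Weighted per-pixel blending of the warped inputs (Eq. 7).
    colors: (n, h, w, 3) warped images; weights: (n, h, w) quality weights."""
    num = np.einsum('nhw,nhwc->hwc', weights, colors)
    den = np.sum(weights, axis=0)[..., None]
    return num / np.maximum(den, eps)   # pixels with zero total weight stay black
```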

Finally, as shown in the inpainting step of **Figure 4**, when multiple views are blended together, several disoccluded regions may remain; if they are small enough, a basic inpainting process can be applied to remove them. Of course, the quality of the inpainting can compromise the overall image quality; hence, inpainting large regions is not recommended. In RVS, the inpainting is therefore not automatic but can be activated. In that case, the empty pixels take the color of the nearest non-empty pixel.
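For illustration, a nearest-pixel fill of the remaining holes can be sketched with SciPy's Euclidean distance transform; this mimics the behavior described above, although RVS's own inpainting may differ in its neighbor search:

```python
import numpy as np
from scipy import ndimage

def inpaint_nearest(color, valid):
    """Fill empty pixels with the color of the nearest non-empty pixel.
    color: (h, w, 3) blended image; valid: (h, w) boolean mask of filled pixels."""
    # For each empty pixel, find the indices of the closest valid pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(~valid, return_indices=True)
    return color[iy, ix]
```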

#### *2.3.1 Non-Lambertian case*

In the general case, DIBR uses depth maps to predict the pixel displacement. However, a point on a non-Lambertian surface does not have an intrinsic color (its apparent color can change rapidly with the viewing direction); its appearance is a function of the surrounding scene, the normal at that point, and the index of refraction for refractive surfaces (see **Figure 6**). This not only makes depth estimation through stereo matching impossible, but also implies that even with a correct depth map, the object cannot be rendered by a simple pixel shift.

Alternatively to modeling the non-Lambertian surface itself, it is possible to track the movement of its features across the surface [29, 33, 34]. DIBR can be generalized to non-Lambertian objects by replacing the usual depth maps with the coefficients of a polynomial approximating the displacement of the non-Lambertian features [30, 31]. To clearly understand what this means, let us start with what happens for diffuse objects, where, for a lateral camera movement $(x, y)$, the new position $(u, v)$ of a pixel $(u_0, v_0)$ is given by:

$$
\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} + \frac{f}{D} \begin{pmatrix} x \\ y \end{pmatrix}. \tag{8}
$$

#### **Figure 6.**

*The appearance of non-Lambertian objects is view-dependent: their surface does not appear the same color from every viewing direction.*

We extend this equation for non-Lambertian objects using polynomials:

$$
\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} + \begin{pmatrix} P_u(x, y) \\ P_v(x, y) \end{pmatrix}, \tag{9}
$$

with $P_u(x, y) = \sum_i \sum_j a_{ij} x^i y^j$ and $P_v(x, y) = \sum_i \sum_j b_{ij} x^i y^j$. Clearly, the diffuse case corresponds to $a_{1,0} = b_{0,1} = \frac{f}{D}$ and all other coefficients $a_{ij}$, $b_{ij}$ set to zero.

Consequently, Eq. 9 approximates by a polynomial the nonlinear displacement of a refracted or reflected feature moving on non-Lambertian objects.
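As a small sketch of Eq. (9), with the coefficients and the camera displacement given as plain Python values purely for illustration:

```python
def displaced_position(u0, v0, x, y, a, b):
    """Eq. (9): new position of a feature for a lateral camera motion (x, y),
    with a[i][j] and b[i][j] the polynomial coefficients of P_u and P_v.
    The diffuse case (Eq. 8) is recovered with a[1][0] = b[0][1] = f / D, rest 0."""
    p_u = sum(a[i][j] * x**i * y**j for i in range(len(a)) for j in range(len(a[i])))
    p_v = sum(b[i][j] * x**i * y**j for i in range(len(b)) for j in range(len(b[i])))
    return u0 + p_u, v0 + p_v
```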

However, the polynomial expression rapidly diverges in extrapolation (e.g., when synthesizing a target view that lies outside of the input images' hull). The computed feature displacement becomes greater than the inverse of the non-Lambertian object's depth, causing the feature to be rendered outside of the non-Lambertian surface. This approximation is hence designed for interpolation and small extrapolations only.

Furthermore, these polynomials are not directly related to the physical properties of the non-Lambertian object. Hence, contrary to the simple relation linking the depth to the disparity of a diffuse object (cf. **Figure 2** and Eq. (1)), the polynomials of Eq. (9) do not give the object's geometry or index of refraction.

The polynomial is rather designed to "track" non-Lambertian features that move nonlinearly across the input images. It nevertheless encounters the following limitations. For content with semi-transparent objects, the maps should be divided into several layers before applying the polynomial or depth image-based rendering. Scenes with glints and glossiness make it difficult to track features on their surfaces, often leading to a failure case of the proposed method.

### **3. Reference view synthesis (RVS) software**

This section provides practical recommendations for the use of the reference view synthesizer (RVS) [14–16, 32, 35] (https://gitlab.com/mpeg-i-visual/rvs) developed as a DIBR-based view synthesizer for the MPEG immersive video (MIV) standard (https://mpeg-miv.org). Without further details on the compression and storage of immersive content [36], we give a comprehensive method to practically use the software on some test sequences (also provided to the MPEG community while developing RVS).

The following paragraphs give documentation on the image format, the axis system, and the data structure to synthesize new viewpoints from available ready-to-use datasets and/or new content users may provide.

#### **3.1 Input images**

RVS can accept any number of input images with depth maps, the only limitation being the computer memory. Each input color image must be provided along with its corresponding depth map.

#### *3.1.1 Color images*

The color images can be encoded on three RGB color channels, with 8-bit integers each, in any image format readable by OpenCV, for example, PNG or JPEG format.

Additionally, raw YUV images can be used, with a bit depth of 8, 10, or 16 bits. In this case, multi-frame raw video can also be used, the view synthesis being applied to all specified frames.

#### *3.1.2 Depth maps*

The depth maps represent the depth coordinate of every point in the image along the forward axis of the camera. Similar to the color images, they can be provided in different formats. They have to match the resolution of the input images, but they use only one channel.

The first option is to use the OpenEXR format. In that case, the software reads the depth value as a float and uses it directly for the reprojection.

In the case of integer-coded formats, such as YUV or PNG, the precision (the bit depth) can be set to 8, 10, or 16 bits per depth value. YUV files can be encoded in YUV420 or YUV400 format, only the Y channel being used. However, the quantization does not allow the integer to be used directly as a depth value. Indeed, it would be impossible to use a depth map in meter units for objects lying in the range of a few meters or centimeters from the cameras.

To overcome this problem, the depth value is encoded into MPEG's disparity format, mapping the closest object to $2^{\text{bitdepth}} - 1$ and the farthest to 1. To obtain the actual depth value, we first divide the encoded depth map value by $2^{\text{bitdepth}} - 1$ to obtain a value $d$ in the range $[0, 1]$, then remap the value to the range $[n, f]$ using:

$$d' = \frac{f \times n}{n + d \times (f - n)}.\tag{10}$$

with $n$ and $f$ the near and far values of the scene, and $d'$ the resulting depth value lying in $[n, f]$.

For very far objects ($f \geq 1000$), this equation simplifies to

$$d' = \frac{n}{d} \tag{11}$$

The value 0 in the encoded depth maps is considered invalid depth. It corresponds, for example, to disocclusions in a depth map acquired by a depth-sensing device.
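A minimal decoding sketch for such integer-coded depth maps (Eqs. (10)–(11)), assuming NumPy and the value 0 marking invalid pixels:

```python
import numpy as np

def decode_depth(raw, bit_depth, near, far):
    """Convert an integer-coded depth map (MPEG disparity convention) to metric
    depth: 2**bit_depth - 1 maps to `near`, 1 maps (approximately) to `far`."""
    d = raw.astype(np.float64) / (2**bit_depth - 1)    # normalize to [0, 1]
    depth = (far * near) / (near + d * (far - near))   # Eq. (10)
    depth[raw == 0] = np.nan                           # 0 = invalid depth
    return depth
```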

**Figure 7.**

*Encoded depth map on integer values. Due to the shift between the color sensor and the depth sensor, the depth map reprojected to the color image misses some information, leaving invalid pixels, encoded on 0. The foreground objects are encoded on high disparities, while the background objects are encoded on low disparities.*

**Figure 7** shows an encoded depth map with invalid pixels and objects at different depths. Clearly, the foreground has high values, consistent with a disparity encoding, that is, the inverse of a depth, cf. Eq. (11).

In the case of polynomial maps for non-Lambertian objects, it is possible to encode polynomials up to degree 3, hence 18 coefficients, and to pass an additional depth map and a mask for the non-Lambertian objects. These coefficients are encoded similarly to the depth maps, using the EXR (directly the float value) or YUV (normalized) format. The polynomial maps are numbered from 0 to 19 as follows.

$$P_u(x, y) = a_0 x^3 + a_1 x^2 y + a_2 x y^2 + a_3 y^3 + a_4 x^2 + a_5 x y + a_6 y^2 + a_7 x + a_8 y, \tag{12}$$

$$P_v(x, y) = b_0 x^3 + b_1 x^2 y + b_2 x y^2 + b_3 y^3 + b_4 x^2 + b_5 x y + b_6 y^2 + b_7 x + b_8 y,$$

with $a_i$ corresponding to map $i$ and $b_i$ to map $10 + i$. The remaining map 9 encodes the depth map for Lambertian objects, and map 19 is used as a mask identifying non-Lambertian objects (0 for Lambertian, 1 for non-Lambertian). Unused coefficients are left at 0. If the coefficients are encoded in YUV format, the depth (map no. 9) is normalized using Eq. (10), the mask (map no. 19) takes the values 0 and 1, and the other coefficients are linearly normalized between minimal $m$ and maximal $M$ values: $a_i' = (M - m) a_i + m$.
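The per-pixel evaluation of these maps can be sketched as follows, assuming the 20 maps are already decoded to floating point and stacked in a NumPy array of shape (20, h, w); this is an illustration, not RVS's internal data layout:

```python
import numpy as np

def non_lambertian_displacement(maps, x, y):
    """Evaluate Eq. (12) from the coefficient maps: maps[0..8] = a_0..a_8,
    maps[9] = Lambertian depth, maps[10..18] = b_0..b_8, maps[19] = mask."""
    # Monomials ordered as in Eq. (12): x^3, x^2 y, x y^2, y^3, x^2, x y, y^2, x, y.
    mono = np.array([x**3, x**2 * y, x * y**2, y**3, x**2, x * y, y**2, x, y])
    du = np.tensordot(mono, maps[0:9], axes=1)     # P_u(x, y), per pixel
    dv = np.tensordot(mono, maps[10:19], axes=1)   # P_v(x, y), per pixel
    non_lambertian = maps[19] > 0.5                # 1 = non-Lambertian pixel
    return du, dv, non_lambertian
```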

#### **3.2 Camera parameters**

In addition to the input images, the camera parameters must be known to create a novel view with DIBR and RVS. The extrinsic parameters describe the position and rotation of the camera (Eq. 2), while the intrinsic parameters describe the projection matrix (Eq. 3). Both perspective and equirectangular projections are supported, each requiring a slightly different description, as explained hereafter.

#### *3.2.1 Extrinsic parameters*

Common graphics processing software and APIs, such as Blender [37], COLMAP [38], OpenGL [39], and Vulkan [40], specify their own coordinate systems, often with different axes, directions, and image coordinates. Transferring data from one application to another therefore requires several coordinate transformation steps, which are summarized here. We use the Omnidirectional Media Format (OMAF) [41] coordinate system of MPEG-I, combined with yaw-pitch-roll angles.

OMAF is the first industry standard for VR. It specifies the coordinate system used in VR applications, the projection and rectangular region-wise packing methods, the metadata storage, encapsulation, signaling, and streaming of omnidirectional data, and finally the media profiles and presentation profiles. For these reasons, it has been adopted in the camera configuration files of RVS.

The OMAF coordinate system is described in **Figure 8**. The axes are defined as follows:

• X: Back-to-front, forward

• Y: Lateral, left

• Z: Vertical, up

**Figure 8.** *The omnidirectional media format coordinate system.*

The rotations in degrees are defined with the yaw-pitch-roll convention:

• Yaw: rotation around the Z axis

• Pitch: rotation around the Y axis

• Roll: rotation around the X axis

A camera facing forward has all its rotation angles set to 0. The rotation matrix of the camera (world to camera) in our axis coordinate system is then given by:

$$R = R_z(\text{yaw}) R_y(\text{pitch}) R_x(\text{roll}) \tag{13}$$

In order to transform a coordinate system from an application to OMAF, one needs to define the coordinate change matrix that matches the three axes, for example:

$$P = \begin{pmatrix} 0 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix} \tag{14}$$

This matrix sets $(x', y', z')$ *(OMAF)* $= (-z, x, -y)$ *(application)*; that is, it represents a coordinate system with the axes (left, down, backward). To transfer from this system to OMAF, we apply it to the rotation and position as follows:

$$\begin{aligned} R' &= P.R.P^T\\ p' &= P.p \end{aligned} \tag{15}$$

**Figure 9.** *Intrinsic parameters of the camera for (a) a pinhole projection, (b) an equirectangular projection.*

where $R'$ and $p'$ are the rotation and position in the OMAF system, while $R$ and $p$ are the rotation and position in the old coordinates.
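The rotation of Eq. (13) and the axis change of Eqs. (14)–(15) can be sketched as follows (NumPy, angles in degrees); the matrix P shown is the example of Eq. (14) and must be adapted to the source application's actual axis convention:

```python
import numpy as np

def rotation_omaf(yaw, pitch, roll):
    """Eq. (13): rotation from yaw (around Z, up), pitch (around Y), roll (around X)."""
    y, p, r = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(y), -np.sin(y), 0], [np.sin(y), np.cos(y), 0], [0, 0, 1]])
    Ry = np.array([[np.cos(p), 0, np.sin(p)], [0, 1, 0], [-np.sin(p), 0, np.cos(p)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(r), -np.sin(r)], [0, np.sin(r), np.cos(r)]])
    return Rz @ Ry @ Rx

# Example axis change of Eq. (14): application axes (left, down, backward) -> OMAF.
P = np.array([[0, 0, -1],
              [1, 0, 0],
              [0, -1, 0]], dtype=float)

def to_omaf(R_app, p_app):
    """Eq. (15): transfer a rotation and a position into the OMAF system."""
    return P @ R_app @ P.T, P @ p_app
```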

The unit of the coordinate system has no prerequisite, but it must match the unit used in the depth maps.

RVS handles any number of input and target cameras, each of which can have its own parameters and projection type. In the case of a stereoscopic head-mounted display for VR, two target views (one for each eye) need to be synthesized, with a relative position corresponding to the interpupillary distance, usually given by the headset's framework along with the intrinsic parameters.

#### *3.2.2 Intrinsic parameters*

The intrinsic parameters can be defined for perspective or equirectangular projections. In both cases, the resolution needs to be specified.

For perspective projection, the input images need to be undistorted. In that case, only the focal length and the principal point need to be specified. These values are in pixel units, the sensor size corresponding to the image resolution. The focal lengths are given by $(f_x, f_y)$, corresponding to the horizontal and vertical axes. The principal point $(pp_x, pp_y)$ is defined from the top-left corner of the image, as described in **Figure 9(a)**. A principal point at the center of the image has a value of half the resolution.

In the case of equirectangular projection (**Figure 9b**), the horizontal and vertical viewing ranges must be specified in degrees. For a full 360° panoramic image, the horizontal range is $[-180, 180]$ and the vertical range is $[-90, 90]$. For a 180° image, the horizontal and vertical ranges are $[-90, 90]$.

#### *3.2.3 Camera file*

The image specifications and camera parameters are specified in a *json* file with informative headers. An example with a perspective and an equirectangular camera is given here.

**Listing 1.1** Camera calibration file, cameras.json.
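Below is an illustrative sketch of such a file with one perspective and one equirectangular camera. The field names and values are indicative only, following our reading of the configuration files distributed with RVS and the MPEG test sequences; they should be checked against the examples shipped with the software.

```json
{
  "Version": "2.0",
  "Content_name": "example",
  "Fps": 30,
  "Frames_number": 1,
  "cameras": [
    {
      "Name": "v0",
      "Projection": "Perspective",
      "Position": [0.0, 0.0, 0.0],
      "Rotation": [0.0, 0.0, 0.0],
      "Resolution": [1920, 1080],
      "Focal": [1546.74, 1546.74],
      "Principle_point": [960.0, 540.0],
      "Depth_range": [0.3, 5.0],
      "BitDepthColor": 8,
      "BitDepthDepth": 16,
      "ColorSpace": "YUV420",
      "DepthColorSpace": "YUV420"
    },
    {
      "Name": "v1",
      "Projection": "Equirectangular",
      "Position": [0.0, -0.1, 0.0],
      "Rotation": [0.0, 0.0, 0.0],
      "Resolution": [4096, 2048],
      "Hor_range": [-180.0, 180.0],
      "Ver_range": [-90.0, 90.0],
      "Depth_range": [0.5, 25.0],
      "BitDepthColor": 10,
      "BitDepthDepth": 16,
      "ColorSpace": "YUV420",
      "DepthColorSpace": "YUV420"
    }
  ]
}
```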

An optional parameter, DisplacementMethod, can be set to Polynomial instead of the default value Depth to specify that, instead of a depth map (Eq. 10), RVS reads a displacement map (Eq. 12). In that case, similarly to Depth_range, a Multi_depth_range can be specified for the polynomial coefficients in YUV format.
