**3. Binaural Ambisonics-based reverberation and spatial resolution**

The spherical harmonics framework (known as Ambisonics in the context of audio production) makes it possible to express a sound field as a continuous function on a sphere around the listener. Ambisonics sound fields are typically generated from microphone array recordings [73] or plane-wave-based simulations. Alternatively, it is often convenient to measure or simulate an Ambisonics RIR that can be convolved with any anechoic audio signal to generate the sound field, e.g., as in [74]. Once encoded in the Ambisonics domain, a sound field can be mirrored, warped or rotated around the listener through inexpensive algebraic operations [56]. Additionally, its spatial resolution can be modified, which allows computational costs in the rendering process to be reduced in exchange for potential perceptual degradation [72, 75, 76].
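As a minimal illustration of these algebraic operations, the sketch below encodes a mono signal into first-order Ambisonics and mirrors the sound field left-right by negating a single channel. It assumes the ACN channel ordering with SN3D-normalised real spherical harmonics; `foa_encode` is a hypothetical helper, not a function from any particular library.

```python
import numpy as np

def foa_encode(signal, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (ACN/SN3D)."""
    w = signal                                         # omnidirectional component
    y = signal * np.sin(azimuth) * np.cos(elevation)   # left-right dipole
    z = signal * np.sin(elevation)                     # up-down dipole
    x = signal * np.cos(azimuth) * np.cos(elevation)   # front-back dipole
    return np.stack([w, y, z, x])                      # ACN ordering: W, Y, Z, X

sig = np.random.randn(4800)
foa = foa_encode(sig, np.deg2rad(30), 0.0)
# Left-right mirroring is just a sign flip of the Y channel:
mirrored = foa * np.array([1, -1, 1, 1])[:, None]
```

Because mirroring is a per-channel scaling, it costs one multiply per sample per channel, regardless of how complex the encoded scene is.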

When a sound field is encoded in the Ambisonics domain, its spatial resolution is defined by its inherent 'truncation order', which is an integer equal to or greater than zero. Higher-order signals have a larger number of channels and produce binaural renderings with finer spatial resolution and sound sources that are easier to localise, while lower-order signals are more lightweight (fewer channels) and produce renderings with lower resolution and 'blurry' sources (see **Figure 2**). This was shown by Avni et al. [77], who argued that truncating the order of an Ambisonics signal affected the perception of spaciousness and timbre in the resulting binaural signals. Later, Bernschütz [66] reported that, in perceptual evaluations, listeners could not generally detect differences in binaural signals rendered from Ambisonics sound fields of order 11 and above. Then, Ahrens and Andersson [74] showed that an order of 8 might be sufficient to simulate lateral sound sources that are indistinguishable from BRIR-based renderings, but slight differences were perceived up to order 29 for frontal sound sources.
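The channel count grows quadratically with the truncation order, and reducing the order of an ACN-ordered signal amounts to discarding its higher-order channels. A small sketch of this relationship (helper names are illustrative):

```python
import numpy as np

def num_channels(order):
    """Number of Ambisonics channels for a given truncation order."""
    return (order + 1) ** 2

def truncate_order(ambi, new_order):
    """Reduce the order of an ACN-ordered Ambisonics signal by simply
    discarding its higher-order channels."""
    return ambi[: num_channels(new_order)]

hoa = np.random.randn(num_channels(4), 1024)   # order-4 signal: 25 channels
foa = truncate_order(hoa, 1)                   # order-1 version: 4 channels
```

Going from order 29 to order 8, for example, reduces the channel count from 900 to 81, which is why truncation is such an attractive lever for computational cost.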

It has also been shown that the relation between spatial order and perceived quality depends on the 'decoding' method that is used to translate the Ambisonics sound field to a pair of binaural signals. For instance, the time-alignment method [78] and the magnitude least squares (MagLS) method [79] have both been shown to produce more accurate binaural signals at lower spatial orders than other approaches, such as the widely used virtual loudspeakers method [80]. In the case of MagLS, which focuses on minimising magnitude errors (disregarding phase) at high frequencies, Sun [81] showed that a conceptually similar method was able to produce binaural signals that were indistinguishable from a high-order reference at orders as low as 14.
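The core idea behind MagLS-style decoding can be sketched in a few lines: below a cutoff frequency, the decoder is fitted to the complex HRTFs by least squares; above it, only the HRTF magnitudes are matched, with phase carried over from the previous frequency bin. The function below is a simplified, hypothetical sketch of this idea, not the exact algorithm of [79]; array shapes and the cutoff value are assumptions for illustration.

```python
import numpy as np

def magls_decoder(hrtfs, Y, freqs, f_cut=2000.0):
    """Sketch of a MagLS-style Ambisonics-to-binaural decoder.

    hrtfs : (n_freqs, n_dirs, 2) complex HRTFs on a direction grid
    Y     : (n_dirs, n_ch) real spherical-harmonic matrix for that grid
    Returns per-frequency decoding matrices of shape (n_freqs, n_ch, 2).
    """
    Y_pinv = np.linalg.pinv(Y)
    n_freqs, _, n_ears = hrtfs.shape
    D = np.zeros((n_freqs, Y.shape[1], n_ears), dtype=complex)
    for k, f in enumerate(freqs):
        if f < f_cut or k == 0:
            D[k] = Y_pinv @ hrtfs[k]            # complex least-squares fit
        else:
            phase = np.angle(Y @ D[k - 1])      # phase from previous bin
            target = np.abs(hrtfs[k]) * np.exp(1j * phase)
            D[k] = Y_pinv @ target              # magnitude-only fit
    return D
```

Relaxing the phase constraint at high frequencies is what lets MagLS achieve accurate magnitude responses at orders where a full complex fit would be over-constrained.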

Overall, previous studies have suggested that binaural signals can be accurately rendered from Ambisonics sound fields as long as the truncation order is high enough, probably somewhere between 8 and 29. However, such orders may still be too high to be computationally efficient (the number of channels of an Ambisonics signal is proportional to the square of its truncation order) or simply unfeasible in practice (commercially available microphone arrays operate at order 4 or lower). The remainder of this section discusses some recent perceptual studies that explored how the binaural rendering of reverberant sound fields is affected when simplifications are applied in the Ambisonics domain, e.g., reducing the truncation order of different parts of the RIR.

*Reverberation and its Binaural Reproduction: The Trade-off between Computational… DOI: http://dx.doi.org/10.5772/intechopen.101940*

**Figure 2.**

*Room impulse response encoded in the Ambisonics domain at different truncation orders (0 to 4), for a source placed in front of the listener. Data are plotted as sound pressure (in decibels relative to the peak value) along the time axis and over different azimuth angles on the horizontal plane. Source: Engel et al. [76] ('trapezoid' room).*

#### **3.1 Hybrid Ambisonics**

A recent listening experiment by Lübeck et al. [75] showed that early reflections and late reverberation may be encoded in Ambisonics at a significantly lower order than the direct sound and still produce binaural signals that are indistinguishable from a BRIR-based rendering. The reason why this may happen is illustrated in **Figure 2**, which shows an RIR encoded in Ambisonics at different truncation orders. The lowest order (0) produces an isotropic signal that does not vary across directions in the horizontal plane, while higher orders achieve a more faithful representation of the sound field by allowing for spatially 'sharper' patterns, e.g., note how the direct sound becomes narrower as the order increases, converging towards a spatial Dirac delta. Looking at this figure, it becomes apparent that the earlier parts of the RIR (blue) are more sensitive to spatial resolution changes due to order truncation than the late reverberation (green), which is less directional.

According to these observations, it is reasonable to propose an Ambisonics-based binaural rendering method that employs a high truncation order for the direct sound (and, possibly, some early reflections) and lower orders for the rest of the RIR. Such a method could be highly efficient, given that late reverberation usually accounts for the majority of the duration of the RIR. This approach, reminiscent of the hybrid models discussed earlier, has tentatively been coined 'hybrid Ambisonics'.

A perceptual study by Engel et al. [76] evaluated binaural signals generated with hybrid Ambisonics and the virtual loudspeaker method, and found that an order between 2 and 3 (depending on the room) may be enough to render reverberation, assuming that the direct sound path is accurately reproduced through convolution with HRIRs (see **Figure 3**). This is a promising precedent for future efficient binaural rendering methods, although further investigations would be needed to generalise these results to a wider selection of rooms and stimulus types. In the future, a more general model could estimate the required truncation order adaptively from the Ambisonics signal (e.g., by measuring its directivity over time), which could be used in efficient binaural renderers or as a way to compress spatial audio data.
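The splitting step at the heart of hybrid Ambisonics can be sketched as follows: the RIR is divided at a transition time, the direct part keeps its full order, and the reverberant part is truncated. The split point and orders below are illustrative defaults, not the values used in [76], and `hybrid_truncate` is a hypothetical helper.

```python
import numpy as np

def hybrid_truncate(ambi_rir, fs, split_ms=5.0, late_order=1):
    """Split an ACN-ordered Ambisonics RIR into a full-order direct part
    and an order-truncated reverberant part."""
    split = int(fs * split_ms / 1000)
    direct = ambi_rir[:, :split]                       # keep all channels
    late = ambi_rir[: (late_order + 1) ** 2, split:]   # keep low orders only
    return direct, late

rir = np.random.randn(25, 48000)   # order-4 RIR, 1 s at 48 kHz
direct, late = hybrid_truncate(rir, 48000)
```

Since the reverberant tail dominates the RIR's duration, most of the convolution workload then runs at the low order, which is where the efficiency gain comes from.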

#### **3.2 Reverberant virtual loudspeaker (RVL)**

In real-time interactive binaural simulations, RIRs are typically recomputed whenever there is a change in the scene, such as a movement of the listener or of a source. When working in the Ambisonics domain, this recomputation is not needed to simulate a head rotation, as the signal can be efficiently rotated via a rotation matrix ([56], Section 5.2.2). However, translational movements of either the listener or a source still require recomputing the RIRs. As a result, the number of sources that can be rendered simultaneously in a low-cost scenario might be limited.
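For first-order signals, such a rotation matrix is particularly simple: a rotation about the vertical axis only mixes the two horizontal dipole channels. The sketch below assumes the ACN ordering (W, Y, Z, X); `foa_yaw_rotation` is an illustrative helper.

```python
import numpy as np

def foa_yaw_rotation(yaw):
    """Rotation matrix for a first-order Ambisonics signal (ACN order:
    W, Y, Z, X) about the vertical axis; W and Z are unaffected."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0,   c, 0.0,   s],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0,  -s, 0.0,   c]])

# Rotating the encoded field is a single matrix product per sample block:
# rotated = foa_yaw_rotation(angle) @ foa_signal
```

Higher orders use block-diagonal rotation matrices of growing size, but the cost remains a matrix product, far cheaper than recomputing the RIR.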

In such cases, it may be beneficial to employ a rendering method that scales well with the number of sources. One such example is the reverberant virtual loudspeaker (RVL) method, an Ambisonics-based approach that has the advantage of requiring a fixed number of real-time convolutions regardless of the number of sources [72, 76, 83]. This method takes inspiration from the virtual loudspeakers approach [71, 80], which decodes an Ambisonics sound field to a virtual loudspeaker grid around the listener and convolves the resulting signals with HRIRs to generate the binaural output. RVL performs this same process but, instead of HRIRs, the virtual loudspeaker signals are convolved with BRIRs, so the acoustics of the room are effectively integrated into the binaural rendering without the need for additional steps. Therefore, the number of real-time convolutions depends only on the truncation order of the sound field, independently of the number of rendered sources. For this reason, RVL is highly efficient at rendering a large number of sources in real time (see **Figure 4**). Its main limitation is that the room is head-locked because the set of BRIRs is fixed, so head rotations may lead to inaccurate reflections, as shown in **Figure 5**.

#### **Figure 3.**

*Perceptual ratings of binaural renderings generated from the hybrid-Ambisonics RIRs of orders 0 to 4 shown in Figure 2, where the direct sound was reproduced via convolution with a single HRIR. A dry rendering was used as the anchor signal and the 4th-order signal as the reference. The vertical dotted lines indicate that the groups on the left are significantly different (p < 0.05) from the groups on the right. Source: Engel et al. [76].*
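The RVL signal flow described above can be sketched as follows, assuming an ACN-ordered Ambisonics mix of all sources and a pre-computed decoding matrix. `rvl_render` is a hypothetical helper that uses direct convolution for clarity; a real-time implementation would use partitioned or overlap-add convolution [82].

```python
import numpy as np

def rvl_render(ambi_mix, D, brirs):
    """Sketch of reverberant-virtual-loudspeaker rendering.

    ambi_mix : (n_ch, n_samp) Ambisonics mix of all sources
    D        : (n_ls, n_ch) Ambisonics-to-loudspeaker decoding matrix
    brirs    : (n_ls, 2, n_taps) BRIR pair per virtual loudspeaker
    The number of convolutions is 2 * n_ls, independent of how many
    sources were encoded into `ambi_mix`.
    """
    ls_sigs = D @ ambi_mix                 # decode to virtual loudspeaker feeds
    n_samp, n_taps = ambi_mix.shape[1], brirs.shape[2]
    out = np.zeros((2, n_samp + n_taps - 1))
    for ls in range(D.shape[0]):
        for ear in range(2):
            out[ear] += np.convolve(ls_sigs[ls], brirs[ls, ear])
    return out
```

Because every source is first encoded into the shared Ambisonics mix (a cheap per-source operation), only the fixed decode-and-convolve stage runs at BRIR length, which is what makes the method scale so well.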

RVL was perceptually evaluated in [76], paying particular attention to its effect on head rotations. For the assessment, the method was applied only to the reverberant sound (the direct sound was generated through convolution with HRIRs) and the implementation was based on the 3D Tune-In Toolkit spatial audio library [84]. Listeners compared RVL to first-order hybrid Ambisonics renderings (both head-tracked) of speech and music, answering the question 'Considering the given room [shown in a picture], which example is more appropriate?'. Results suggested that the inaccurate head rotations could indeed be detected by listeners but were not necessarily perceived as a degradation in quality with respect to the more accurate rendering: note the bimodal distribution shown in **Figure 6**, which indicates that there was no unanimous preference towards either rendering.

One could speculate that the RVL method was preferred by some listeners because the BRIR-based rendering leads to highly uncorrelated binaural signals, which are typically associated with higher perceived quality when evaluating late reverberation (see the binaural quality index by Beranek [9]). A further investigation could compare the RVL method to other approaches that specifically aim to optimise interaural coherence, such as the covariance constraint method proposed by Zaunschirm et al. [78] and described by Zotter and Frank ([56], Section 4.11.3).

#### **Figure 4.**

*Comparison between the average execution time of the convolution stage in Ambisonics binaural rendering ('standard') and RVL binaural rendering, as a function of the number of rendered sources, for two different reverberation times (RT). A random signal with a length of 1024 samples was used as input. Simulations were done in MATLAB (MathWorks) using the overlap-add method [82], running on a quad-core processor at 2.8 GHz. Source: Engel et al. [76].*

#### **Figure 5.**

*Direct sound path and first-order early reflections as they reach the left ear of a listener in three scenarios: (left) before any head rotation; (middle) canonical rendering after a head rotation of 30 degrees clockwise; and (right) RVL rendering after the same head rotation. Note how, in the third scenario, the direct sound path is accurate, whereas the room is head-locked, affecting the incoming direction of reflections. Source: Engel et al. [76].*

#### *Advances in Fundamental and Applied Research on Spatial Audio*

#### **Figure 6.**

*Violin plot showing perceptual ratings from paired comparisons between first-order hybrid Ambisonics (A) and RVL (B) binaural renderings. Negative values represent preference towards A, while positive values represent preference towards B. Source: Engel et al. [76].*

Regardless, further perceptual evaluations (e.g., in more rooms) would be needed to generalise these results. Overall, RVL could be a viable option to render binaural reverberation of a large number of sources in real time in a low-resource scenario.

### **4. Future directions**

The trade-off between complexity and perceived quality when rendering binaural reverberation is still an area of major interest that remains to be fully explored. Recent studies have looked at the perceptual impact of varying the spatial resolution of Ambisonics-based reverberation, but some aspects still warrant further research. For instance, it would be interesting to explore an approach that compresses Ambisonics RIRs by truncating their order depending on their directional and temporal information, as a way to compute and store them more efficiently.
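One way such a compression scheme might work is sketched below: for each time frame of the RIR, pick the lowest order whose channels retain nearly all of the frame's energy. This is a hypothetical energy-based criterion for illustration, not the method of any cited work, and `adaptive_orders` is an invented helper name.

```python
import numpy as np

def adaptive_orders(ambi_rir, frame=256, threshold=0.01):
    """For each frame of an ACN-ordered Ambisonics RIR, return the lowest
    order whose channels keep all but a `threshold` fraction of the
    frame's energy (hypothetical criterion)."""
    n_ch = ambi_rir.shape[0]
    max_order = int(np.sqrt(n_ch)) - 1
    orders = []
    for start in range(0, ambi_rir.shape[1], frame):
        block = ambi_rir[:, start:start + frame]
        total = np.sum(block ** 2) + 1e-12
        for order in range(max_order + 1):
            kept = np.sum(block[: (order + 1) ** 2] ** 2)
            if 1 - kept / total <= threshold:
                break
        orders.append(order)
    return orders
```

Frames dominated by the omnidirectional channel (typical of late, diffuse reverberation) would then be stored and rendered at order 0, while directional frames keep their full order.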

Another set of very relevant challenges will come from using artificial binaural reverberation in different contexts and tasks. For example, binaural audio has been used in the past to assist blind individuals in learning the spatial configuration of a closed environment before being physically introduced to it [85, 86]. In that context, the creation of geometrically and spatially accurate real-time reverberation was extremely important and could be achieved only through a series of case-specific optimisations in the processing chain, for example, limiting navigation paths to a series of lines rather than a two-dimensional space, and pre-calculating a set of Ambisonics RIRs so that only rotations and interpolations need to be computed in real time. Such optimisations are feasible only within a research environment; therefore, real-life applications of these techniques are currently very limited. A better understanding of both the computational and perceptual sides of reverberation, possibly specifically for blind and visually impaired individuals, could lead to major advancements in the development and use of auditory displays and assistive technologies, tools and devices.
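The interpolation step mentioned above can, in its simplest form, be a crossfade between the two pre-computed Ambisonics RIRs nearest to the listener's position. This is a deliberately naive baseline (more sophisticated parametric approaches treat direct and ambient components separately); `interpolate_rirs` is an illustrative helper.

```python
import numpy as np

def interpolate_rirs(rir_a, rir_b, alpha):
    """Linear crossfade between two pre-computed Ambisonics RIRs,
    approximating a listener position between the two measurement
    points (alpha = 0 at point A, alpha = 1 at point B)."""
    return (1.0 - alpha) * rir_a + alpha * rir_b
```

Combined with real-time rotation of the interpolated RIR, this keeps the per-frame cost to a couple of matrix and vector operations, which is what made the navigation scenario above tractable.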

Looking ahead, AR applications could offer an interesting testbed for further research on binaural reverberation perception and rendering. One of the key research areas in AR/VR is 6DoF (or position-dynamic) audio rendering, where the listener is allowed to move around the scene, as opposed to traditional Ambisonics rendering where only head rotations are allowed (three degrees of freedom). Several methods have recently been proposed to efficiently extrapolate spatial audio signals from one listener position to another, either via simple parametric methods [87] or via more complex Ambisonics-based approaches that often rely on parametrising the sound field into 'direct' and 'ambient' components [60, 61], or according to the source distance [62, 63]. Significant advancements have also been made in recording complex auditory scenes and making them navigable in 6DoF: specialised hardware and software have been released and are already commercially available [88]. Future improvements in 6DoF recording and rendering techniques will in turn allow for an increased level of interactivity within simulations, as well as more effective evaluations of different audio rendering technologies using AR/VR systems.

Focusing on the AR case, in order to blend real and virtual audio, it is essential to develop techniques for the automatic estimation of the reverberant characteristics of the real environment. New methods will need to be developed and evaluated for blending virtual audio sources within real scenes, and the impact of blending accuracy will need to be assessed through metrics related to perceived realism and scene acceptability. This can be achieved, for example, by characterising the acoustic environment surrounding the AR user and using this in-situ data to synthesise virtual sounds with matching acoustic properties. Machine learning (ML) techniques could be employed to address the problem of blind acoustic environment characterisation by focusing first on the overall room fingerprint (late reverberation) and then on the finer details of the room response that vary with specific source positions (early reflections). The scene analysis could also be used to extract the direction of arrival of multiple sound sources and the direct-to-reverberant energy ratio by separating source information from room and user acoustic properties. The data extracted by the model could then be employed to generate realistic virtual reverberation matched to the real-world reverberation. Of course, for each step of this scenario, several open challenges remain, both from the computational point of view (e.g., how to generate geometrically and directionally accurate reverberation in real time) and from the perceptual point of view (e.g., what is perceptually relevant and should therefore be computationally modelled and rendered, and what can be approximated).
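As a toy example of the kind of room fingerprint such a pipeline might target, the sketch below estimates RT60 from a measured or simulated RIR via Schroeder backward integration, fitting the decay between -5 and -25 dB and extrapolating to 60 dB. A blind ML-based estimator would work from reverberant signals rather than RIRs, but could be trained against labels computed this way; `rt60_schroeder` is a hypothetical helper.

```python
import numpy as np

def rt60_schroeder(rir, fs):
    """Estimate RT60 from an RIR via Schroeder backward integration,
    extrapolating a T20-style fit (-5 to -25 dB) to 60 dB of decay."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]          # energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-30)   # normalise to 0 dB
    t = np.arange(len(rir)) / fs
    mask = (edc_db <= -5) & (edc_db >= -25)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # decay in dB per second
    return -60.0 / slope
```

Estimators of this kind could provide the 'late reverberation fingerprint' stage of the characterisation pipeline described above, before finer position-dependent details are addressed.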

Better understanding the extent and origin of sensory thresholds in reverberation perception therefore still presents a very open set of challenges, which will need to be addressed in the future through extensive listening experiments and, potentially, by means of binaural auditory models and ML-trained 'artificial listeners'.
