**2. Simulating reverberation efficiently**

Simulating reverberation can be useful in various applications. In some cases, such as music production, it serves a mainly aesthetic purpose and may not require highly realistic simulations. In other cases, such as architectural acoustics, augmented reality (AR) and, to a lesser extent, virtual reality (VR), the goal is to recreate a real acoustic space, so reverberation needs to be modelled with sufficient accuracy. For instance, an AR system allows users to perceive the real world integrated with a virtual layer, e.g., a videoconferencing application in which users, wearing a pair of AR glasses, see holograms of their interlocutors who look and sound as if they were in the same room. From an acoustic point of view, this is particularly challenging to implement because the listener is exposed to real sound sources as well as virtual ones, so the simulated acoustics must be realistic enough for the virtual and real sources to blend convincingly. Even though highly realistic reverberation is often desired, it can easily become too expensive to simulate in real time for interactive applications, where the auditory scene is expected to vary over time, and even more so if many virtual sources are simulated [28]. Therefore, it is worth exploring simplified reverberation models that reduce computational costs without compromising perceived quality.

In the most general case, reverberation is rendered by convolving a dry audio signal with a room impulse response (RIR), which is the time-domain acoustic transfer function between a sound source and a receiver in a given acoustic space (room), assuming that the system formed by these is linear and time-invariant. The RIR can be either measured acoustically [29] or obtained from a simulated environment. Several simulation techniques have been proposed, ranging from rigorous but computationally expensive physical models, such as the finite-difference time-domain method [30], to simpler but less accurate geometrical models, such as the image source method [31] or scattering delay networks [32]. Ray-tracing and cone-tracing are also popular techniques that allow for a variable degree of accuracy [28, 33–35], although the computational requirements can become rather intensive when sound sources move in space, and real-time implementations are often limited to very simplified models and/or renderings.
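The core operation can be sketched in a few lines of NumPy. This is a minimal illustration, not a production renderer: the dry signal and RIR below are synthetic placeholders rather than measured data, and a real-time implementation would use partitioned FFT convolution instead of direct convolution.

```python
# Convolution-based reverberation sketch (all signals are synthetic stand-ins).
import numpy as np

fs = 48000                              # sample rate in Hz
rng = np.random.default_rng(0)

dry = rng.standard_normal(fs // 4)      # 0.25 s of dry signal (placeholder noise)

# Toy RIR: exponentially decaying noise, a crude stand-in for a measured response
t = np.arange(int(0.1 * fs)) / fs
rir = rng.standard_normal(t.size) * np.exp(-6.9 * t / 0.1)  # ~60 dB decay in 0.1 s
rir /= np.abs(rir).max()

wet = np.convolve(dry, rir)             # reverberant signal
# Output length is len(dry) + len(rir) - 1, as for any linear convolution.
```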

Reverberation may also be generated through computationally lighter 'convolution-less' methods, such as Schroeder reverberators [36] or feedback delay networks (FDN) [37–39]. Such techniques are generally less accurate than convolution-based methods but can be useful for efficiently modelling the less critical parts of the RIR, such as the late-reverberation tail [40].
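To make the idea concrete, the following is a minimal four-line FDN sketch, assuming nothing beyond NumPy. The mutually prime delay lengths, the orthogonal Hadamard feedback matrix, and the gains derived from a target RT60 follow the usual FDN recipe, but all specific values are illustrative.

```python
# Minimal feedback delay network (FDN): four delay lines mixed through an
# orthogonal feedback matrix, with per-line gains setting the decay rate.
import numpy as np

def fdn_reverb(x, fs=48000, delays=(1031, 1327, 1523, 1871), rt60=1.0):
    """Feed signal x through the FDN; returns the (tail-extended) output."""
    delays = np.asarray(delays)
    g = 10.0 ** (-3.0 * delays / (fs * rt60))   # -60 dB after rt60 seconds
    A = 0.5 * np.array([[1,  1,  1,  1],
                        [1, -1,  1, -1],
                        [1,  1, -1, -1],
                        [1, -1, -1,  1]])       # orthogonal Hadamard mixing
    n_out = len(x) + int(fs * rt60)
    buf = [np.zeros(d) for d in delays]         # delay-line ring buffers
    idx = np.zeros(4, dtype=int)
    y = np.zeros(n_out)
    for n in range(n_out):
        s = np.array([buf[i][idx[i]] for i in range(4)])   # line outputs
        y[n] = s.sum()
        fb = A @ (g * s)                        # attenuate, then mix feedback
        xin = x[n] if n < len(x) else 0.0
        for i in range(4):
            buf[i][idx[i]] = xin + fb[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```

Note that the output contains only the recirculating tail (no direct path); in practice it would be mixed with the dry signal and, as discussed below, with separately rendered early reflections.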

With the goal of finding a balance between computational cost and perceived quality, several parametric reverberation models have been proposed [40–47]. Most of them aim to alleviate computational costs by rendering early reflections with a higher temporal and spatial accuracy than late reverberation, based on the concept of mixing time, i.e., the instant after which the RIR does not perceptibly change across different listener positions or orientations within the room (see **Figure 1**) [48].

An early example of this approach, known as 'hybrid' reverberation, was presented by Murphy and Stewart [40], who proposed to employ convolution-based rendering for early reflections and simpler methods (e.g., FDN) to produce late reverberation. A key aspect of the hybrid model is correctly establishing the mixing time, which depends on the room volume and is higher for larger rooms [48].
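The splitting step at the heart of the hybrid model can be sketched as follows. The function assumes the mixing time `t_mix` is already known (e.g., estimated per Lindau et al. [48]); the 5 ms crossfade length is an arbitrary illustrative choice.

```python
# Split an RIR at the mixing time into an 'early' part (for accurate
# convolution) and a 'late' part (to be replaced by a cheaper model, e.g. an
# FDN). Complementary linear fades avoid a discontinuity at the split point.
import numpy as np

def split_rir(rir, fs, t_mix, fade_ms=5.0):
    """Return (early, late) parts of an RIR, crossfaded at the mixing time."""
    n_mix = int(t_mix * fs)
    n_fade = int(fade_ms * 1e-3 * fs)
    fade = np.linspace(1.0, 0.0, n_fade)          # linear fade-out window
    early = rir[:n_mix + n_fade].copy()
    early[n_mix:] *= fade                         # fade the early part out
    late = rir.copy()
    late[:n_mix] = 0.0
    late[n_mix:n_mix + n_fade] *= 1.0 - fade      # complementary fade-in
    return early, late
```

Because the fades are complementary, summing the two parts reconstructs the original RIR exactly; in the hybrid model, only the late part is then replaced by a parametric tail.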

In spatial audio applications, it is important to accurately simulate the direction of arrival of early reflections (and of late reverberation, to a lesser extent), which adds yet another layer of difficulty to the process. This also means that the reproduction method should be able to replicate such spatial cues. An example of a playback system would be a loudspeaker array surrounding the listener that can simulate virtual sources and reflections through amplitude panning [49] or Ambisonics [50]. In the case of binaural audio, such systems may be mimicked through virtual loudspeakers, but other methods also exist, as discussed in Section 2.1.
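As a small illustration of amplitude panning, the sketch below solves the two-loudspeaker (2D, VBAP-style) case. The loudspeaker azimuths of ±45°, the function name, and the (cos, sin) azimuth convention are all assumptions made for the example, not part of any particular reference implementation.

```python
# Pairwise amplitude panning in 2D: a virtual source direction is reproduced
# by solving for gains on a loudspeaker pair, then normalising their energy.
import numpy as np

def pan2d(src_az, spk_az=(-45.0, 45.0)):
    """Return energy-normalised gains (g_left, g_right) for a source azimuth
    in degrees, given two loudspeaker azimuths (illustrative convention)."""
    L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))]
                  for a in spk_az])           # rows: loudspeaker unit vectors
    p = np.array([np.cos(np.radians(src_az)), np.sin(np.radians(src_az))])
    g = np.linalg.solve(L.T, p)               # p = g1*l1 + g2*l2
    return g / np.linalg.norm(g)              # unit-energy normalisation
```

A source straight ahead (0°) yields equal gains on both loudspeakers, while a source at a loudspeaker's own azimuth is reproduced by that loudspeaker alone.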

Note that the scope of this chapter covers reverberation's spatial features from the listener's point of view, but not from the source's point of view. Therefore, sound source directivity is not discussed, even though it is an important topic on its own—e.g., it is essential to model it correctly in a six-degrees-of-freedom (6DoF) application where the listener is allowed to walk past a directional source [51].

#### **2.1 The binaural case**

#### **Figure 1.**

*First 130 ms of an RIR, expressed in decibels relative to the peak value. The RIR was simulated with the image source method [31] for an omnidirectional point source placed 10 m away from the receiver in a room with an approximate volume of 2342.7 m<sup>3</sup>. The mixing time, estimated according to Lindau et al. [48], is indicated.*

### *Reverberation and its Binaural Reproduction: The Trade-off between Computational… DOI: http://dx.doi.org/10.5772/intechopen.101940*

When rendering reverberation binaurally, the directional information of reflected sounds is encoded in the binaural room impulse response (BRIR), i.e., a pair of RIRs measured at the listener's ear canals, in the form of monaural and interaural cues. Therefore, the most effective and straightforward way to achieve an accurate binaural rendering is to convolve an anechoic audio signal with a BRIR. Static (non-head-tracked) BRIR-based renderings can produce highly authentic binaural signals, to the point of being indistinguishable from those emitted by real sound sources [52–55]. On the other hand, dynamic (head-tracked) renderings are more challenging to implement, as they require swapping between BRIRs as the listener or the source moves. It is worth noting that, when dealing with binaural renderings of anechoic environments, an angular movement of a source relative to the listener is roughly equivalent to a head rotation of the listener, which is typically trivial to compute in the Ambisonics domain using rotation matrices ([56], Section 5.2.2). However, this does not generalise to reverberant environments, where the room provides a frame of reference, and the angular movement of a source is not equivalent to rotating the listener's head.
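To see why head rotations are so cheap in the Ambisonics domain, consider the first-order (B-format) case: a yaw simply rotates the (X, Y) channel pair, leaving W and Z untouched. The sketch below assumes a W, X, Y, Z channel ordering; the sign convention depends on whether the sound field or the head is taken to rotate.

```python
# Yaw rotation of a first-order Ambisonics (B-format W, X, Y, Z) frame:
# a 2x2 rotation on (X, Y), with W (omni) and Z (vertical) unchanged.
import numpy as np

def rotate_foa_yaw(wxyz, yaw_rad):
    """Rotate a B-format sample (or channels-first block) about the z axis."""
    w, x, y, z = wxyz
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.array([w, c * x - s * y, s * x + c * y, z])
```

Higher orders work the same way, with larger (but still sparse, block-diagonal) rotation matrices acting within each order.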

A recent study has suggested that BRIRs should be measured by varying the listener position in increments of 5 cm or less in a three-dimensional grid (which can be a costly process) to achieve a dynamic convolution-based rendering in which the swapping is seamless to the listener [57]. Alternatively, one may start from a coarser spatial grid and interpolate BRIRs at intermediate positions. Unfortunately, BRIR interpolation is not trivial because the time and direction of arrival of each reflection may vary depending on the receiver's position, changing the BRIR's temporal structure across the grid. Nevertheless, recent studies have shown promising progress by employing dual-band approaches and heuristics to match early reflections in the time domain [58, 59]. On a related note, another active research topic is the extrapolation of RIRs in the Ambisonics domain for 6DoF applications (e.g., [60–63]), which is further discussed in Section 4.
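The difficulty of naive interpolation is easy to demonstrate. The sketch below performs a plain sample-wise crossfade between two BRIRs, which is exactly the approach that fails: a reflection arriving at different times in the two responses yields two attenuated copies instead of one time-shifted peak (the smearing that the dual-band and reflection-matching methods [58, 59] are designed to avoid).

```python
# Naive sample-wise BRIR interpolation, shown to illustrate its failure mode.
import numpy as np

def interpolate_brir_naive(brir_a, brir_b, alpha):
    """Sample-wise crossfade; alpha in [0, 1] moves from position A to B.
    Misaligned reflections are smeared into doubled, attenuated peaks."""
    return (1.0 - alpha) * brir_a + alpha * brir_b
```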

Although BRIRs are mainly obtained through binaural measurements made on a person's or a mannequin's head [55], they may also be generated from RIRs that were either measured with microphone arrays [64–68] or simulated [28, 35]. This approach typically involves identifying individual reflections and their direction of arrival, e.g., with the help of the spatial decomposition method (SDM) [65], and then convolving each reflection with a head-related impulse response (HRIR) for the corresponding direction [69]—which is equivalent to a multiplication with a head-related transfer function (HRTF) in the frequency domain. However, rendering the full length of the BRIR this way can easily become expensive, which is why simplified models such as the aforementioned 'hybrid' one become important: we can just render a few early reflections accurately while modelling late reverberation as a stochastic, non-directional process, and still produce binaural signals that are not perceptually different from properly rendered ones. This has been recently shown by Brinkmann et al. [47], who suggested that accurately rendering just six early reflections plus stochastic late reverberation may be enough to produce auralisations that are perceptually indistinguishable from a fully-rendered reference, for a simulation of a shoebox-type room.
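The per-reflection rendering step can be sketched as follows. This is a schematic of the general idea only: the reflection list (delay, gain, direction) and the HRIR lookup are placeholders standing in for the output of an analysis method such as SDM and for a measured HRTF set, respectively.

```python
# Build a BRIR from a list of reflections by placing, at each reflection's
# delay, a gain-scaled copy of the HRIR pair for its direction of arrival.
import numpy as np

def render_brir(reflections, hrir_set, fs, length):
    """reflections: iterable of (delay_s, gain, direction_key).
    hrir_set: direction_key -> (left_hrir, right_hrir) arrays (placeholder).
    Returns a (2, length) BRIR."""
    brir = np.zeros((2, length))
    for delay_s, gain, direction in reflections:
        n0 = int(round(delay_s * fs))
        for ear, hrir in enumerate(hrir_set[direction]):
            n1 = min(length, n0 + len(hrir))
            brir[ear, n0:n1] += gain * hrir[: n1 - n0]
    return brir
```

Summing a shifted, scaled HRIR per reflection is equivalent to convolving each reflection impulse with its HRIR, which is where the cost of rendering a full-length BRIR comes from, and why restricting this treatment to a few early reflections is attractive.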

It should be noted that modelling late reverberation as isotropic is computationally inexpensive but may lead to noticeable degradation when simulating asymmetrical rooms (e.g., a long and narrow corridor) where late reverberation is highly directional [12]. For such cases, Alary et al. have proposed directional feedback delay networks (DFDN) [39], which extend traditional FDNs to spatial audio and allow non-uniform reverberation to be produced inexpensively, so that the reverberation time (RT) is direction-dependent. A downside of DFDNs is their inability to correctly reproduce early reflections, which should be modelled separately for best results.

Another simplification consists in quantising the direction of arrival of reflections by 'snapping' them to the closest neighbour in a predefined grid. This method is explored by Amengual Garí et al. [69], who found that an RIR may be quantised to just 14 directions in a Lebedev grid [70] and still be used to render binaural signals through SDM without perceptual degradation when compared to the original. The scattering delay network method (SDN) is based on a similar premise, quantising the RIR to as many directions as first-order reflections, e.g., six for a cuboid room, while obtaining good results in perceptual evaluations [32]. The rationale of SDN is that early reflections are computed accurately, while later ones are approximated with higher error as time advances, which is a sensible approach from a perceptual point of view. However, it might lead to an inaccurate late reverberation tail, which is why combining SDN with an inexpensive method for late reverberation simulation (e.g., DFDN) might be a promising alternative.
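The quantisation step itself is straightforward: each reflection's unit direction vector is snapped to the grid point with which it has the largest dot product (i.e., the smallest angle). The six-direction grid below is an illustrative stand-in; a Lebedev grid [70] or the first-order reflection directions of SDN would take its place in practice.

```python
# Snap reflection directions of arrival to the nearest point of a small grid.
import numpy as np

# Illustrative six-direction grid (the face centres of a cube).
GRID = np.array([[1, 0, 0], [-1, 0, 0],
                 [0, 1, 0], [0, -1, 0],
                 [0, 0, 1], [0, 0, -1]], dtype=float)

def snap_to_grid(direction, grid=GRID):
    """Return the index of the grid direction closest to a direction vector."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return int(np.argmax(grid @ d))   # max dot product = min angular distance
```

After quantisation, all reflections sharing a grid direction can be rendered through a single HRIR pair, which is where the computational saving comes from.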

On the other hand, rather than generating separate BRIRs for each rendered sound source, one may also 'encode' the sum of all of them into a single sound field, and then reproduce it binaurally, e.g., by means of a set of virtual loudspeakers. That way, only the virtual loudspeaker signals must be binaurally rendered, independently of the number of sources that form the sound field. This is a convenient simplification when many sources are rendered at once. As mentioned earlier, typical loudspeaker-based sound field reproduction methods include vector-based amplitude panning [49] and high-order Ambisonics [50, 56, 71]. The latter is by far the more popular method for binaural rendering, given its efficient simulation of head rotations ([56], Section 5.2.2) and manipulation of spatial resolution [72]. However, the Ambisonics processing may have perceptible effects on the binaural signals, which are still being investigated. Recent research on this topic is reviewed in Section 3.
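The virtual-loudspeaker stage can be sketched as below: regardless of how many sources were mixed into the loudspeaker feeds, the binaural rendering cost is fixed at one HRIR convolution pair per virtual loudspeaker. The array shapes and the function name are assumptions for the example.

```python
# Binauralise a set of virtual loudspeaker feeds: convolve each feed with the
# HRIR pair for that loudspeaker's direction and sum per ear.
import numpy as np

def binauralise(speaker_feeds, hrirs):
    """speaker_feeds: (n_spk, n_samples); hrirs: (n_spk, 2, hrir_len).
    Returns a (2, n_samples + hrir_len - 1) binaural signal."""
    n_spk, n = speaker_feeds.shape
    out = np.zeros((2, n + hrirs.shape[2] - 1))
    for spk in range(n_spk):
        for ear in range(2):
            out[ear] += np.convolve(speaker_feeds[spk], hrirs[spk, ear])
    return out
```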
