### **3. Acoustic echo cancelation**

#### **3.1 Basic concepts**

Acoustic echoes are generated in speech telecommunication networks due to the acoustic feedback from loudspeakers to microphones. This phenomenon deteriorates the perception of sound by causing users to hear a delayed replica of their own voice reflected back from the other side of the network [4–6]. **Figure 4** shows a driver in a truck cabin having a phone conversation through embedded microphones and loudspeakers (marked red). The speech signal is denoted by *s[n]*, the echo by *y[n]*, and the ambient noise by *r[n]*. The echo (*y[n]*) can be considered a copy of the far-end speech signal played by the loudspeaker (*x[n]*) that has been filtered by the acoustic path (modeled by an FIR filter with the linear impulse response *h[n]*) between the loudspeaker and the microphone. The received signal (*d[n]*) is the sum of these three signals (*d[n] = s[n] + y[n] + r[n]*).

To remove acoustic echoes from the captured signal, acoustic echo cancelation (AEC) algorithms have been developed that use machine learning methods to adaptively estimate the 'acoustic echo path' in real time and subtract its effect from the captured signal, so that only the desired near-end speech components remain [4–6, 20, 21]. Similar adaptive methods are also commonly used for estimating the noise propagation path in stationary noise reduction applications (e.g. [7]). The most common adaptive method used in AEC tasks is normalized least mean square (NLMS) filtering [5, 6, 20, 21]. Least-mean-square adaptive filters were used even in the earliest generation of AECs [4], and several improved variants, such as NLMS, have been developed since then [5, 6, 20, 21].

The true impulse response of the echo path (i.e. *h[n]*) is unknown, and the task of an AEC solution is to identify it. To do so, the NLMS algorithm constantly adapts an impulse response estimate ($\hat{h}[n]$) toward the true impulse response of the echo path (i.e. $\hat{h}[n] = h[n]$), so that $\hat{y}[n] = y[n]$ and the error signal, *e[n]*, becomes zero. The length of $\hat{h}[n]$, denoted by *L*, has an important role in the performance of the AEC. The filter should be long enough to realistically model the acoustic path, yet short enough that the acoustic path can be assumed time-invariant during the time that corresponds to *L* samples. If the goal of the AEC is to reduce the echo by 30 dB, then *L* should correspond to the T30 reverberation time [5]. In most modern vehicles, 50 ms appears to be a good estimate of T30.
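For example, with this 50 ms estimate of T30 and the 16 kHz sampling rate adopted in Section 3.3, the corresponding filter length follows directly:

$$L = T_{30} \, f_s = 0.05 \,\mathrm{s} \times 16000 \,\mathrm{Hz} = 800 \text{ samples}$$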


**Figure 4.**

*A) An overview of the described AEC algorithm. B) A driver in a truck serving as the near-end party during a hands-free phone conversation.*


In case the echo is effectively the only signal present (*r[n]* and *s[n]* are absent), the output of the adaptive process ($\hat{y}[n]$) is given by Eq. (6) below, where the vector of the last *L* taps of *x[n]* is transposed (represented by $\mathbf{x}_L^{T}[n]$) and multiplied by $\hat{h}[n]$:

$$d[n] = \hat{y}[n] = \mathbf{x}_L^{T}[n] \, \hat{\mathbf{h}}[n] \tag{6}$$


The adaptive process estimates a new $\hat{h}[n]$ for each sample of *y[n]* through a small adjustment $\Delta\hat{h}[n]$ in each iteration, as expressed in Eq. (7). This adjustment is determined from the error signal and the reference signal according to Eq. (8), where *μ[n]* is known as the 'step size'. Choosing an optimized step size is important for the convergence rate and accuracy of the system; it is determined by the parameters *α* and *β*, as given in Eq. (9). These two parameters need to be adjusted according to the specifics of every given NLMS problem, and choosing appropriate values for them has been studied comprehensively [5, 20–22].

$$\hat{\mathbf{h}}[n+1] = \hat{\mathbf{h}}[n] + \Delta\hat{\mathbf{h}}[n] \tag{7}$$

$$\Delta\hat{\mathbf{h}}[n] = \mu[n] \, \mathbf{x}_L[n] \, e[n] \tag{8}$$

$$\mu[n] = \frac{\alpha}{\beta + \sigma_{\mathbf{x}_L[n]}^{2}}, \qquad \sigma_{\mathbf{x}_L[n]}^{2} = \mathrm{var}(\mathbf{x}_L[n]) = \mathbf{x}_L^{T}[n] \, \mathbf{x}_L[n] \tag{9}$$
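To make the update equations concrete, the following minimal Python sketch implements the adaptive loop of Eqs. (6)–(9); the function name, signature, and default parameter values are illustrative choices, not taken from the source.

```python
import numpy as np

def nlms_aec(x, d, L=800, alpha=1.0, beta=0.01):
    """Minimal NLMS echo-canceler sketch following Eqs. (6)-(9).

    x: far-end (loudspeaker) reference signal
    d: microphone signal (echo only, per the single-talk assumption)
    Returns the error signal e[n] and the final echo-path estimate h_hat.
    """
    h_hat = np.zeros(L)                     # adaptive FIR estimate of h[n]
    e = np.zeros(len(d))
    for n in range(L - 1, len(d)):
        x_L = x[n - L + 1:n + 1][::-1]      # last L taps of x, newest first
        y_hat = x_L @ h_hat                 # Eq. (6): estimated echo
        e[n] = d[n] - y_hat                 # residual after echo subtraction
        mu = alpha / (beta + x_L @ x_L)     # Eq. (9): normalized step size
        h_hat = h_hat + mu * x_L * e[n]     # Eqs. (7)-(8): coefficient update
    return e, h_hat
```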

Eqs. (6)–(9) are applicable only when the echo is the only component present in the received signal (i.e. *d[n] = y[n]*); in other words, the near-end talker is silent and the ambient noise is insignificant (*s[n] = 0* and *r[n] ≈ 0*). However, in natural full-duplex speech communication both parties (far end and near end) may occasionally talk simultaneously (i.e. a 'double-talk' event may occur). If there is any significant double talk in *d[n]* (i.e. a non-zero *s[n]*), the adaptive process formulated by Eqs. (7)–(9) might diverge and fail, since the *s[n]* components cannot be modeled by *h[n]*. Therefore, every adaptive AEC solution needs to constantly monitor for double-talk events and halt the adaptation as long as double talk is present [23, 24].

#### **3.2 Spatial acoustic echo cancelation**

All conventional NLMS-based adaptive methods, explained in the previous section, rely on modeling the acoustic path by an FIR system and aim to find the coefficients of the corresponding filter (i.e. *h[n]* in **Figure 4**). A major drawback of the NLMS-based adaptive approach is that the adaptive process, presented by Eqs. (7)–(9), needs to run constantly, which imposes a considerable computational cost. This is because the acoustic path, characterized by *h[n]*, changes continuously due to slight movements of objects in the environment and other factors such as temperature variations, and the adaptive process needs to estimate the new impulse response.

Recently, alternative methods based on probabilistic clustering techniques have been successfully used for blind source separation (BSS) of the echo components from the near-end speech signal [25]. The BSS method uses the spatial information from the captured microphone signals to cluster and separate the desired speech signal (*s[n]*) from the echo (*y[n]*). Any BSS method, similar to beamforming, requires multiple microphones to be able to extract the location cues.

Every BSS method assumes that the signals captured by the microphones (*d*1, *d*2, … , *dM*), where *M* denotes the number of microphones, are mixtures of *N* independent source signals (*s*1, *s*2, … , *sN*). The mixture is modeled as described by Eq. (10) below, where *hjk* is the impulse response of length *L* that describes the acoustic path from the *k*th source (*sk*) to the *j*th microphone (*dj*).

$$d_j[n] = \sum_{k=1}^{N} \sum_{p=1}^{L} h_{jk}(p) \, s_k[n-p] \tag{10}$$


$$d_1[n] = \sum_{p=1}^{L} h_{1s}(p) \, s[n-p] + \sum_{p=1}^{L} h_{1x}(p) \, x[n-p] \tag{11}$$

$$d_2[n] = \sum_{p=1}^{L} h_{2s}(p) \, s[n-p] + \sum_{p=1}^{L} h_{2x}(p) \, x[n-p] \tag{12}$$

$$\mathbf{d}[n] = \begin{bmatrix} h_{1s} & h_{1x} \\ h_{2s} & h_{2x} \end{bmatrix} * \begin{bmatrix} s[n] \\ x[n] \end{bmatrix} = \mathbf{W} * \begin{bmatrix} s[n] \\ x[n] \end{bmatrix} \tag{13}$$

The BSS techniques that are extensively used to address the 'cocktail party' problem aim to find the mixing impulse responses (*hjk*) and use this information to de-mix and recover the original speech signals [25, 26]. In case there are at least as many microphones as sources (*M* ≥ *N*), the BSS becomes a 'determined' problem and linear filters can successfully be deployed to separate the mixtures. Otherwise, if there are fewer microphones than sources (*M < N*), the problem is 'underdetermined' and linear filters do not work adequately.

In the case depicted by **Figure 3**, there are two microphones and two independent sources (*M* = *N* = 2), namely: 1) the near-end speech (*s[n]*), and 2) the echo (*x[n]*) that leaks from the loudspeaker to the microphones. Eqs. (11)–(12) formulate the mixture model for the signals received by microphone 1 and microphone 2, respectively. Here, *h1x* and *h2x* are the impulse responses of the acoustic paths from the loudspeaker to the first and the second microphone, respectively. Eqs. (11)–(12) can be combined into matrix form and rewritten as Eq. (13), where the relation between the source and microphone signals is denoted by the Wiener filter *W*. Eq. (13) can then be inverted so that $[s[n];\, x[n]] = \mathbf{W}^{T} * \mathbf{d}[n]$.

A conventional approach to solving Eqs. (10)–(13) is to use independent component analysis (ICA) [26]. Accordingly, a cost function is defined to estimate the statistical (convolutional) independence of *s[n]* and *x[n]*. The coefficients of *W* (which comprises *h1x*, *h1s*, *h2x*, and *h2s*) are adaptively updated so that the statistical independence of *s[n]* and *x[n]* increases. The statistical independence is often increased either by maximizing non-Gaussianity or by minimizing the mutual information between the two signals.
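As a toy illustration of the ICA idea (not the authors' implementation), the sketch below separates an instantaneous 2 × 2 mixture of two synthetic non-Gaussian signals with FastICA; the signals and mixing matrix are made-up stand-ins, and a real AEC setting would instead handle the convolutive mixture of Eq. (13), typically per frequency bin in the STFT domain.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 16000
s = rng.laplace(size=n)                    # stand-in for near-end speech s[n]
x = np.sign(rng.standard_normal(n))        # stand-in for far-end signal x[n]

# Instantaneous 2x2 mixing matrix standing in for W of Eq. (13)
W_mix = np.array([[1.0, 0.6],
                  [0.4, 1.0]])
d = np.stack([s, x]).T @ W_mix.T           # microphone signals d1[n], d2[n]

# FastICA recovers the sources up to permutation and scaling by
# maximizing non-Gaussianity of the de-mixed outputs
ica = FastICA(n_components=2, random_state=0)
estimated_sources = ica.fit_transform(d)   # columns ~ [s[n], x[n]]
```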

#### **3.3 The performance of acoustic echo cancelers in vehicles**

The performance of an AEC is primarily measured by two metrics: 1) echo return loss enhancement (ERLE), and 2) convergence time. ERLE is a commonly used indicator for quantifying how well an AEC solution attenuates echoes [5, 20, 21, 23]. ERLE is calculated according to Eq. (14) below, where $\sigma_{d[n]}^{2}$ and $\sigma_{e[n]}^{2}$ represent the variance of the audio captured by the microphone (*d[n]*) and the variance of the error signal (*e[n]*), which is the output of the AEC and is ideally echo-free. Since all signals are zero-mean, the variance of a signal is a measure of its power. Therefore, Eq. (14) yields the ERLE as the power of the microphone input relative to the AEC output.

$$\mathrm{ERLE} = 10 \log_{10}\!\left(\frac{\sigma_{d[n]}^{2}}{\sigma_{e[n]}^{2}}\right) \tag{14}$$
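In code, Eq. (14) reduces to a one-liner; this sketch assumes zero-mean numpy arrays `d` and `e`, consistent with the text.

```python
import numpy as np

def erle_db(d, e):
    """ERLE per Eq. (14): microphone power relative to AEC output power, in dB."""
    return 10.0 * np.log10(np.var(d) / np.var(e))
```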

The International Telecommunication Union (ITU) G.168 standard for AECs [27] declares a number of requirements that should be followed in all speech telecommunication applications. Accordingly, the AEC should yield at least 6 dB of ERLE by the second frame (since each frame is 50 ms in a typical automotive solution, this means at 0.1 second). The ERLE should then increase to a minimum of 20 dB at 1 second. Thereafter, the ERLE should reach its steady state by 10 seconds and stay above that steady-state value afterwards.

The convergence time is the time it takes for the AEC to reach its steady-state ERLE. ITU G.168 requires that the convergence time be no longer than one second. In the tuning of the adaptive parameters, such as the step size, there is a tradeoff between ERLE and convergence time, since a higher ERLE might come at the cost of slower convergence [21, 22].

We implemented the adaptive NLMS-based AEC described by Eqs. (6)–(9) in a large Volvo truck model. The length of the Wiener filter (*L*) was chosen to be 800, which corresponds to 50 ms at the sampling rate of 16 kHz and is consistent with T30 in large vehicles. The term *α* in Eq. (9), which can take a value between 0 and 2, determines the speed of convergence. Higher *α* values result in quicker adaptation of the NLMS algorithm; however, there is a tradeoff between convergence speed and the overall success of the echo canceler in terms of ERLE [20]. Here, we chose *α = 1.98* to ensure fast convergence of the algorithm. The term *β*, known as the regularization parameter, is meant to improve the performance of the NLMS in noise and has to be adjusted with regard to the characteristics of the ambient noise (*r[n]* in **Figure 1**) and the signal-to-noise ratio (SNR) of the microphone hardware [20]. Here, we chose *β = 0.1*, which corresponds to the SNR of the electret condenser microphones that are commonly used in the automotive industry.
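Plugged into the earlier `nlms_aec` and `erle_db` sketches (both illustrative, not the authors' code), these parameter choices would read:

```python
# x: far-end reference, d: microphone recording (e.g. loaded from WAV files).
# alpha = 1.98 for fast convergence, beta = 0.1 matching the microphone SNR,
# L = 800 taps (50 ms at 16 kHz), as chosen in the text.
e, h_hat = nlms_aec(x, d, L=800, alpha=1.98, beta=0.1)
print(f"ERLE: {erle_db(d, e):.2f} dB")
```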

Furthermore, a statistical double-talk detection (DTD) decision circuit was implemented based on the normalized cross-correlation (NCC) between *x[n]* and *d[n]*; NCC is also called the 'Pearson correlation coefficient' in statistics [28]. In case the far end is the only talker, there will be a non-zero cross-correlation between *x[n]* and *d[n]*. However, when the near end talks too (i.e. double talk occurs), the cross-correlation between *x[n]* and *d[n]* diminishes and approaches zero, since *d[n]* then conveys *s[n]* components as well. Accordingly, double talk is detected if the NCC drops below a certain threshold. Eq. (15) presents the NCC between *x[n]* and *d[n]*, where $\sigma_{x_L[n]}$ and $\sigma_{d_L[n]}$ are the standard deviations (square roots of the variances) of *L* samples of *x[n]* and *d[n]*, respectively, and $\mathrm{cov}(x_L[n], d_L[n])$ is the covariance between them.

$$\text{NCC}\left(\mathbf{x}\_{L}[n], d\_{L}[n]\right) = \frac{\text{cov}(\mathbf{x}\_{L}[n], d\_{L}[n])}{\sigma\_{\mathbf{x}\_{L}[n]} \times \sigma\_{d\_{L}[n]}} \tag{15}$$

NCC yields a number in the range [−1, +1], where +1 indicates perfect correlation, −1 indicates perfect anti-correlation between the two inputs, and 0 indicates no correlation. Here, we set the threshold of our DTD decision to 10<sup>−4</sup> using the method discussed in [28], by normalizing the false alarm probability (*pf*) to about 0.1.
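A minimal sketch of such a decision rule follows, assuming numpy arrays holding the last *L* samples of each signal; the helper name and the small epsilon guard against division by zero are illustrative.

```python
import numpy as np

def is_double_talk(x_L, d_L, threshold=1e-4):
    """NCC-based double-talk decision per Eq. (15).

    Returns True when the Pearson correlation between the far-end
    reference frame and the microphone frame drops below the threshold,
    signaling that NLMS adaptation should be halted.
    """
    xc = x_L - x_L.mean()
    dc = d_L - d_L.mean()
    ncc = (xc @ dc) / (np.sqrt((xc @ xc) * (dc @ dc)) + 1e-12)
    return ncc < threshold
```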

To evaluate the presented AEC solution, the far-end party reads 10 HINT sentences while the driver (the near-end party) is silent and the vehicle's engine is off. The system registers the signal fed to the loudspeaker (*x[n]*) while the microphone records *y[n]*. In this case *d[n] = y[n]*, since the driver is silent (*s[n] = 0*) and there is no engine noise (*r[n] = 0*). The presented solution is applied to these signals and, as depicted in **Figure 5** below, the presented AEC solution significantly attenuates the echo received by the microphone, by a total of 25.54 dB according to Eq. (14). **Figure 5(B)** shows the ERLE per sentence and how the ERLE grows as the algorithm continues adapting. The results demonstrate compliance with the ITU G.168 standard [27].

**Figure 5.** *A) Captured microphone data (*d[n]*) versus the output of the AEC (*e[n]*) while the far end is reading ten HINT sentences. The sentences are marked by numerical indicators. B) The echo attenuation achieved by the presented AEC solution in terms of ERLE per HINT sentence.*

#### **3.4 Post-processing acoustic echo suppression**

The minimum acceptable ERLE required by ITU G.168 (i.e. 20 dB) may not suffice in practice, since the echo might still be noticeable and irritating, especially if the loudspeaker volume is set at a high level. As an example, if the loudspeaker is set to generate sounds of about 70 dB SPL, an ERLE of 20 dB implies that an echo of 50 dB SPL (i.e. 70 − 20 = 50) is transmitted back to the far-end party, which can be quite noticeable. Therefore, it is good practice in the automotive industry to achieve much higher echo reduction, typically over 40 dB.

Conventional NLMS-based adaptive AEC modules typically achieve at most about 30 dB of ERLE, as shown in **Figure 5**. Therefore, to further improve the echo reduction, the remaining echo components (i.e. the 'residual echo') are suppressed by means of acoustic echo suppression (AES) post-processing. The simplest AES methods, which have historically been used, are based on attenuating the captured microphone signal (*d[n]* in **Figure 4**) whenever the far end is talking (i.e. whenever the magnitude of *x[n]* is over a reasonable threshold) [5]. A major shortcoming of this method is that, in case of double talk, wherein both near end and far end are simultaneously talking, the near-end speech signal is also attenuated. Another issue is that such an approach is nonlinear, and speech recognition models require the audio signal chain to be free of any nonlinearity [29]. Since adaptive AEC algorithms use linear filters to cancel echo, they can legitimately be used as a preprocessing stage for ASR systems, whereas nonlinear AES must be avoided in ASR applications. As a result, linear solutions, such as the BSS techniques explained previously by Eqs. (10)–(13), have been deployed to perform the task of AES on the residual echo, especially in speech recognition applications. A properly designed combination of a conventional adaptive AEC and a post-processing AES can comfortably achieve echo reductions over 40 dB.
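A naive sketch of this historical gain-based AES follows, mainly to make its shortcomings concrete; the frame length, attenuation gain, and activity threshold are illustrative assumptions.

```python
import numpy as np

def simple_aes(d, x, frame=800, gain=0.1, rms_threshold=0.01):
    """Naive gain-based echo suppressor (for illustration only).

    Attenuates the microphone signal whenever the far-end frame is
    active. As noted in the text, this also attenuates near-end speech
    during double talk, and the gain switching is nonlinear, which
    rules the method out ahead of ASR systems.
    """
    out = d.astype(float).copy()
    for start in range(0, len(d), frame):
        seg = slice(start, start + frame)
        if np.sqrt(np.mean(x[seg] ** 2)) > rms_threshold:  # far end talking?
            out[seg] *= gain                               # duck the microphone
    return out
```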

### **4. Discussions, conclusions, and prospects**

Hands-free telephony has been extensively offered in premium cars since the early 2000s, and since then, audio signal processing modules have been deployed to enhance speech signal quality by addressing issues such as acoustic echo, ambient noise, and reverberation [1]. Besides hands-free telephony, speech dialog systems have been developed to enable drivers to communicate with vehicle functions by voice [2, 3]. At the core of such a speech dialog system is a neural-network-based acoustic model that performs the speech recognition task. Speech recognition systems also demand high-quality audio input, which makes speech signal enhancement techniques necessary. In particular, online voice assistants rely on specific 'wake words' (also called 'hot words') to start communicating with users: 'Ok Google!' for Google Assistant, 'Alexa!' for Amazon Alexa, and 'Hey Siri!' for Siri. The ASR system should constantly listen for these wake words while music or speech signals might simultaneously be playing on the speakers. In order to detect the wake words while playing sounds, the system needs a capable echo cancelation module to estimate and cancel the feedback from the speaker(s) to the microphone(s), as well as a noise reduction module (such as beamforming) to minimize the reverberation and ambient noise in the captured signals.

In this chapter, the fundamentals of filter-and-sum beamforming were described and two practical designs of dual-microphone fixed beamforming (endfire versus broadside) were presented, inside a personal car and a truck, respectively. The fundamentals of beamforming were described for the general case, although the applications focused exclusively on dual-microphone setups because that is the most common configuration in vehicles. The directivity index, which is the gain of the beamformer toward the desired DOA relative to all other directions, is a good measure of a beamformer's performance. A conventional multi-microphone fixed beamformer can achieve a directivity index of about 25 dB at best [30]; in the real world, the directivity index turns out to be lower. In the case of a dual-microphone solution, the directivity index is modest, i.e. in the range of 10 to 12 dB. Multiple microphones can provide a sharper beam and potentially higher SNRIs.

Despite its modest directivity index, a well-designed beamforming system improves the quality of the sound substantially. One important benefit of beamforming, besides the SNRI, is the reduction in perceived reverberation. Reverberation is the sum of all sound reflections from the walls and surroundings of a given acoustic room and has been shown to have adverse effects on speech intelligibility, especially for hearing-impaired listeners [31]. Beamforming minimizes reverberation in the captured signal by geometrically dampening the sound reflections received from undesired directions and thereby facilitates speech intelligibility. Moreover, beamforming modules are in many cases followed by non-stationary noise reduction modules that adaptively suppress the noise (e.g. [7]). Together with the beamformer, an adaptive noise suppressor can achieve very good results in managing non-stationary noise.

Neural-based beamforming was also described in this chapter. This type of beamforming, wherein the steering filter coefficients are optimized jointly with a neural-network speech model, has emerged in many speech recognition applications and shown remarkable success [13, 15, 16]. However, since the beamforming coefficients are optimized implicitly as part of a speech recognition task, the success of this method in improving sound quality for a human listener is not entirely known, and further studies are needed to evaluate it for telephony and hearing-aid applications wherein human listeners are involved.

A large part of this chapter was dedicated to acoustic echo cancelation. The fundamentals of a conventional adaptive method based on NLMS were described. In this method, the acoustic path between the loudspeaker and the microphone is modeled by an FIR filter, and the adaptive process seeks to find the coefficients of this filter and subtract the echo from the captured signal. Adaptive NLMS-based acoustic echo cancelers are relatively easy to implement and are extensively in use. If designed appropriately, this method can comfortably achieve ERLEs of about 30 dB [5, 21, 22, 30]. Although this level is higher than that required by the ITU guidelines [27], a higher ERLE is necessary in most automotive telephony applications. Therefore, acoustic echo suppression algorithms have been developed as post-processing modules to further reduce the residual echo.

The simplest and most common acoustic echo suppressors are implemented by applying a gain to the microphone signal and reducing this gain whenever the far-end party is talking. However, due to its nonlinear behavior, this approach cannot be used in speech recognition applications, which require linearity of all audio components [29]. Instead, linear approaches such as BSS based on ICA appear to be suitable. The BSS method uses spatial cues to find the mixing coefficients of a linear model and uses this information to de-mix the signals and segregate the source signals (in this case: echo versus near-end speech).

Although beamforming and echo cancelation are well-known problems that have been studied extensively since the early 1960s [4, 5, 9, 30], great effort is needed to tailor them to new challenges. Therefore, new statistical optimization approaches and neural-network-based solutions are being deployed to strengthen the conventional methods whenever feasible. The automotive industry is expanding quickly, and manufacturers are competing in providing vehicles that allow vehicle occupants to have independent conference calls simultaneously. Another competition frontier is speech recognition: automotive manufacturers aim to provide user interfaces that are driven by voice. These interfaces allow drivers to simply talk to their cars and do their daily errands (such as online shopping, scheduling meetings, or listening to audio books) while driving, by voice commands alone. Prototypes of such online automotive voice assistants have recently been introduced as Google [32] and Amazon [33] entered the field, and they have received great attention from the media and the public. These systems open up new scientific and technical challenges in human-machine interfacing, cloud-based and embedded speech recognition, and, last but not least, spatial audio signal processing.

### **Acknowledgements**

Parts of this project have been funded by the innovation office at the Department of Vehicle Connectivity (VeCon) at Volvo Group in Gothenburg, Sweden, and some of the results were published in an M.Sc. thesis by Balaji Ramkumar in collaboration with Linköping University.
