**1. Introduction**

In the 1990s, the first telephony systems were introduced in vehicles, enabling drivers to hold hands-free phone calls through the vehicle's embedded microphones and loudspeakers while driving [1]. To assure audio quality during hands-free telecommunication, a number of speech signal processing techniques are widely used. Besides hands-free telephony, speech dialog systems have been developed that enable drivers to control vehicle functions and media content by voice [2, 3]. At the core of a speech dialog system is an acoustic model that performs the speech recognition task. Speech dialog systems require high-quality audio input to assure the accuracy of the speech recognition.

In a vehicle audio system, the phone is connected to the infotainment head unit via a Bluetooth communication channel, which allows the driver's speech (near end) to be routed from the microphones mounted inside the vehicle to the other side of the telecommunication network (far end). Conversely, the speech signal received from the far end is played over the vehicle's loudspeakers.

A major problem that typically arises in this communication system is that the far end hears a replica of their own voice coming back from the vehicle with a certain delay (i.e., acoustic echo). The observed acoustic echo is due to the acoustic feedback from the loudspeakers to the microphones in the vehicle [4–6]. Various acoustic echo cancellation (AEC) solutions have been developed to address this issue. Most of these AEC solutions use adaptive filters that aim to model the acoustic path between loudspeakers and microphones and thereby estimate and subtract the echo from the microphone signal [4–6].
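
As a concrete illustration of this adaptive-filter principle, the following is a minimal sketch of a normalized least-mean-squares (NLMS) echo canceller in Python; the function name, tap count, and step size are illustrative assumptions rather than the specific algorithms of [4–6].

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, n_taps=256, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of the far-end (loudspeaker)
    signal from the microphone signal, returning the residual."""
    w = np.zeros(n_taps)                 # FIR estimate of the loudspeaker-to-mic path
    out = np.zeros(len(mic))
    for n in range(n_taps, len(mic)):
        x = far_end[n - n_taps:n][::-1]  # most recent reference samples, newest first
        e = mic[n] - w @ x               # residual = mic minus estimated echo
        w += mu * e * x / (x @ x + eps)  # NLMS update toward the true echo path
        out[n] = e
    return out
```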

Another major problem is that the signals captured by the microphones are contaminated with ambient noise and reverberation. The ambient noise often consists of stationary noise sources (engine noise, road noise, window vibrations) and non-stationary cross-talk from other car occupants. To address stationary ambient noise, high-pass filters have been used, mainly to filter out engine noise and structural vibration components in the captured signal [1, 2]. To address non-stationary ambient noise, adaptive algorithms have been extensively developed (e.g., [7]).
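
For instance, a simple high-pass stage of this kind can be sketched as below; the roughly 120 Hz cutoff is an assumed ballpark for engine and vibration noise, not a value taken from [1, 2].

```python
from scipy.signal import butter, sosfilt

def suppress_engine_noise(x, fs, cutoff_hz=120.0, order=4):
    """High-pass the captured signal to attenuate engine noise and
    structural vibration components concentrated at low frequencies."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, x)
```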

Moreover, directional microphones have been used in vehicles to form a spatial focus toward the driver while attenuating signals arriving from other directions. These directional microphones were often of the cardioid type and were usually mounted on the ceiling of the cabin, above the driver's head, and directed toward the driver's mouth. Most common cardioid microphones are electret condenser components that achieve this directivity by means of mechanical channels built into their membranes. In the 2000s, a new generation of microphones, known as micro-electromechanical systems (MEMS), was introduced to the electronics industry, providing superior performance in capturing sound at a low cost [8]. Since the mobile phone industry started to deploy MEMS extensively in its products, this type of microphone has prevailed in most telecommunication applications, e.g., in tablets, wearable devices, medical systems, and automobiles [8].

A major difference between MEMS and electret microphones is that MEMS microphones, due to their specific design and miniature structure, are omnidirectional, i.e., they treat sounds arriving from all directions equally. Therefore, the desired directivity needs to be implemented by means of external post-processing. To do so, a number of MEMS microphones are placed at certain distances from each other, forming an 'array', and array signal processing techniques are applied to exploit the time differences and relative phase shifts across the signals captured by the microphones in the array. In this way, sounds arriving from specific directions can be amplified or attenuated, creating the desired spatial directivity [9–11].

### **2. Spatial filtering: beamforming**

#### **2.1 Basic concepts**

A beamformer is a signal processing module that performs spatial filtering to separate signals that have overlapping temporal and spectral content but originate from different spatial locations [9–12]. A conventional linear beamformer, as shown


**Figure 1.**

*A linear beamformer consisting of an array of M microphones. The signal captured by the jth microphone passes through the jth finite-impulse-response filter defined by its weights ($W_j$). The direction of arrival (DOA) is denoted by θ.*

in **Figure 1**, is a filter-and-sum system: a bank of filters is applied to the input array signals and the filter outputs are then summed. The task is to set the complex filter coefficients ($W_j$ in **Figure 1**) such that specific directions in the received array signals are amplified and other directions are suppressed.

From another perspective, a beamformer can be viewed as a multiple-input single-output system whose output $y[n]$ is determined by Eq. (1) below.

$$y[n] = \sum_{j=1}^{M} \sum_{p=1}^{L} W_{j,p} \, x_j[n-p] \tag{1}$$

Eq. (1) can also be viewed as a sum of *M* finite impulse response (FIR) filters, with *L* coefficients per filter, applied to the input signals ($x_j[n]$). Eq. (1) can be written compactly as Eq. (2), where $T$ denotes the Hermitian (complex conjugate) transpose and $\mathbf{W}$ represents an $M \times L$ matrix of filter coefficients.

$$\mathbf{y}[n] = \mathbf{W}^T \ast \mathbf{x}[n] \tag{2}$$
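
A direct time-domain realization of Eqs. (1)–(2) can be sketched as follows; this is a minimal sketch assuming real-valued coefficients, and the array shapes are illustrative.

```python
import numpy as np

def filter_and_sum(x, W):
    """Eq. (1): filter each microphone signal with its own length-L FIR
    filter and sum the results into the beamformer output.

    x : (M, N) array of microphone signals x_j[n]
    W : (M, L) matrix of filter coefficients W_{j,p}
    """
    M, N = x.shape
    y = np.zeros(N, dtype=np.result_type(x, W))
    for j in range(M):
        # j-th FIR filter applied to x_j, then summed (a one-sample
        # indexing offset versus Eq. (1) is immaterial for this sketch)
        y += np.convolve(x[j], W[j])[:N]
    return y
```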

#### **2.2 Conventional beamforming**

A conventional beamformer assumes that each filter applies a specific tap delay ($\tau$) to the corresponding array signal in order to align the inputs and achieve the desired directivity in the output. In this sense, each FIR filter has the frequency response given by Eq. (3). The first filter ($p$ = 1) has no associated tap delay, since the signal from the first microphone is taken as the zero-phase reference.

$$r(\omega) = \sum_{p=1}^{L} |W_p| \, e^{-j\omega\tau(p-1)} \tag{3}$$

Assuming that the propagating sound pressure is a complex plane wave with direction of arrival (DOA) θ and frequency ω, the tap delay of the *p*th filter ($\tau_p$) is a function of θ, and Eq. (3) can be rewritten as below.

$$r(\omega) = \mathbf{W}^T D(\omega, \theta) \tag{4}$$

$$D(\omega, \theta) = \left[1,\; e^{j\omega\tau_2(\theta)},\; \dots,\; e^{j\omega\tau_M(\theta)}\right]^T \tag{5}$$

The term $D(\omega,\theta)$ is known as the array response vector. $D(\omega,\theta)$ determines the spatial behavior of the beamformer and is therefore also called the steering vector or direction vector. The simplest solution is to apply a constant delay per array element, the so-called 'delay-and-sum' algorithm. Accordingly, each array signal is delayed by $\tau_p = (p-1)\,\frac{d}{c}\,\sin(\theta)$, where $c$ is the speed of sound (343 m/s at 20°C) and $p$ runs from 1 to $M$.
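
These delay-and-sum quantities can be sketched as below for a uniform linear array; the function names are illustrative assumptions.

```python
import numpy as np

C = 343.0  # speed of sound (m/s) at 20 °C

def tap_delays(M, d, theta_deg):
    """tau_p = (p - 1) * (d / c) * sin(theta) for p = 1..M; the first
    microphone is the zero-delay reference."""
    theta = np.deg2rad(theta_deg)
    return np.arange(M) * (d / C) * np.sin(theta)   # seconds

def steering_vector(M, d, theta_deg, omega):
    """Array response vector D(omega, theta) of Eq. (5)."""
    return np.exp(1j * omega * tap_delays(M, d, theta_deg))
```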

The distance between the array elements, $d$, is an essential geometric constraint with a great effect on the performance of such a delay-and-sum configuration. An important limitation imposed by the microphone spacing ($d$) is the 'spatial aliasing frequency' ($f_{al}$), calculated as $f_{al} = \frac{c}{2d}$, which gives the upper frequency limit of the delay-and-sum system. At this frequency ($f_{al}$), the path difference between the microphones equals half the wavelength ($\lambda$) of the signal (see Figures 3 and 4 of [12]). Therefore, to avoid spatial aliasing, the distance between the microphones ($d$) should be chosen carefully so that $f_{al}$ lies above the frequency range of interest.
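
As a worked example (the 4-cm spacing is an assumed value for illustration, not one from this chapter):

$$f_{al} = \frac{c}{2d} = \frac{343\ \text{m/s}}{2 \times 0.04\ \text{m}} \approx 4.3\ \text{kHz}$$

so such an array would begin to alias spatially above roughly 4.3 kHz, well below the upper end of the wideband speech range.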

In more sophisticated beamformers, the tap delays ($\tau_p$) in Eqs. (4)–(5) are set as functions of the angular frequency ($\omega$) in a *filter-and-sum* configuration. The aim is to control the behavior of the system in different frequency ranges and assure consistent directivity across the entire frequency range of interest. A well-designed filter-and-sum beamformer with tailored frequency-dependent tap delays, $\tau_p(\omega)$, can overcome the upper frequency barrier ($f_{al}$) to a good degree.

If the angles at which the interfering signals arrive are known, it is possible to design the beamformer so that it minimizes the sound intensity (represented by the statistical variance of the data) arriving from those specific angles. In this configuration, called linearly constrained minimum variance (LCMV) beamforming, the filter weights are constrained to pass the desired DOA while placing nulls in the given interference directions.
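
In closed form, the LCMV weights at each frequency minimize the output variance $\mathbf{w}^H \mathbf{R} \mathbf{w}$ subject to $\mathbf{C}^H \mathbf{w} = \mathbf{f}$, giving $\mathbf{w} = \mathbf{R}^{-1}\mathbf{C}(\mathbf{C}^H \mathbf{R}^{-1}\mathbf{C})^{-1}\mathbf{f}$. A minimal NumPy sketch follows; the variable names and two-constraint usage are illustrative assumptions.

```python
import numpy as np

def lcmv_weights(R, C, f):
    """w = R^{-1} C (C^H R^{-1} C)^{-1} f : pass the target direction
    (response 1) and null the interference directions (response 0).

    R : (M, M) spatial covariance of the array signals at one frequency
    C : (M, K) constraint matrix whose columns are steering vectors
    f : (K,) desired responses for those directions
    """
    Rinv_C = np.linalg.solve(R, C)                        # R^{-1} C
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)
```

For a target at $\theta_0$ and one interferer at $\theta_1$, `C` would hold the two steering vectors $D(\omega,\theta_0)$ and $D(\omega,\theta_1)$, with `f = [1, 0]`.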

**Figure 1** presents a one-dimensional beamformer that operates in the *xy* plane, as the DOA ($\theta$) lies in that plane. However, if necessary, it is possible to add microphones along the *z* axis, in which case similar equations, Eqs. (1)–(5), can be written for the *xz* plane with a DOA in that plane. Accordingly, a two-dimensional beamformer is created that filters the *xyz* space with respect to one DOA in the *xy* plane and another DOA in the *xz* plane.

From another perspective, **Figure 1** depicts a 'broadside' beamformer, designed to form a beam toward a target located in the broadside plane of the microphone array. However, if the target is located along the axis of the array (θ = ±90°), the configuration is called 'end-fire' [9]. In an end-fire configuration, the summation in **Figure 1** is replaced by subtraction; accordingly, the filter outputs in Eq. (1) are subtracted instead of summed. Thus, an end-fire beamformer is also called a 'filter-and-subtract' or 'differential' beamformer. This type of beamformer, which can be viewed as a special case of the general beamformer shown in **Figure 1**, forms a beam either above the array axis (θ = 90°) or below the array axis (θ = −90°).
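
For a two-microphone pair, the simplest differential beamformer delays the rear signal by the acoustic travel time $d/c$ and subtracts it from the front signal. Below is a minimal sketch under that assumption; the linear-interpolation fractional delay is a simplification.

```python
import numpy as np

def differential_endfire(x_front, x_rear, fs, d, c=343.0):
    """First-order differential beamformer: delaying the rear microphone
    by d/c before subtraction places a null behind the array axis."""
    delay = (d / c) * fs                              # travel time in samples
    n = np.arange(len(x_rear), dtype=float)
    x_rear_delayed = np.interp(n - delay, n, x_rear)  # fractional delay
    return x_front - x_rear_delayed
```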

#### *2.2.1 Fixed beamforming vs. adaptive beamforming*

In fixed beamforming, the DOA is known and time-invariant; thus, the steering vector, $D(\omega,\theta)$, can be set for a known, fixed geometry. A good example of fixed beamforming is in the automotive industry, where the target talker (the driver) sits in a


fixed location and the DOA toward the microphones is predetermined. Fixed beamforming can be viewed as a 'data-independent' algorithm, since the steering vector is designed solely based on the known geometry of the sound propagation and is independent of the received data. In contrast, in adaptive beamforming, the DOA varies and the steering vector must adapt to the changes in DOA. For example, an adaptive beamformer is needed if the system is supposed to localize and capture signals from all car occupants (besides the driver), who sit at different locations inside the vehicle. In this case, the system should iteratively find the target talker first and then update its steering vector toward that target. Another example of adaptive beamforming is the 'cocktail party' problem, wherein the target location can vary in the room; the system must constantly localize the target, and the beamforming algorithm must adapt to the new DOA and other geometrical factors accordingly. From this perspective, adaptive beamformers can be viewed as 'data-dependent' systems, since their parameters change according to variations in the received data. As a result, adaptive beamformers usually require substantial computational resources [10, 13, 14].

An adaptive beamformer is often accompanied by a pre-processing stage whose task is to localize the target and determine the new DOA. This 'localization' stage usually accomplishes its task by examining the data and finding the optimum DOA that maximizes a specific metric, such as signal strength or speech intelligibility [10, 13–15]. Alternatively, some localization algorithms are built on minimizing a specific cost function, such as the noise and reverberation in the signal. When the localization algorithm finds the DOA, the values in the steering vector ($D(\omega,\theta)$) are adapted to this new angle.
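
One widely used localization approach of this kind is the generalized cross-correlation with phase transform (GCC-PHAT), which estimates the time difference of arrival (TDOA) between a microphone pair; the DOA then follows from the array geometry. The sketch below is a generic textbook version, not a specific algorithm from [10, 13–15].

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Estimate the TDOA between two microphones via GCC-PHAT; for a pair
    spaced d apart, theta = arcsin(c * tdoa / d) then gives the DOA."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # phase transform
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs        # TDOA in seconds
```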

There are some relatively newer solutions that merge the 'localization' and 'beamforming' stages. Warsitz and Haeb-Umbach [14] presented an algorithm that optimizes the FIR filter coefficients (denoted by $\mathbf{W}$ in Eqs. (1)–(2) above) by iteratively estimating and maximizing the cross power spectral density of the microphone signals. An important feature of this algorithm is that the filter coefficients are optimized directly, without localizing the source. In other words, the DOA information is implicitly absorbed into the optimization problem, although it is possible to extract the underlying DOA information from the results afterwards if needed.

#### **2.3 Neural-based adaptive beamforming in speech recognition applications**

Speech signal enhancement (SSE) techniques, such as beamforming, have traditionally been performed as an independent pre-processing stage ahead of speech recognition back ends [13, 15]. In this conventional setup, SSE algorithms improve the signal-to-noise ratio (SNR) by reducing ambient noise and reverberation in the captured signal. The output of the SSE stage is then fed into acoustic models, usually deep neural networks, which perform the automatic speech recognition (ASR) task.

In the last few years, adaptive beamforming algorithms have been designed that are tuned jointly with the speech recognition back end [13, 15, 16]. To do so, the FIR coefficients (shown as $\mathbf{W}$ in **Figure 1** and in Eqs. (1)–(2)) are trained together with the parameters of the ASR model, where the optimization is performed using a gradient learning algorithm. The goal of this optimization is to find FIR coefficients that result in higher ASR accuracy.

Several neural-network approaches have been developed to address the ASR problem [15], but the most successful ASR models are currently built on the convolutional, long short-term memory, deep neural network (CL-DNN) concept [13, 15]. The input is filtered by a time-domain filterbank pre-processor, usually a Gammatone filterbank together with a nonlinearity, which is intended to loosely mimic the

human auditory periphery (cochlea) in terms of spectral feature extraction and compression [17]. The output is then fed into the CL-DNN model. The first stage in the CL-DNN model is the *fconv* layer, which convolves the output signals across the filterbank channels; the results are then pooled along the frequency axis. The next stage comprises a number of long short-term memory (LSTM) layers; an LSTM network is a specific type of recurrent neural network tailored for recognizing sequential time-series data such as audio. The final stage is a fully connected DNN consisting of at least 1024 hidden units [13, 15, 16].
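
A minimal PyTorch sketch of these stages is given below; the layer sizes, the number of LSTM layers, and the output dimension are illustrative assumptions rather than the exact configuration of [13, 15, 16].

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """Minimal CL-DNN sketch: frequency convolution + pooling, stacked
    LSTM layers, and a fully connected output stage."""

    def __init__(self, n_channels=128, n_classes=42):
        super().__init__()
        # 'fconv': convolve across filterbank channels, then pool in frequency
        self.fconv = nn.Conv1d(1, 8, kernel_size=8)
        self.pool = nn.MaxPool1d(3)
        feat = 8 * ((n_channels - 8 + 1) // 3)   # flattened size after pooling
        self.lstm = nn.LSTM(feat, 256, num_layers=3, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(),
                                 nn.Linear(1024, n_classes))

    def forward(self, x):            # x: (batch, time, filterbank_channels)
        b, t, f = x.shape
        z = self.fconv(x.reshape(b * t, 1, f))   # convolve each frame in frequency
        z = self.pool(z).reshape(b, t, -1)       # pool along the frequency axis
        z, _ = self.lstm(z)                      # model the temporal sequence
        return self.dnn(z)                       # per-frame state scores
```

Calling `CLDNN()(torch.randn(2, 50, 128))` returns per-frame scores of shape `(2, 50, 42)` for a batch of two 50-frame utterances.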

Sainath et al. [13, 16] presented a multi-microphone solution to incorporate the data captured by an array of *M* microphones into the CL-DNN model. They replaced each spectral channel of the Gammatone filterbank pre-processor with FIR filters that are connected to the microphones and are used for beamforming (identical to Eqs. (1) and (2) above), essentially creating a filter-and-sum beamformer per spectral channel. The difference is that the tap delays ($\tau_p$), and therefore the DOA data, are implicitly absorbed into the FIR coefficients, similar to the earlier work by Warsitz and Haeb-Umbach [10]. Sainath et al. [13, 16] trained the beamforming FIR coefficients together with the CL-DNN parameters using a gradient learning algorithm to maximize ASR accuracy. Sainath et al. [16] showed that during training, the FIR coefficients are optimized to extract both spectral and spatial features of the incoming speech signals. They showed that the multi-microphone ASR model with joint beamforming achieves an over 10% improvement in word error rate (WER) compared to its single-microphone counterpart.

Besides excellent ASR accuracy, a major benefit of neural-network-based beamforming is that the model is, to a great extent, independent of the array spacing, whereas conventional beamforming relies on prior knowledge of the distance between the microphones ($d$) to calculate the tap delays. Due to its remarkable success in ASR, neural-based beamforming is becoming prevalent in ASR systems that have access to multiple-microphone input. A very good candidate for this technique is automotive ASR, wherein online voice assistants based on it are currently being designed and evaluated.

A potential shortcoming of neural-based beamforming is that the source localization information (i.e., DOA) is implicitly embedded in the model and might not be extractable or interpretable in terms of physical geometry. This could impose a limitation in applications that require explicit knowledge of the source location. Furthermore, an important distinction is that neural-based beamforming parameters are tuned solely based on ASR objectives and might not necessarily improve the audio quality (e.g., SNR) with regard to human psychoacoustics [13, 15, 16]. Therefore, neural-network-based beamforming is currently considered more applicable to speech recognition tasks than to applications such as telephony, wherein human listeners are involved. The feasibility of neural-network-based beamforming for telephony applications and its relation to human psychoacoustics need further investigation.

#### **2.4 Beamforming applications in automotive industry**

Beamforming techniques were introduced into the automotive industry at almost the same time that the first automotive hands-free telephony and speech dialog systems were being devised [1]. Although there have been some studies using larger microphone arrays [18], it is by far more common to have only dual microphones available in vehicles for beamforming. There are two main reasons for this: the first is production cost, and the second is the complications in the vehicle's interior design and the excess wiring if multiple microphones are used. Therefore, in


the following sections regarding automotive applications, two-microphone solutions are in focus.

**Figure 2(A)** shows a car with two dedicated microphones (marked M1 and M2) mounted in the car ceiling about 4.8 cm apart (*d* = 4.8 cm). The DOA is ideally around 90 degrees according to the illustrated coordinates. To provide a fixed beamforming solution for this particular geometry, an 'end-fire' differential beamformer should be used, since the desired source is located along the axis of the array (θ = ±90°). The input signals are filtered according to Eqs. (1)–(5) and then subtracted. The frequency-dependent tap delays for microphone M2 (i.e., $\tau_2(\omega)$) were chosen to enable the steering vector to enhance sounds arriving from the driver's side (θ = 90°).

**Figure 2** shows the beam patterns resulting from an end-fire filter-and-sum beamformer in which the tap delays for the M2 microphone (i.e., $\tau_2(\omega)$) have been adjusted as a function of frequency over several frequency channels covering the range from 0.1 to about 7 kHz. **Figure 2(B)** shows the beam pattern at 1 kHz. This beam pattern demonstrates that sounds from θ = 90° (driver side) pass through the system, whereas sounds from θ = −90° = 270° are substantially attenuated. A very similar beam pattern is shown in **Figure 2(C)** at 2 kHz, although the beam pattern at 4 kHz, shown in **Figure 2(D)**, deviates somewhat, with minimal effect on overall performance.

#### **Figure 2.**

*(A) A car cabin geometry with dual microphones mounted in the car ceiling. (B) The beam pattern achieved by the described end-fire (differential) beamformer at 1 kHz, (C) at 2 kHz, and (D) at 4 kHz.*
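
Beam patterns like those in **Figure 2(B–D)** can be reproduced in principle by sweeping the DOA at a fixed frequency; below is a minimal sketch for the two-microphone differential case, where the choice $\tau_2 = d/c$ (a rear-facing null) is an assumption for illustration.

```python
import numpy as np

def beampattern_db(d, freq, tau2, c=343.0, angles_deg=np.arange(360)):
    """Magnitude response (dB) versus DOA for y = x1 - delay(x2, tau2):
    the response is 1 - exp(-j*omega*(tau2 + (d/c)*sin(theta)))."""
    omega = 2 * np.pi * freq
    tau_acoustic = (d / c) * np.sin(np.deg2rad(angles_deg))
    response = 1 - np.exp(-1j * omega * (tau2 + tau_acoustic))
    return 20 * np.log10(np.abs(response) + 1e-12)

# e.g., d = 4.8 cm at 1 kHz: a deep null appears at theta = 270 degrees,
# consistent with the attenuation described for Figure 2(B)
pattern = beampattern_db(0.048, 1000.0, tau2=0.048 / 343.0)
```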

The presented beamformer was tested in situ by placing a head-and-mouth simulator at the typical location of the driver's head and playing standard hearing-in-noise test (HINT) sentences [19] while the engine was idling, creating stationary background noise. The raw signals captured by the M1 microphone were recorded. The test was then repeated while applying the described beamforming to the raw signals, and the beamformed output was compared to the raw signal. The results showed a signal-to-noise ratio improvement (SNRI) of 5.7 dB(A) across frequencies between 0.1 and 8 kHz.

**Figure 3** shows the beamforming geometry in a large truck cabin, wherein a dual-microphone array is installed on the overhead compartment. The distance between the array and the driver's mouth is about 0.4 m, and the DOA is approximately 30 degrees (θ = 30° in the *zy* plane), although these numbers vary depending on the height and other biometrics of the driver. The distance between the two microphones (*d*) is 23 mm, which yields a higher spatial aliasing frequency ($f_{al}$), and thus a higher upper frequency limit, than the system shown in **Figure 2**. In this case, an end-fire beamformer can be designed to form a beam downward toward the cabin floor (θ = 90°). The drawback is that some engine noise and AC fan noise will also leak into the beamformer, since these noise signals originate from the dashboard, which is likewise located below the overhead compartment.

Alternatively, a broadside beamformer can be used to direct the beam toward the DOA of θ = 30°. As a well-known drawback, the broadside configuration also amplifies the angle that is 180 degrees behind the DOA (i.e., 30 + 180 = 210

#### **Figure 3.**

*A truck cabin geometry with a dual microphone installed in an overhead compartment, forming a DOA of approximately 30 degrees (θ = 30°) toward the mouth of a 180-cm-tall male driver. The yellow arrow shows the direct sound propagation from the driver's mouth, whereas the green arrow shows the noise signal propagation (engine noise and AC fan noise).*


degrees in this case). This is because broadside beamforming, characterized by Eqs. (1)–(4), is ambiguous about the front-back axis: any sound coming from θ + 180° is treated the same as sound from θ. However, since the overhead console acts as a mechanical damper for sounds and vibrations coming from the roof and from behind, the broadside solution appears to be the better option in this practical case.

A broadside filter-and-sum beamformer has been devised with tailored frequency-dependent tap delays to facilitate consistent beamforming toward the driver at frequencies between 0.1 and 8 kHz. The in-situ measurements and beam patterns are not finalized yet; however, preliminary analysis indicates that the system can achieve an SNRI of about 6 dB when the engine is idling. The described beamformer operates in the *xz* plane. However, a third microphone could be added along the *y* axis, next to the existing pair, to perform beamforming in the *xy* plane as well. This new beamformer in the *xy* plane could be tuned to attenuate sounds arriving from the co-driver's side, although adding multiple microphones is currently uncommon in vehicles.
