First, cochlear implant users may have variable patterns of auditory nerve survival, which could cause the current to stimulate auditory fibers that are too apical or too basal relative to the intended location. "Dead regions" in the auditory nerve may also prevent adjacent masking electrodes from stimulating distinct place-frequencies. Second, different surgical procedures could result in variability between the insertion depths of users' electrode arrays. This variability could be significant, since neural activation patterns depend on the density of neurons in a particular region of the cochlea and on the radial distance between the electrode array and the neural targets in the modiolus.

In general, cochlear implant subjects have exhibited electric-masking patterns that are much broader than those observed in normal hearing [14–20]. A reduction in the magnitude of contralateral versus ipsilateral masking functions was observed in [18], but this reduction was not as great as that observed in normal hearing. Together, these findings of broader masking with cochlear implants both support and are supported by the concepts of:

**1.** Distorted *profile analysis*, where cochlear implant users are unable to adequately use across-channel cues to compare the shape of the output of different auditory filters.

**2.** Reliance on *temporal critical bands*, where cochlear implant users relied on filter bandwidths that were consistently broader than predicted by critical band theory.

**3.** Rapidly changing *transition bandwidths*, where cochlear implant users used separate and different auditory processes to handle either narrow or wide bandwidths.

**4.** Distorted *volley theory* of encoding, where cochlear implant users were unable to combine the phases of action potentials to analyze a greater frequency of sound.

These findings all decrease confidence in previous estimates of peripheral filtering, as well as in the assumptions made in *critical band theory*, especially when combined with recent evidence in normal-hearing listeners that suggests a flexible selection of the spectral regions upon which to base across-frequency comparisons [21]. Furthermore, the wide bandwidths observed in the initial filters of the cochlear implant subjects directly contradict the theory that extraction of envelope information should be constrained to a single auditory filter, as theorized in [22]. For these reasons, *transition bandwidths* [7, 23] offer the most plausible explanation, as they account for patterns observed in both normal-hearing and electric-hearing experiments (this concept allows an interplay between temporal and spectral processes). In the following section on machine hearing research, a novel method based on "*deep learning*" is used to test the computational efficiency of transition bandwidths in artificial neural network systems.

#### **3. Machine hearing research**

#### **3.1. Motivation to compare human and machine hearing systems**

There are several factors that have confounded cochlear implant research. Psychoacoustic experiments are often rendered inconclusive by the large individual variability among cochlear implant subjects. The physical limitations of uncontrollable test populations include variable nerve survival before implantation, inter-implant intervals and usage time, neuroplasticity after implantation, age at testing, and surgical insertion depths. These physical limitations have significant effects on the amount of masking [16, 17]. In addition, individual variability can arise from cognitive factors or subjective testing protocols. For instance, the evaluation of cochlear implants can be unintentionally influenced by decision rules or dynamic ranges based on loudness judgment, visual feedback, sequential test order, unrealistic simulations, and even the content material used in subjective studies such as speech recognition [24, 25]. Therefore, alternative methods such as computational simulations and mathematical models should be used to account for these uncontrollable factors of individual variation.

We can only appreciate how sophisticated the human auditory system truly is when trying to simulate perceptual processing on a computer. By building a computational model, we gain new insight and develop quantitative ways to analyze each step of signal processing. Computational models are well suited to investigate how information from independent fibers is distributed and the extent to which distinct bandpass filterbanks are constructed within neural architectures [4]. In this chapter, we use artificial neural networks to measure the response properties of auditory fibers using realistic representations of different integrative processes. Machine learning can provide algorithms for understanding learning in neural systems and can even benefit from these ongoing biological studies [26].

#### **3.2. Deep neural networks (DNNs)**

#### *3.2.1. Automatic speech recognition: an auditory perspective*

There have been many attempts to incorporate principles of human hearing into machine systems [27]. The motivation for these attempts was that human perception is far more stable than machine processing across a range of sources of variability. It was therefore reasonable to expect that functional modeling of human subsystems could provide plausible direction for machine research. One of the first auditory-inspired features (**Figure 2**) was based on the mel-scale warping of the spectral frequency axis (referred to as "FBANK"), which is then parameterized as mel-frequency cepstral coefficients (referred to as "MFCC") [28]. The usual objective when selecting a representation is to compress the input data by eliminating information that is not pertinent for analysis and to enhance those aspects of the signal that contribute significantly to the detection of differences. In *automatic speech recognition* (ASR), these MFCC features were shown to allow better suppression of insignificant spectral variation in the higher-frequency bands. Concatenating other types of auditory-inspired spectro-temporal features with MFCCs can also boost performance [29]. In [30], cochlear implant speech synthesized from subband temporal envelopes was shown to contain sufficient information to rival MFCC features in terms of accuracy. These acoustic simulations of cochlear implants [31] were subsequently proposed as general indicators for conducting useful subjective studies. In [32], cross-disciplinary methods from cognitive science and machine learning were combined to promote shared computational foundations [33]. Our study [32] expanded on [30] by comparing cochlear implant results using the Bayesian model of human concept learning [34] and proposed hidden Markov models (HMMs) for computationally predicting cochlear implant performance. In the next sections, we will further expand upon previous studies by introducing state-of-the-art tools based on *deep neural network* algorithms and presenting new results comparing the efficiency of profile analysis, temporal critical bands, and transition bandwidths in cochlear implant simulations.
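For illustration, the FBANK-to-MFCC pipeline described above can be sketched in a few lines. This is a minimal sketch using the librosa library; the filename, 40-filter bank size, and 25 ms/10 ms framing are illustrative assumptions rather than the exact settings of [28].

```python
# Minimal FBANK -> MFCC sketch (illustrative parameters, not the exact
# configuration used in [28]).
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# FBANK: mel-scale warping of the power spectrum with 40 triangular
# filters (25 ms windows, 10 ms hop at 16 kHz).
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                       hop_length=160, n_mels=40)
log_fbank = librosa.power_to_db(fbank)  # log compression -> "FBANK"

# MFCC: a DCT decorrelates the log mel energies; keeping only the
# lowest-order cepstral coefficients discards fine spectral variation,
# particularly in the higher-frequency bands.
mfcc = librosa.feature.mfcc(S=log_fbank, n_mfcc=13)
```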

#### *3.2.2. Spectral filterbank (FBANK) features as input to deep learning systems*

Deep neural networks (DNNs) (**Figure 8**) use gradient-based optimization algorithms to adjust parameters throughout a multilayered network based on the errors at its output [35]. In DNNs, multiple processing layers learn representations of data with multiple levels of abstraction. In deep hierarchical structures, the internal layers of DNNs provide learned representations of the input data. The benefit of studying filterbank learning in DNNs is that the filterbank can be viewed as an extra layer of the network, whose parameters are updated along with the parameters of subsequent layers [36, 37].

**Figure 8.** Structure of a deep neural network (DNN). Deep learning allows multiple layers of nonlinear processing.
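To make the "filterbank as an extra layer" idea concrete, the following is a minimal PyTorch sketch. The layer sizes, senone count, and the positivity/log constraints are illustrative assumptions (loosely following the positive-weight log nonlinearity reported in [36]) rather than the exact architecture of [36, 37].

```python
# A minimal sketch of a filterbank treated as a trainable first layer.
import torch
import torch.nn as nn

class LearnableFilterbankDNN(nn.Module):
    def __init__(self, n_fft_bins=257, n_filters=40, n_senones=2000):
        super().__init__()
        # In practice the filterbank is initialized from mel filters;
        # here it starts random and is kept positive via exponentiation.
        self.log_filterbank = nn.Parameter(torch.randn(n_filters, n_fft_bins))
        self.dnn = nn.Sequential(
            nn.Linear(n_filters, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_senones),
        )

    def forward(self, power_spectrum):
        # (batch, n_fft_bins) @ (n_fft_bins, n_filters): the filterbank
        # receives gradients like any other layer of the network.
        energies = power_spectrum @ self.log_filterbank.exp().t()
        return self.dnn(torch.log(energies + 1e-6))
```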

In machine learning, speech is viewed as a two-dimensional signal where the spatial and temporal dimensions have vastly different characteristics. For instance, the time-dynamic information in the high-frequency regions differs from that in the low-frequency regions. Although FBANK is popular, Sainath et al. [36, 37] argued that features designed according to critical band theory might not be appropriate for the end goal of reducing error rates. Since the power spectrum is computed over a fixed window length, it removes information from the signal, and FBANK features therefore often lack the necessary temporal information. By starting with a raw signal representation to learn filterbanks jointly in a DNN framework, the results computed in [36] share many similarities with concepts from psychoacoustic studies:

**1.** Consistent with *critical band theory*, the computational results showed a similarity between the learned filters and mel-filters in the low-frequency regions.

**2.** Consistent with *transition bandwidths*, the computational results showed the learned filters had multiple peaks in the mid-frequency regions (indicating that multiple important critical frequencies are being picked up, rather than just one, as with the mel filters).

**3.** Consistent with the *volley theory*, the computational results showed the learned filters are high-pass compared to mel-filters (which are bandpass in the high-frequency regions).

#### *3.2.3. Temporal envelope bank (TBANK) features*

Our work in [13, 32] derived an alternative input feature (**Figure 4**) for ASR based on a temporal envelope bank (referred to as "TBANK"), which was inspired by *temporal critical bands* [6] and the broad temporal masking patterns of cochlear implants [16, 17]. The TBANK features have been evaluated as an input feature for DNNs [38] and as a temporal alignment feature for DNNs [39]. In the present study, we combine FBANK and TBANK features to improve the representation of the temporal dimension and its correlation with the frequency (spatial-domain) properties in DNNs. **Figure 9** shows the FBANK+TBANK (referred to as ⧠⧠, "double-BANK") features, which were inspired by psychoacoustic results showing the flexible usage of across-channel cues [21], transition bandwidths [7, 23], and the volley theory [8].

**Figure 9.** A simplified FBANK+TBANK (referred to as ⧠⧠, "*double-BANK*") representation of speech.

In Section 3.3, we use the same procedures as in [38, 39] for the Aurora-4 robustness task (with a cochlear implant speech processor available at www.tigerspeech.com/angelsim).

#### **3.3. Computational results**

#### *3.3.1. Comparison of FBANK and TBANK on a computational ASR task*

"Raw" TBANK features were derived from 32 channels of band envelope (**Figure 4**) via white-noise carriers [38]. These features were designed to preserve temporal and amplitude cues in each spectral band, but remove the spectral detail within each band as explained in [12]. ∆ and ∆∆ dynamic features were computed from derivative values with respect to time [40]. In **Table 1**, context‐dependent DNN‐HMMs were trained using 40‐dimensional FBANK (described in [36]), 120‐dimensional FBANK+∆+∆∆ (described in [41]), and our 80-dimensional ⧠⧠ (FBANK+TBANK) input representation. It should be noted that the computational cost of changing the size of the input layer is negligible. **Table 1** shows inclusion of TBANK in the ⧠⧠ (FBANK+TBANK) input features yielded a 14% improvement compared to FBANK and an 11% improvement over the FBANK+∆+∆∆ representation.

**Table 1.** DNN performance (error rate %) on clean training set.
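The band-envelope derivation behind the TBANK features can be sketched as follows. This is a minimal sketch in which the band count, log-spaced band edges, filter order, and frequency range are illustrative assumptions rather than the exact processing of [12, 38].

```python
# Band-envelope (TBANK-style) extraction and white-noise-carrier
# resynthesis, with illustrative parameter choices.
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def band_edges(n_bands, fmin=80.0, fmax=7800.0):
    return np.geomspace(fmin, fmax, n_bands + 1)

def tbank_envelopes(y, sr, n_bands=32):
    # Bandpass-split the signal and keep only each band's Hilbert
    # envelope: temporal/amplitude cues survive, while spectral detail
    # within each band is discarded.
    edges = band_edges(n_bands)
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        envs.append(np.abs(hilbert(sosfiltfilt(sos, y))))
    return np.stack(envs)  # shape: (n_bands, n_samples)

def noise_vocode(envs, sr):
    # Cochlear-implant-style simulation: each envelope modulates a
    # band-limited white-noise carrier, and the bands are summed.
    n_bands, n = envs.shape
    edges = band_edges(n_bands)
    rng = np.random.default_rng(0)
    out = np.zeros(n)
    for env, lo, hi in zip(envs, edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        out += env * sosfiltfilt(sos, rng.standard_normal(n))
    return out
```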

#### *3.3.2. Comparison of FBANK and TBANK on temporal alignment task*

**Table 2** shows error rates for *Gaussian mixture model* (GMM)-HMMs trained and tested on TBANK (**Figure 4**) alignment features (**Figure 10**) with a white-noise carrier [13, 39]. The TBANK models aligned the training data to create senone labels for training the DNN. The results in **Table 2** show that the temporally aligned DNN gives fewer errors when subsequently trained and tested on FBANK features.


**Table 2.** Comparison of different tree‐building features to generate a state‐level alignment on the training set. TBANK features had 16, 24, or 32 band envelopes via white-noise carrier.

| Tree‐building features | Error % (GMM) | Error % (DNN) |
| --- | --- | --- |
| MFCC | 5.08 | 2.88 |
| 16 band envelopes | 5.44 | **2.80** |
| 24 band envelopes | **5.03** | **2.63** |
| 32 band envelopes | 5.90 | **2.82** |

Note: **Bold** indicates better score.

**Figure 10.** Generation of temporal alignment features using extracted TBANK, as in [39].
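As a toy stand-in for this alignment step, the sketch below fits a GMM-HMM to TBANK frames and uses its Viterbi state sequence as frame-level targets. Real systems perform transcript-constrained forced alignment over senones (as in [38, 39]); the hmmlearn-based helper here only illustrates the data flow, and all parameters are assumptions.

```python
# Toy stand-in for the alignment step in Figure 10.
from hmmlearn.hmm import GMMHMM

def frame_labels_from_tbank(tbank_frames, n_states=10, n_mix=4):
    # tbank_frames: array of shape (n_frames, n_bands).
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)
    model.fit(tbank_frames)
    # Viterbi decoding assigns one state per frame; these ids play the
    # role of the senone labels used to train the DNN.
    return model.predict(tbank_frames)
```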


Designing a representation that both preserves relevant detail in the speech signal and provides stability/invariance to distortions is a nontrivial task. Therefore, **Figure 11** shows how slowly varying amplitude modulation (AM) and frequency modulation (FM) are derived from speech to design novel features (referred to as frequency amplitude modulation encoding, "FAME") with different modulations, as proposed in [42] and computed for ASR in [31, 39]. In the FAME condition, the FM is smoothed in terms of both rate and depth and then modulated by the AM. The "slow" FM tracks gradual changes around a fixed frequency in each subband. The FAME stimuli are obtained by additionally frequency modulating each band's center frequency before amplitude modulation and subband summation. Finally, FAME stimuli were used to derive alternative features via extracted TBANK (**Figure 10**) for tree building and temporal alignment in GMM systems.

**Figure 11.** Signal processing diagram of *frequency amplitude modulation encoding* (FAME).
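A minimal per-band sketch of this AM/FM decomposition is given below. The analytic-signal demodulation follows the general FAME idea in [42], but the smoothing cutoffs and FM-depth limit are illustrative assumptions.

```python
# Per-band FAME sketch: the analytic signal supplies the AM (envelope)
# and the instantaneous frequency; the FM is smoothed in rate (lowpass)
# and depth (clipping), re-centered on the band's center frequency fc,
# and the resulting carrier is amplitude modulated.
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def fame_band(band, sr, fc, am_cut=50.0, fm_cut=400.0, fm_depth=500.0):
    analytic = hilbert(band)
    # Instantaneous frequency (Hz) from the phase derivative.
    inst_f = np.gradient(np.unwrap(np.angle(analytic))) * sr / (2 * np.pi)
    # Slow FM: lowpass the deviation from fc, then limit its depth.
    sos_fm = butter(2, fm_cut, btype="lowpass", fs=sr, output="sos")
    fm = np.clip(sosfiltfilt(sos_fm, inst_f - fc), -fm_depth, fm_depth)
    # Smoothed AM from the envelope.
    sos_am = butter(2, am_cut, btype="lowpass", fs=sr, output="sos")
    am = np.maximum(sosfiltfilt(sos_am, np.abs(analytic)), 0.0)
    # Frequency modulate around fc, then amplitude modulate; summing
    # such bands yields the FAME stimulus.
    phase = 2 * np.pi * np.cumsum(fc + fm) / sr
    return am * np.cos(phase)
```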

Compared to band-envelope features (**Figure 4**), **Table 3** shows that AM and FM lowered the error rate in the GMM systems used during forced alignment to generate frame-level DNN training labels. These results expand upon [31] and provide additional evidence that FAME preserves more of the relevant detail than other carriers.


**Table 3.** Comparison of alternative TBANK features for GMM alignment systems.

**Table 4** shows error rates for GMM systems trained and tested on an additive configuration of ⧠⧠ (MFCC + FAME) (**Figure 12**). This configuration was inspired by a frequency-dependent model that explains the loudness function in human auditory systems [43]. In this two-stage model, the first stage of processing is performed by a mechanical mechanism in the cochlea (for high-frequency stimuli) and by a neural mechanism in the cochlear nucleus (for low-frequency stimuli). **Table 4** shows that the DNN gives fewer errors during time alignment with this additive configuration of ⧠⧠ (MFCC + FAME) when subsequently trained and tested on FBANK. By digitally adding the high-frequency FAME information (via TBANK) to the low-frequency MFCC information, this additive ⧠⧠ (MFCC + FAME) feature representation allowed a better alignment in GMM systems during the generation of senone training labels for training the DNN.

**Figure 12.** A digitally additive configuration of ⧠⧠ alignment features (MFCC + FAME).
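Interpreting the additive configuration as appending frame-rate envelopes of a few FAME-processed high-frequency bands to the low-frequency MFCCs, a minimal sketch looks as follows. The helper name, dimensions, and the three-band choice (cf. **Table 4**) are illustrative assumptions.

```python
# Additive MFCC + FAME alignment features (Figure 12), sketched as
# feature concatenation along the frame axis.
import numpy as np

def additive_mfcc_fame(mfcc, fame_envs, n_fame_bands=3):
    # mfcc: (n_frames, n_ceps); fame_envs: (n_bands, n_frames) envelopes
    # of FAME-processed bands, highest-frequency bands last.
    high = fame_envs[-n_fame_bands:].T            # three topmost bands
    return np.concatenate([mfcc, high], axis=1)   # (n_frames, n_ceps + 3)
```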

**Table 5** provides further analysis [39] of deletion, substitution, and insertion errors to quantify the effects of the digitally additive ⧠⧠ (MFCC + FAME) configuration. Misclassification leads to substitution errors. Imperfect segmentation leads to deletion errors (when some sounds are missed entirely) or insertion errors (from extra boundaries). By reducing the extra segment boundaries, **Table 5** shows how front-end FAME processing, using FM extraction in the high-frequency regions, addresses the segregation and binding problems [42] in ASR systems. The ⧠⧠ results also demonstrate the computational efficiency of multiple filterbanks, which supports temporal critical bands [6] and transition bandwidths [7, 23].


**Table 4.** MFCC vs. ⧠⧠ alignment feature (three additive FAME bands at high‐frequency regions).

| Tree‐building features | WER% (GMM) | WER% (DNN) |
| --- | --- | --- |
| MFCC | 5.08 | 2.88 |
| MFCC + FAME | **4.82** | **2.75** |

Note: **Bold** indicates better score.


**Table 5.** Error type (deletion, substitution, insertion) analysis.

| Tree‐building features | Deletion, substitution, insertion (GMM) | Deletion, substitution, insertion (DNN) |
| --- | --- | --- |
| MFCC | 19, 189, 64 | 13, 114, 27 |
| MFCC + FAME | 20, **177**, **61** | 18, **101**, 28 |

Note: **Bold** indicates fewer errors.
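The deletion, substitution, and insertion counts in **Table 5** come from a standard Levenshtein alignment between reference and hypothesis word sequences. A minimal sketch (not the actual scoring tool used in [39]) is shown below.

```python
# Tally deletion / substitution / insertion errors via a standard
# Levenshtein alignment with backtracking.
def error_types(ref, hyp):
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    dels = subs = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            subs += ref[i - 1] != hyp[j - 1]   # match costs nothing
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return dels, subs, ins
```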

#### *3.3.3. Discussion relating human hearing with machine hearing research*

Many computational algorithms [9, 27–31, 35, 40–42] have been inspired by auditory processing pathways in the human nervous system. Traditionally, critical band theory has been accepted as the basis for baseline DNN features [35], owing to the ability of MFCC and FBANK to suppress insignificant spectral variation in the higher-frequency bands. However, recent progress in deep learning has produced computational models composed of multiple layers of parallel processing that learn representations of filterbank features with multiple levels of abstraction. In fact, some DNN researchers have questioned the efficiency of spectral features derived from critical band theory. In [36], error rates were shown to improve by learning the filterbank rather than using a fixed set of filters. Therefore, [36] was the first computational study to contradict the energy detector or power spectrum models of critical band theory using purely quantitative and statistical results.

In the present study, we compared masking in cochlear implants with ⧠⧠ (FBANK+TBANK) input representations in deep learning. The data provide statistical evidence supporting the efficiency of profile analysis, temporal critical bands, and transition bandwidths. Therefore, results in both human hearing and machine hearing oppose the historically accepted critical band theory. Furthermore, all of these findings decrease confidence in previous estimates of peripheral filtering as presumed in [3, 19, 22] and adopted in [27, 28]. Moreover, the similarity and compatibility of the results in human and machine hearing could provide new insight into the ability to process sound and may lead to advances in cochlear implant methods [44] or alternative neural network architectures [45]. For example, [36] indicated that a perceptually motivated log nonlinearity was appropriate in deep learning, since their results showed that a log nonlinearity with positive weights was preferable.

#### **4. Conclusions**


In this chapter, we presented psychoacoustic results that support a recent theory in auditory processing [7, 23]: that the auditory system is actually composed of multiple filterbanks in the processing of sound (instead of just a solitary peripheral filterbank as previously assumed). Psychoacoustic results using electric stimulation in cochlear implant users suggest distorted *profile analysis* (where users are unable to adequately use across-channel cues to compare the shape of the output of different auditory filters), a reliance on *temporal critical bands* (where users relied on filter bandwidths that were consistently broader than predicted by critical band theory), rapidly changing *transition bandwidths* (where users employed separate and different auditory processes to handle either narrow or wide bandwidths), and a distorted *volley theory* of encoding (where users were unable to combine phases of action potentials to analyze a greater frequency of sound). In addition, the results from our deep learning system confirmed the computational effectiveness of combining both spectral filterbanks (FBANK) and temporal filterbanks (TBANK). The combined input representations (each with its own filtering properties) are formed into ⧠⧠ (double‐BANK) features to improve the processing of information in multiple parallel processes. These ⧠⧠ features all outperformed FBANK features in deep neural network (DNN) systems.
