their excitation along the length of the basilar membrane (**Figure 1**). The connecting nerve fibers show a bandpass response to the input signal, where the density of firings for a particular fiber varies with the stimulus intensity over a certain range. Basic information from a sound is then extracted and passed to subsequent stages for perceptual machinery to present its own construction of reality. Noninvasive methods are needed to investigate the influence of these higher-level perceptual processes on the properties of the cochlea. For example, the method of psychophysical inference is often used to fill in the gaps of physiological knowledge.

**Figure 1.** Spatial arrangement of cochlear hair cells along the basilar membrane (base to apex).

This chapter will be divided into two sections. In the section on human hearing research, we will predict neural firing characteristics from the perspective of psychoacoustic "*masking*" experiments. In the section on machine hearing research, we will compare artificial neural network input from the perspective of machine learning experiments. Experimental data is presented in both human hearing and machine hearing to supplement the incompleteness of current neurophysical methods by providing new insight into the stages of processing.

**2. Human hearing research**

#### **2.1. Auditory masking**

#### *2.1.1. Spectral masking*

The masking phenomenon of one tone by another provides quantitative data on frequency selectivity and the dynamical theory of the cochlea [1]. In a psychoacoustic experiment, the testing stimulus is called the probe, the sound that interferes with the detection of the probe is called the masker, and the amount of masking refers to the amount by which the hearing threshold of the probe is raised in the presence of the masker. The method of measuring a threshold shift is straightforward. First, the detection threshold of the probe is determined in quiet. Next, the threshold of the probe is measured again in the presence of the masker; the difference between the two is the threshold shift. Auditory masking curves have established wide-ranging mathematical relationships between sensory behavioral responses and even the activity of single neurons [2]. In general, the probe is most easily masked by sounds with frequency components that are close to the probe.

#### *2.1.2. Critical band theory*

The "*critical band theory*" [3] has played an important role in hypothesizing how the auditory system resolves the components of complex sound. In classic band‐widening experiments, the threshold of a sinusoidal probe is measured as a function of the bandwidth of a noise masker. First, a noise with a constant power density is centered at the probe frequency. As the bandwidth increases, the total noise power increases (which presumably has effects on the threshold for detecting the probe).

Masking curves have shown that the threshold of the probe increases at first, but flattens off as the addition of more noise (at a greater distance from the probe frequency) produces no additional masking. The bandwidth at which the probe threshold ceases to increase is called the "*critical bandwidth*." To account for these observations, the listener is assumed to make use of a filter with a center frequency close to the probe when detecting the probe in noise. According to this critical bandwidth theory, noise outside the range of the filter should have no effect on detection. If this filter passes the signal and removes much of the noise, then only the components of the noise that pass through the filter should have any effect in masking the probe. Therefore, thresholds should correspond to a certain signal-to-noise ratio at the output of the auditory filter.

According to this *power spectrum model* of masking, all stimuli are represented by their long-term power spectra; the relative phases of the components and short-term fluctuations in the masker are ignored. **Figure 2** illustrates a typical energy detector model. Although *energy detection models* remain fundamental to theories of auditory perception, the axiom that detection relies on the energy passing through a single auditory filter has been contradicted multiple times [4–7]. These findings violate the fundamental assumption of critical band theory and therefore challenge previous estimates of peripheral filtering.
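The band-widening prediction of the power spectrum model can be illustrated with a minimal sketch. It assumes an idealized rectangular auditory filter and a fixed signal-to-noise ratio at threshold; the function name, the 160-Hz critical bandwidth, and the 30-dB spectrum level are hypothetical values chosen only for illustration and are not taken from the cited studies.

```python
import math


def predicted_threshold_db(noise_bw_hz, critical_bw_hz,
                           spectrum_level_db, snr_at_threshold_db=0.0):
    """Probe threshold (dB) predicted by a rectangular auditory filter.

    Only the noise that falls inside the filter contributes to masking,
    so the effective noise bandwidth is capped at the critical bandwidth.
    """
    effective_bw_hz = min(noise_bw_hz, critical_bw_hz)
    noise_power_db = spectrum_level_db + 10.0 * math.log10(effective_bw_hz)
    return noise_power_db + snr_at_threshold_db


# Hypothetical parameters (assumptions, not measured data).
CRITICAL_BW_HZ = 160.0
SPECTRUM_LEVEL_DB = 30.0

for bw in (20, 40, 80, 160, 320, 640):
    thr = predicted_threshold_db(bw, CRITICAL_BW_HZ, SPECTRUM_LEVEL_DB)
    print(f"masker bandwidth {bw:4d} Hz -> predicted threshold {thr:.1f} dB")
```

In this sketch the threshold rises by about 3 dB per doubling of masker bandwidth and then stays constant once the masker is wider than the assumed filter, which is the flattening that defines the critical bandwidth.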

**Figure 2.** Energy detector model where the basilar membrane behaves as if it contains a bank of bandpass filters with overlapping passbands, where each point along the basilar membrane corresponds to a filter with a different center frequency. This Fourier‐transform‐based log filterbank with spectral coefficients distributed on a mel‐scale is often referred to as "FBANK" in audio‐coding applications involving machine learning. In the section on machine hearing research, the computational efficiency of FBANK will be evaluated in deep learning systems.
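As a rough illustration of how filter center frequencies can be distributed on a mel scale, the sketch below uses one widely used mel conversion formula (2595 log10(1 + f/700)). The choice of this particular formula, the function names, and the 8-filter, 0 to 8000 Hz configuration are assumptions for illustration; the chapter does not specify which convention its FBANK features use.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) equally spaced on the mel scale."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Hypothetical configuration: 8 filters between 0 and 8000 Hz.
centers = mel_center_frequencies(num_filters=8, low_hz=0.0, high_hz=8000.0)
print([round(c) for c in centers])
```

The spacing between neighboring center frequencies grows with frequency, which mirrors the overlapping, progressively wider passbands described for the basilar membrane.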

#### *2.1.3. Profile analysis*

According to the critical bandwidth theory, a tone added to noise should be detected by an increase in the energy from a single auditory filter centered at the signal frequency. However, experimental manipulations (e.g., roving-level procedures) that degrade energy cues in tone-in-noise detection tasks have shown no effect on detection thresholds. To explain this level-invariant detection (where single-channel energy cues are severely disrupted), listeners must therefore rely on cues other than the energy in a single spectral region. In "*profile analysis*" [5], this process is described as detecting changes in the overall shape of the spectrum. With across-channel cues, listeners are able to compare the shape or profile of the outputs of different auditory filters to enhance signal detection.
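The contrast between a single-channel energy cue and an across-channel profile cue can be sketched as follows. This is a toy illustration under stated assumptions: the five channel levels, the 3-dB signal increment, and the definition of the profile cue as "signal channel minus the mean of the other channels" are all hypothetical and are not the decision rule used in [5].

```python
def single_channel_cue(channel_levels_db, signal_channel):
    """Energy cue: output level (dB) of the filter at the signal frequency."""
    return channel_levels_db[signal_channel]


def profile_cue(channel_levels_db, signal_channel):
    """Across-channel cue: signal-channel level relative to the other channels."""
    others = [lvl for i, lvl in enumerate(channel_levels_db)
              if i != signal_channel]
    return channel_levels_db[signal_channel] - sum(others) / len(others)


def rove(channel_levels_db, rove_db):
    """Apply the same random overall level change to every channel."""
    return [lvl + rove_db for lvl in channel_levels_db]


# Hypothetical filter outputs in dB (assumptions for illustration).
noise_only = [60.0, 60.0, 60.0, 60.0, 60.0]
tone_plus_noise = [60.0, 60.0, 63.0, 60.0, 60.0]

for rove_db in (-10.0, 0.0, 10.0):
    roved = rove(tone_plus_noise, rove_db)
    print(rove_db,
          single_channel_cue(roved, 2),
          profile_cue(roved, 2))
```

In this sketch the single-channel level of a roved noise-only interval can exceed that of an unroved tone-plus-noise interval, so the energy cue becomes unreliable, whereas the profile cue is unchanged by the rove.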

#### *2.1.4. Temporal critical bands*

Temporal discrimination tasks offer an alternative to spectral critical bands. In a temporal model, the detection cues are thought to be temporal in nature and based on changes in the cadence of neural discharge. These temporal models contrast with energy detection models that assume a rate‐place neural code. Recently, auditory filter bandwidths were measured for a temporal process using an amplitude‐modulation (AM) detection task [6]. The critical bandwidth for a temporal process (referred to as "*temporal critical band*") was observed to be consistently greater than that predicted by the critical band theory. Therefore, these findings decrease confidence in previous estimates of peripheral filtering.

#### *2.1.5. Transition bandwidths*

Discontinuous threshold functions also contradict spectral critical bandwidths by implying that the discrimination tasks evoke different and separate auditory processes. For instance, "*transition bandwidths*" [7] assume that envelope cues dominate at narrow bandwidths, while across-channel level comparisons dominate at wide bandwidths. This concept stresses that there are changes in the underlying process, unlike the constraining boundaries of a solitary process (as hypothesized in critical bandwidth or energy integration theories). For transition bandwidths, the changes to another dominant auditory process are thought to be due to a central mechanism (whereas critical bandwidths are only associated with the periphery). Therefore, transition bandwidths allow for multiple filtering processes to occur.

#### *2.1.6. The volley theory*

The divergence of positions between spectral bandwidths and temporal bandwidths mirrors the long-standing controversy between the *place theory* and the *temporal theory* of pitch perception. The place theory states that the perception of sound depends on where each frequency component produces vibrations along the basilar membrane. The temporal theory states that the perception of sound depends on the temporal patterns of neurons responding to sound in the cochlea. The "*volley theory*" [8] postulates that groups of neurons in the auditory system respond to a sound by firing action potentials that are slightly out of phase with one another, so that their combined output can encode and send a higher frequency of sound to the brain for analysis than any single neuron could, as shown in **Figure 3**. In the next sections, we describe the importance of resolving these theories and assumptions to improve real-world solutions for data *compression* and speech processing. For instance, we will compare the efficiency of systems that use only one filterbank or multiple filterbanks.

**Figure 3.** Temporal properties of nerve firing according to the "volley theory" of temporal coding.
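The volley principle can be illustrated with a toy simulation. The sketch assumes idealized neurons that fire exactly once on every N-th stimulus cycle, with each neuron offset by one cycle from the previous one; the function name and the 2000-Hz, four-neuron configuration are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def volley_spike_times(stimulus_hz, num_neurons, duration_s):
    """Spike trains for neurons that take turns firing on successive cycles.

    Each neuron fires on every num_neurons-th cycle of the stimulus, so no
    single neuron has to follow the full stimulus frequency.
    """
    period_s = 1.0 / stimulus_hz
    num_cycles = int(round(duration_s * stimulus_hz))
    trains = [[] for _ in range(num_neurons)]
    for cycle in range(num_cycles):
        trains[cycle % num_neurons].append(cycle * period_s)
    return trains


# Hypothetical example: a 2000-Hz tone encoded by four neurons for 10 ms.
DURATION_S = 0.01
trains = volley_spike_times(stimulus_hz=2000.0, num_neurons=4,
                            duration_s=DURATION_S)
combined = sorted(t for train in trains for t in train)

print("spikes per neuron:", [len(train) for train in trains])
print("per-neuron rate (Hz):", round(len(trains[0]) / DURATION_S))
print("combined rate (Hz):", round(len(combined) / DURATION_S))
```

Each simulated neuron fires at one quarter of the stimulus frequency, while the pooled spike train contains one spike per stimulus cycle.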

#### **2.2. Filterbanks**

#### *2.2.1. Audio coding and data compression*

Audio coding formats that use lossy data compression take advantage of human auditory masking properties [9]. For instance, the MP3 format hides noises under the signal spectrum based on the masking property that sounds near the threshold of another sound will either be completely masked or reduced in loudness. These auditory masking properties also play critical roles in both speech coding applications and objective quality measures. In the next section, we will cover the impact of auditory masking on the coding of *cochlear implants* (devices that require data compression due to the electroneural bottleneck).

#### *2.2.2. Cochlear implants*

The cochlear implant is a surgically implanted electronic device that restores partial hearing to a person who is deaf or hearing impaired [10–12]. This neural prosthesis provides similar functions of the inner ear by electrically stimulating the auditory nerve. Cochlear implants consist of an external microphone, a speech processor, a transmitter, an internal receiver, and a multielectrode array stimulator. The microphone is placed on the ear and picks up incoming sounds from the environment. The speech processor filters these incoming sounds into different frequency channels and sets the appropriate electrical stimulation parameters. Next, a transmitter coil powers and transmits the processed sound signals through the skin to the internal receiver. Finally, the receiver converts the signals into electric impulses that stimulate an electrode array. Electrode arrays are surgically coiled within the scala tympani of the cochlea so that individual electrode plates can electrically stimulate different regions of the auditory nerve. A sparse electric representation is sufficient for the restoration of hearing.

Cochlear implant performance is satisfactory in quiet settings, but the abnormal perception of electric pitch limits the performance in noise. Typical users can detect only pitch changes greater than 10%, whereas normal-hearing listeners easily detect changes of <1%. Electric pitch is degraded because only a limited number of electrodes (∼22 electrodes) can be inserted into the cochlea versus the >3000 inner hair cell transducers in a normal cochlea. In addition, the current spread from each electrode is uncontrollably broad and large areas of nerves can be unintentionally activated. Spectral mismatches can also occur from degraded nerve survival or inaccurate frequency-to-electrode allocation [11].

Speech recognition in noise becomes especially difficult without the ability to adequately separate components of sound from interfering sources. **Figure 4** shows the cochlear implant coding scheme [12] that was found to greatly improve speech recognition. This sparse representation at the auditory periphery is unique as it presents electric pulse stimulations that feature both (1) temporal envelope information and (2) place information. Section 3.2.3 will discuss how this coding is simulated as temporal envelope bank (TBANK) features in deep learning systems [13].

**Figure 4.** Temporal properties of a cochlear implant processor. In machine learning, temporal features can be derived from extracted temporal envelope bank (referred to as "TBANK").
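To make the idea of a temporal envelope concrete, the following sketch extracts the envelope of a single channel by full-wave rectification followed by moving-average smoothing. This is a generic textbook approach used here as an assumption; it is not the processing of any specific implant processor, nor the TBANK feature extraction of [13]. The function name, the 4-ms window, and the test signal are hypothetical.

```python
import math


def temporal_envelope(signal, sample_rate_hz, window_ms=4.0):
    """Full-wave rectify the signal, then smooth it with a moving average."""
    win = max(1, int(round(sample_rate_hz * window_ms / 1000.0)))
    rectified = [abs(x) for x in signal]
    envelope = []
    running = 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= win:
            running -= rectified[i - win]  # drop the sample leaving the window
        envelope.append(running / min(i + 1, win))
    return envelope


# Hypothetical channel signal: a 1000-Hz carrier whose amplitude drops
# from 1.0 to 0.2 halfway through a 100-ms segment.
SAMPLE_RATE_HZ = 16000
NUM_SAMPLES = 1600
signal = []
for n in range(NUM_SAMPLES):
    amplitude = 1.0 if n < NUM_SAMPLES // 2 else 0.2
    signal.append(amplitude * math.sin(2.0 * math.pi * 1000.0 * n / SAMPLE_RATE_HZ))

envelope = temporal_envelope(signal, SAMPLE_RATE_HZ)
print(round(envelope[400], 3), round(envelope[1200], 3))
```

The smoothed output follows the slow amplitude change of the channel while discarding the fine structure of the carrier; in an implant-style scheme, one such envelope per frequency channel would set the pulse amplitude on the corresponding electrode.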

#### **2.3. Auditory masking in cochlear implants**

To optimize current settings, psychoacoustic experiments were designed to investigate how the human auditory system processes complex sound interactions from electric stimulations. Specifically, auditory masking was investigated using electric stimulations as the probe or masker to understand how electric stimulations separate into individual sound sources. The diversified subject population, with different types of hearing loss and electric configurations, also provides alternative testing paradigms to reevaluate previous masking results obtained from normal-hearing subjects. By measuring electric stimulations, the research field can gain new insight into the interactions of peripheral and central auditory systems. In this section, we will review previous comparisons of ipsilateral *electric-on-electric* masking, *electric-on-acoustic* masking, and also contralateral *electric-on-electric* masking. We will then compare auditory masking curves in cochlear implants with the recently proposed concepts of profile analysis, temporal critical band, and transition bandwidths in normal hearing.

#### *2.3.1. Comparison of electric*‐*on*‐*electric masking*

Similar to the observations in normal hearing, electric masking studies [14] have shown that the amount of forward masking increases as the spatial separation between the probe electrodes and the electric pulses from adjacent maskers decreases. Both the amount of masking and the spread of neural excitation increase with electric masker levels. Cochlear implant excitation patterns were also shown to have a spatial bandpass characteristic with a peak in the region of the masked electrode [15].

#### *2.3.2. Comparison of electric*‐*on*‐*acoustic masking*

Electric-on-acoustic masking can also be measured for cochlear implant users who have preserved residual acoustic hearing following implantation (**Figure 5**). In a unilateral cochlear implant user with functional hearing preserved in the implanted ear, electric stimulations were observed to interact in the peripheral and central auditory system [16]. The masking growth function in **Figure 5** shows that the detection thresholds of an electrode increased when the level of a 125-Hz acoustic masker increased from 90 to 110 dB. The 250-Hz acoustic masker also elevated electric detection in a similar manner. This data is consistent with the central theory of auditory masking and even provides new supporting evidence, since the acoustic stimulations had to have been confined to the functional hair cells or nerves (as there is no known mechanism by which acoustic stimulations could have directly activated a nerve fiber).

**Figure 5.** Ipsilateral masking data from [16]. The left panel shows a schematic representation of the acoustic (A) masking and the electric (E) masking mechanisms in hybrid hearing. The right panel shows the masking of an electrode probe by acoustic maskers at 125 or 250 Hz.

#### *2.3.3. Comparison of contralateral electric-on-electric masking*

Contralateral electric-on-electric masking can also be measured in bilateral cochlear implant users [17]. **Figure 6** shows the complete set of central masking data, with threshold elevation normalized so that each function peaks at 1. Each of the bilateral subjects was tested twice, alternating the ear used as the masker or probe (*n* = 14). As shown in **Figure 6**, the contralateral masking electrodes elevated the detection thresholds in both the left and the right ears. The threshold elevation peaks generally occurred between interaural pairings sharing the same electrode number (which corresponds to electrodes with similar insertion depths across ears).

**Figure 6.** Central masking data measured and replotted for seven bilateral cochlear implant subjects from [17]. The curves are sorted into panels according to the location of the probe electrode number (black contacts) in either the (A) fixed left ear or (B) fixed right ear. Thin lines show the individual central masking curves and the thick lines show the mean data for each fixed probe electrode location. All curves are normalized so that the peak threshold is equal to 1.

**Figure 7** presents the same data to show the growth of masking as a function of the masker-probe electrode separation across ears. Masker-probe separation is calculated as the difference between the masker and probe electrode numbers. Place-matched masking conditions are categorized as "0" since the masker and probe electrode numbers were identical across ears. When categorized in this manner, **Figure 7** shows that the amount of central masking diminished with masker-probe electrode separation. In [17], this data was reorganized to analyze the growth of masking. A two-way repeated-measures analysis of variance (ANOVA) showed a significant main effect of masker-probe electrode separation and threshold elevation [*F*(2.122, 27.581) = 3.667, *p* = 0.036]. There was also a significant main effect of masker-probe electrode separation, ear used as the probe electrode, and threshold elevation [*F*(2.563, 33.323) = 9.472, *p* < 0.001]. The masking growth pattern for each ear was also fitted with exponential equations and displayed similar spatial constants and significant *R*<sup>2</sup> values (*R*<sup>2</sup> > 0.97). The results demonstrate that the amount of central masking diminished with masker-probe electrode separation at similar rates on both sides.

**Figure 7.** Threshold elevation as a function of the interaural electrode offset between the masker and the probe (*n* = 14). "Apical" refers to all masking conditions where the masker was apically positioned from the probe, whereas "basal" refers to all conditions where the masker was basally positioned from the probe. The 1st and 2nd implanted ears refer to the ears on each side of a participating subject that were sequentially implanted in two separate surgeries.
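The separation measure and the exponential description of masking decay can be sketched as follows. The fitting method (ordinary least squares on log-transformed elevations), the sign convention for the separation, the function names, and the data are all assumptions for illustration; the values below are synthetic and are not the measurements reported in [17].

```python
import math


def electrode_separation(masker_electrode, probe_electrode):
    """Interaural offset in electrode numbers; 0 means place-matched."""
    return masker_electrode - probe_electrode


def fit_exponential_decay(separations, elevations):
    """Fit elevation = a * exp(-|d| / space_constant) by log-linear regression.

    Returns (a, space_constant), with space_constant in electrode units.
    """
    xs = [abs(d) for d in separations]
    ys = [math.log(e) for e in elevations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope


# Synthetic example: probe on electrode 10, maskers on electrodes 10 to 16,
# with normalized threshold elevation decaying from a peak of 1.
PROBE = 10
separations = [electrode_separation(m, PROBE) for m in range(10, 17)]
elevations = [math.exp(-abs(d) / 4.0) for d in separations]

peak, space_constant = fit_exponential_decay(separations, elevations)
print(round(peak, 3), round(space_constant, 3))
```

Fitting each ear separately and comparing the resulting space constants is one way to check whether masking diminishes at similar rates on both sides, as the text above reports.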

**Figures 6** and **7** show that bilateral cochlear implant stimulation is contralaterally masked in a place-dependent manner. For electric-on-electric signals, the average thresholds peaked when the positions of the masker and probe electrodes were place-matched across ears and diminished with electrode separation. These place-dependent findings have also been reconfirmed in a recent work [18] using different electrode arrays and testing apparatus. However, both studies [17, 18] directly counter a previous conclusion in [19] that central masking with bilateral users gives rise to increased threshold, but not in a place-dependent manner (as is the case for contralateral masking in normal hearing). This previously accepted hypothesis in [19] was most likely concluded from data that was obscured by a limited test population (*n* = 2), electrical malfunctions (arrays with multiple electrical shorts), subjects reporting discomfort (uncomfortably high-pitched sensations), and electrodes that were inserted with primitive surgical techniques (offsets in electrode place accuracy of 6–9 electrodes).

#### *2.3.4. Comparison of masking in cochlear implants and normal hearing*

It will be important to optimize the configurations of cochlear implant stimulation, as it has been reported that electrical pulse trains and acoustic sine waves do not fuse or merge well into a single percept [17, 20]. The presence of electric masking has indicated regions where electric and acoustic signals share similar frequencies, as in normal acoustic masking. However, the data in **Figure 6** show a large amount of individual variability, as many of the bilateral participants displayed unmatched masking patterns across ears. This individual variability could be the result of several factors. First, cochlear implant users are likely to have irregular patterns of auditory nerve survival, which could cause the current to stimulate auditory fibers that are too apical or too basal from the intended location. "Dead regions" in auditory nerves may also prevent the adjacent masking electrodes from stimulating distinct place-frequencies. Second, different surgical procedures could result in variability between the insertion depths of a user's electrode array. This variability could be significant since neural activation patterns depend on the density of neurons in a particular region of the cochlea and on the radial distance between the electrode array and neural targets in the modiolus.

In general, cochlear implant subjects have exhibited electric-masking patterns that are much broader compared to what has been observed in normal hearing [14–20]. A reduction in the magnitude of contralateral versus ipsilateral masking functions was observed in [18], but this reduction was not as great as observed in normal hearing. Together, these findings of broader masking with cochlear implants both support and are supported by the concepts of profile analysis [5], temporal critical bands [6], and transition bandwidths [7].

These findings all decrease confidence in previous estimates of peripheral filtering as well as the assumptions made in *critical band theory*, especially when combined with recent evidence in normal-hearing listeners that suggests a flexible selection of spectral regions upon which to base across-frequency comparisons [21]. Furthermore, the wide bandwidths observed in the initial filters of the cochlear implant subjects directly contradict the theory that extraction of envelope information should be constrained to a single auditory filter, as theorized in [22]. For these reasons, *transition bandwidths* [7, 23] are the most plausible solution as they explain patterns that were observed in both normal hearing and electric hearing experiments (as this concept allows an interplay to occur between temporal or spectral processes). In the section on machine hearing research, a novel method based on "*deep learning*" is used to demonstrate the computational efficiency of transition bandwidths in artificial neural network systems.
