#### *4.1.3 Study description: exp-stitt*

Stitt et al. [10], a 2019 study on accommodation to non-individual HRTF. 16 participants trained on auditory localisation, each completing 10 sessions of 12 min each over a span of 10–20 weeks. The worst-match HRTF selection, training game, stimulus, and audio source positions tested during the localisation task evaluation at the end of each training session were the same as those of **exp-parseihian**. Participants were divided into 2 groups: 4 training with individual HRTFs (**grp-stitt-indiv**) and 8 with worst-match HRTFs (**grp-stitt-worst**). An additional 8 participants, who trained for only 4 sessions with their worst-match HRTFs, are not considered in the current analysis.

#### *4.1.4 Study description: exp-steadman*

Steadman et al. [15], a 2019 study on accommodation to non-individual HRTF. 27 participants trained on auditory localisation, each completing 9 sessions of 12 min each over a span of 3 d. A localisation task evaluation was conducted at the beginning and end of each day, as well as between each training session on the first day, testing participants on 12 positions distributed on a sphere using a 1.6 s stimulus merging bursts of white noise and speech signal. All participants trained with the same randomly-matched HRTF selected from the 7 LISTEN database HRTFs of **exp-parseihian**. Participants were distributed into 3 groups, training on various gamified and interactive versions of an audio localisation game, aggregated as one group in the current analysis (**grp-steadman-random**). An additional 9 participants, acting as a control group not undertaking training, are not considered in the current analysis, nor are the results of a parallel evaluation task performed using an HRTF other than the one used during training.

#### *4.1.5 Study description: exp-poirier*

Poirier-Quinot and Katz [9], a 2021 study on accommodation to non-individual HRTF. 12 participants trained on auditory localisation (**grp-poirier-best**), each completing 3 sessions of 12 min each over a span of 3–5 d. Participants trained using a best-match HRTF selected from the 7 LISTEN database HRTFs of **exp-parseihian**, though the simplified subjective selection method was only concerned with identifying the best-match HRTF. An additional 12 participants, who trained with their best-match HRTF in a reverberant condition, are not considered in the current analysis. Each session consisted of an interactive audio localisation game followed by a localisation task evaluation testing 20 positions distributed on a sphere, using the same stimulus as in **exp-parseihian**.

#### **4.2 Application of the methodology**

#### *4.2.1 Time alignment of evaluation sessions*

In all these experiments, the training sessions lasted 12 min, except in **exp-majdak**, where both training and evaluation were performed in a single block of 20–30 min. According to **exp-majdak**, the evaluation itself took half that time, leaving a per-session training duration equivalent to that of the other studies. A time realignment across experiments was performed such that the evaluation sessions compared are separated by equivalent training durations. Sessions have thus been renumbered to account for differences in protocol.

*HRTF Performance Evaluation: Methodology and Metrics for Localisation Accuracy… DOI: http://dx.doi.org/10.5772/intechopen.104931*

In the analysis, evaluation sessions are numbered from 1 to 11, each separated by 12 min of training. **Exp-poirier** and **exp-parseihian** only comprised 3 training sessions, hence the missing data-points in subsequent figures. Likewise, **exp-stitt** and **exp-majdak** did not report pre-training performances, hence the missing session 1 data-points. Finally, the evaluations in **exp-steadman** spread out from session 4 onward, switching from an evaluation session after each training to an evaluation at the beginning and end of each 3-session training day.
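This renumbering can be sketched as a mapping from cumulative training time to a common session index (a hypothetical helper for illustration, not code from the studies):

```python
def align_session(cumulative_training_min: float, block_min: float = 12.0) -> int:
    """Common session index: 1 = pre-training, +1 per 12 min training block."""
    return 1 + round(cumulative_training_min / block_min)

# exp-steadman day 1: evaluations after each of the first three trainings
day1_sessions = [align_session(t) for t in (12, 24, 36)]   # sessions 2, 3, 4
```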

#### *4.2.2 Evaluation task characterisation*

The space coverage of target positions evaluated during the localisation task of each study is reported in **Figure 5**. The high density of the **exp-majdak** grid results in a very low average *scangle* compared to those of the other experiments. Its comparatively high standard deviation is due to the absence of test positions in the bottom part of the sphere (polar gap). For comparison, a homogeneous grid with the same number of points would have yielded *scangle* = 0.5° ± 0.003. Distribution homogeneity is also responsible for the lower *scangle* standard deviation observed in **exp-poirier** compared to that of **exp-parseihian** and **exp-stitt**. Finally, **exp-steadman**, with fewer test points and a polar gap in the bottom hemisphere, has the highest *scangle* value and standard deviation.
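For illustration, such coverage statistics can be approximated numerically; the sketch below assumes *scangle* denotes the great-circle angle from densely sampled directions on the sphere to their nearest test position (helper names are illustrative; the authoritative metric definition is the one given earlier in the chapter):

```python
import numpy as np

def fibonacci_sphere(n):
    """Quasi-homogeneous grid of n unit vectors (Fibonacci lattice)."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i            # golden-angle increments
    z = 1.0 - 2.0 * (i + 0.5) / n                     # uniform in z
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def coverage_angle_stats(grid, n_probe=20000):
    """Mean and std (deg) of the angle from probe directions to the
    nearest grid point; a polar gap inflates both values."""
    probes = fibonacci_sphere(n_probe)
    cos_ang = np.clip(probes @ grid.T, -1.0, 1.0)
    nearest = np.degrees(np.arccos(cos_ang)).min(axis=1)
    return nearest.mean(), nearest.std()

mean_deg, std_deg = coverage_angle_stats(fibonacci_sphere(50))
```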

As could be expected, all the grids present high *scshape* values, being overall evenly distributed on the sphere. Grid density around polar gaps impacts the metric, explaining why the **exp-poirier** value is higher than that of **exp-majdak** while both grids are evenly distributed: removing polar gap contributions from these grids would yield *scshape* values of 0.91 and 0.84 respectively.

Two different reporting methods were used in the five studies: head pointing (**exp-majdak** and **exp-steadman**) and hand pointing (**exp-majdak**, **exp-parseihian**, **exp-steadman**, **exp-poirier**). This should however have little to no impact on the comparative analysis, as both methods lead to similar reporting biases [32]. **Exp-parseihian**, **exp-stitt**, and **exp-poirier** used the same stimulus: a 180 ms sequence of three white noise bursts. **Exp-majdak** used a slightly longer, unique burst of 500 ms, and **exp-steadman** used a 1.6 s stimulus composed of both white noise bursts and speech signal. All these stimuli are likely to present the transient energy and broad frequency content necessary for auditory space discrimination [43, 44]. The difference in stimulus duration may have repercussions in the analysis, as participants can initiate more head movements to facilitate auditory localisation during the presentation of longer stimuli [45]. While adaptive rendering (*i.e.* dynamic cues) was disabled during stimulus presentation in **exp-parseihian**, **exp-stitt**, and **exp-poirier**, this is not explicitly stated in **exp-majdak** and **exp-steadman**.

**Figure 5.**

*Space coverage statistics of the evaluation task in the selected studies (a) majdak, (b) parseihian/stitt, (c) poirier and (d) steadman.*

#### *4.2.3 Assessing the global extent of localisation error*

The evolution of great-circle angle error across studies and training sessions is reported in **Figure 6**. Besides the clear benefit of training observed in all studies, the metric also highlights the overall positive impact of HRTF quality on initial performance. Interestingly, while the results from **exp-parseihian** suggest a similar intra-HRTF quality/performance relationship, they report larger great-circle angle errors than those of the other experiments. This point already illustrates how differences in evaluation protocols or inter-participant variations may complicate the comparison of results across studies, as discussed in Section 4.3.

#### *4.2.4 Assessing the critical localisation confusions*

Much like the great-circle error, precision confusion rates can be used to assess performance evolution during training, as illustrated in **Figure 7**. Trends observed in initial precision rates and their evolution mirror the observations made in the great-circle error analysis. Precision rates and great-circle angle values are indeed highly correlated across training sessions, with correlation coefficients in [−1.0, −0.9] for all studies. As each confusion rate aggregates all the responses of a participant during an evaluation session, however, its CI is by construction often wide enough to muddle the analysis compared to that based on great-circle errors.
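This kind of correlation check is a one-liner; the session means below are illustrative placeholders rather than data from the studies:

```python
import numpy as np

# Hypothetical per-session group means: great-circle error (deg) falls
# while the precision rate (%) rises, so the two are anti-correlated.
gc_error  = np.array([28.0, 22.0, 19.5, 18.0, 17.2])
precision = np.array([45.0, 54.0, 58.0, 61.0, 62.5])

r = np.corrcoef(gc_error, precision)[0, 1]
```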

This widening of the CIs is particularly apparent in the comparison of the other confusion rates, reported in **Figure 8** for the evaluation that took place after the first training session. While a trend indeed suggests that the number of confusions increases with decreasing HRTF quality, overlapping CIs often prevent any definite conclusion. Observing these rates can still help inform the analysis: the poor performance of **grp-parse-indiv** on great-circle error observed in the previous section can be partly attributed to their high in-cone confusion rates, while their off-cone confusion rate is on par with that of **grp-stitt-indiv** and **grp-majdak-indiv**.

Maybe the most interesting use of confusion rates is to decompose the overall performance evolution. As illustrated by its confusion rate evolution in **Figure 9**, the **grp-stitt-worst** performance evolution observed in **Figure 6** should, confusion-wise, mainly be attributed to improvements in front-back confusions during training.

**Figure 6.**

*Great-circle error mean and CI evolution across sessions and experiments. The great-circle error value for random responses is 90° for all experiments.*


**Figure 7.**

*Precision confusion rates mean and CI evolution across sessions and experiments. Grp-parse-indiv, composed of only 2 participants, was removed from the figure, as its CI was so large it obscured the whole plot.*

**Figure 8.** *Confusion rates after the first training session across experiments.*

**Figure 9.** *Confusion rates mean and CI evolution across sessions for grp-stitt-worst.*

#### *4.2.5 Assessing the local extent of localisation error*

Results of the confusion classification indicate that, across experiments, roughly 50% of responses were within the vicinity of the target (precision errors) after the first training session. The analysis here focuses on these responses, assessing local accuracy issues to complement the analysis of localisation confusions.

**Figure 10** reports local great-circle errors across training sessions and experiments. Looking once more at **grp-stitt-worst**, their local accuracy did not improve during training, oscillating around 25°. The improvement seen in overall great-circle error for that group can therefore be solely attributed to the reduction in front-back confusions reported in the previous section. Likewise, the 10° improvement in overall great-circle error observed for **grp-parse-worst** between sessions 2 and 3 can be attributed to a reduction in confusion rates, as it does not appear in the local great-circle error. Separating the contribution of confusions from that of local accuracy also reveals a significant difference between the **grp-stitt-indiv** and **grp-majdak-indiv** improvements in local great-circle error between sessions 2 and 6, not visible in the global great-circle error.

**Figure 10.** *Local great-circle error mean and CI evolution across sessions and experiments.*

#### *4.2.6 Horizontal and vertical decomposition of the localisation error*

Local lateral error evolution across sessions for all experiments is reported in **Figure 11a**. As expected, initial performances indicate that participants using individual HRTFs were quite apt at lateral localisation, accustomed as they were to the presented ITD and ILD cues. **Exp-poirier**, **exp-stitt**, and **exp-parseihian** used a similar ITD adjustment scheme, slightly improved in its last iteration for **exp-poirier** compared to that of **exp-stitt**, itself an improvement on that of **exp-parseihian**. As such, the progression of initial lateral errors between **grp-parse-worst**, **grp-stitt-worst**, and **grp-poirier-best** can be expected. The performance of **grp-steadman-random**, on par with that of participants using ITD-adjusted or individual HRTFs, could be attributed either to the small number of evaluation positions (similar to that used during training), or to the 1.6 s burst and voice stimulus used, as compared to the 180 ms to 500 ms burst trains used in the other experiments.
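The lateral/polar decomposition used here follows the usual interaural-polar convention; a minimal sketch under assumed axis conventions (+x front, +y toward the left ear, +z up; helper names are illustrative):

```python
import numpy as np

def interaural_coords(x, y, z):
    """Lateral/polar angles (deg) of a unit vector: lateral is the angle
    from the median plane toward the interaural poles; polar is the angle
    around the interaural axis (0° front, 90° up, +/-180° back)."""
    lateral = np.degrees(np.arcsin(np.clip(y, -1.0, 1.0)))
    polar = np.degrees(np.arctan2(z, x))
    return lateral, polar

def local_lateral_error(target_xyz, response_xyz):
    """Unsigned lateral component of a (non-confused) localisation error."""
    lat_t, _ = interaural_coords(*target_xyz)
    lat_r, _ = interaural_coords(*response_xyz)
    return abs(lat_r - lat_t)
```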

Participants trained with individual HRTFs did not improve much on local lateral error overall, starting at ≈11° after the first training session and only improving to ≈9° after the last. Comparison of the performance evolution of groups training with a worst-match HRTF (**grp-parse-worst** and **grp-stitt-worst**) against that of groups training with a best-match HRTF (**grp-parse-best** and **grp-poirier-best**) suggests a positive impact of HRTF quality on potential local lateral error improvement. It would also seem that the ITD adjustment applied in **exp-parseihian** and **exp-stitt** was not sufficient to compensate for poor HRTF quality regarding lateral localisation accuracy.

Focusing on local lateral compression evolution, **Figure 11b** reveals a systematic over-estimation of the lateral angle across experiments, *i.e.* participants overall reported targets closer to the inter-aural axis poles than they truly were. Analysis of session 2, after the first training session, indicates that 62% of the 73 participants presented an overall lateral compression of less than 5°, against only 4% presenting one above 5°.

Local polar error evolution across sessions for all experiments is reported in **Figure 12a**. Overall performance was still a function of HRTF quality, except for the poor performance of **grp-parse-indiv** prior to training and for **grp-steadman-random**, on par with the **exp-stitt** and **exp-majdak** control groups using individual HRTFs. The impact of training is hardly more pronounced than that observed on local lateral error. Training still helped lower local polar error overall, with even participants using individual HRTFs slightly improving during training: **grp-stitt-indiv** and **grp-majdak-indiv** gained ≈3° in local polar accuracy over the course of training, roughly identical to the improvement observed in local lateral accuracy. Note here that an analysis based on the overall polar error, *i.e.* taking confusions into account, would have suggested a ≈12° improvement after training for these two groups. Finally, most of the improvement in local polar error occurred during the early stage of training, decreasing by ≈7° between sessions 1 and 2 on average over all experiments (not considering **exp-stitt** and **exp-majdak**, as participants were not tested prior to training), and by only ≈7° between sessions 2 and 4.

**Figure 11.** *(a) local lateral error, and (b) local lateral compression evolution across sessions and experiments.*

The analysis of local elevation compression also reveals a stronger tendency to under-estimate target elevation, *i.e.* responses closer to the horizontal plane than the true target, than that observed for local lateral compression. Across experiments, 38% of the 73 participants presented a local elevation compression of more than 5° after the first training session, compared to 14% for elevation dilation. A trend suggests that local elevation compression is quickly corrected during the first training session and remains at a relatively constant value regardless of the method or number of training sessions. The surprisingly high plateau reached by **grp-majdak-indiv** compared to **grp-stitt-indiv**, also training on individual HRTFs, could be attributed to the difference in tested grid positions: **exp-majdak** presented far more targets near the 90° elevation pole than **exp-stitt**.

#### *4.2.7 Decomposing the analysis across sphere regions*

This section illustrates how splitting results analysis across sphere regions might highlight spatial imbalances in performance. To avoid further cluttering the chapter, only two example decompositions will be presented: confusion rates based on sphere regions, and local great-circle error based on individual target locations.

Decomposition of confusion rates based on the regions defined in Section 3.1.6 is illustrated in **Figure 13**. Results displayed are aggregated over all five studies, to focus the analysis on general binaural localisation behaviours. The first noticeable result is that targets in the front-down region were initially the most susceptible to front-back and in-cone confusions, resulting in a very low precision rate (30% vs. 47% and more for the other regions) prior to the first training session. Interestingly, confusion rates in the front-down region were systematically higher than those in the front-up region, for all but off-cone confusions. The initial rate of front-back confusions for targets in front of participants, more than twice that for targets behind them, is likely due to the absence of visual feedback during the localisation task, increasing the likelihood of participants perceiving a sound as coming from behind when they cannot see its source, regardless of HRTF cues.

**Figure 12.** *Participants (a) local polar error, and (b) local elevation compression across training sessions and experiments.*

**Figure 13.** *Evolution of confusion rates across sessions, decomposed based on sphere regions, aggregated over all experiments.*
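A per-region aggregation of this kind can be sketched as follows (the region boundaries used here, front/back split at ±90° azimuth and up/down at 0° elevation, are an assumption standing in for the Section 3.1.6 definitions):

```python
def sphere_region(azim_deg, elev_deg):
    """Coarse front/back x up/down label for a target direction."""
    azim = ((azim_deg + 180.0) % 360.0) - 180.0   # wrap to [-180, 180)
    fb = "front" if abs(azim) <= 90.0 else "back"
    ud = "up" if elev_deg >= 0.0 else "down"
    return f"{fb}-{ud}"

def per_region_rates(targets, confused):
    """Confusion rate per region, given (azim, elev) targets and a
    parallel list of booleans flagging confused responses."""
    counts = {}
    for (az, el), c in zip(targets, confused):
        n, k = counts.get(sphere_region(az, el), (0, 0))
        counts[sphere_region(az, el)] = (n + 1, k + int(c))
    return {region: k / n for region, (n, k) in counts.items()}
```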

A second interesting result is the negligible evolution of front-back confusions for targets in the back regions throughout training (*i.e.* back-to-front confusions). While the precision rate of all regions increased, and front-back confusions dropped for the front regions, training seemed to have no impact on front-back rates in the back regions. Analysis of per-region accuracy however revealed that the local great-circle error decreased evenly across regions, from ≈25° in session 1 to ≈21° in session 11.


**Figure 14.**

*Evolution of mean response locations across targets and sessions in exp-poirier. Hollow circles represent target positions. Filled circles represent mean response locations, surrounded by standard error ellipses computed using Kent distributions.*

These observations suggest that future training programs could be improved by focusing slightly more on reducing front-back and in-cone confusions in the front-down region. Stagnating rates, such as that of front-back confusions in the back-up region, around 15% across sessions, also suggest that there is room for improvement in the design of didactic training programs that would aid participants towards reaching 0% confusion rates.

Further refining the analysis, **Figure 14** focuses on the assessment of mean response locations for each target presented in **exp-poirier**. Mean response locations were obtained by summing local great-circle error vectors as discussed in Section 3.2.6. Their positions relative to targets, and the evolution of these positions during training, provide a thorough characterisation of participants' local accuracy evolution on the sphere. Additionally, the lateral and elevation compression effects observed in Section 4.2.6 are clearly visible, with mean responses generally biased towards the interaural axis poles and/or the horizontal plane.
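As a simplified stand-in for that vector summation, the mean response direction for a target can be computed as the normalised sum of unit response vectors (the Kent-distribution dispersion ellipses are beyond this sketch):

```python
import numpy as np

def mean_response_direction(responses_xyz):
    """Spherical mean: normalised sum of unit response vectors."""
    s = np.asarray(responses_xyz, dtype=float).sum(axis=0)
    return s / np.linalg.norm(s)

# three responses scattered around the +x direction
mean_dir = mean_response_direction([[1.0, 0.0, 0.0],
                                    [0.96, 0.28, 0.0],
                                    [0.96, -0.28, 0.0]])
```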

#### *4.2.8 Handling initial performance offsets*

This additional step in the analysis can be seen as an extension of the evaluation task characterisation proposed in Section 4.2.2 specific to the assessment of localisation performance *evolution*. It presents some of the techniques that exist to compare said evolution despite unbalanced initial conditions across studies or groups of participants.

Techniques have been proposed to conduct training efficiency analysis on unbalanced initial conditions. Stitt et al. [10], for example, applied a per-participant arithmetic normalisation based on group baseline performances. By realigning initial conditions, this technique allows the analysis to focus on relative improvement, as illustrated in **Figure 15**.
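The arithmetic normalisation amounts to subtracting each group's baseline session mean; a minimal sketch with illustrative numbers (not the studies' data):

```python
import numpy as np

# Hypothetical group means (deg) for sessions 2 through 5
group_a = np.array([24.0, 21.0, 19.0, 18.5])
group_b = np.array([31.0, 26.0, 23.0, 22.0])

# Subtract each group's session-2 baseline: both curves now start at 0°,
# so only the relative improvement remains.
norm_a = group_a - group_a[0]
norm_b = group_b - group_b[0]
```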

Another technique for relative improvement comparison, used for example by Majdak et al. [31] and Poirier-Quinot and Katz [9], is to compare the coefficients of a regression applied to the performance evolution. As mentioned in Section 2.3.2, two main regression models have been adopted to fit said evolution, depending on the training stages represented in the data. **Figure 16** illustrates how both can be fitted to local great-circle error evolution across experiments. Group performance evolution was first fitted to the exponential form in **Figure 16a**, resorting to the linear form in **Figure 16b** when the evolution did not follow an exponential form, *i.e.* when the exponential fit resulted in regression parameter CIs so wide as to prevent any meaningful interpretation. The use of a regression is particularly attractive, as it reduces the performance evolution analysis to a simple high-level coefficient comparison, with coefficients that can usually be interpreted in simple terms such as initial performance or improvement rate.

**Figure 15.** *Great-circle error evolution across sessions and experiments. Data normalised (subtraction) with group mean results of session 2 as reference.*

**Figure 16.**

*Regressions on local great-circle error evolution across training and experiments: (a) exponential regression y<sub>0</sub> · exp(−sessionID/τ) + c, and (b) linear regression a · sessionID + b. y<sub>0</sub> represents the initial performance, τ the improvement time constant, and c the long-term performance; b represents the initial performance, a the improvement rate.*
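Both regression forms can be fitted in a few lines; the sketch below uses a numpy-only grid search over τ in place of a dedicated optimiser such as `scipy.optimize.curve_fit` (variable names are illustrative):

```python
import numpy as np

def fit_exponential(sessions, errors):
    """Least-squares fit of y0 * exp(-session/tau) + c, scanning a coarse
    grid on tau and solving y0, c linearly for each candidate."""
    best = None
    for tau in np.linspace(0.5, 10.0, 96):
        A = np.stack([np.exp(-sessions / tau), np.ones_like(sessions)], axis=1)
        coef, *_ = np.linalg.lstsq(A, errors, rcond=None)
        sse = float(np.sum((A @ coef - errors) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], tau, coef[1])
    return best[1], best[2], best[3]          # y0, tau, c

sessions = np.arange(1, 12, dtype=float)
errors = 20.0 * np.exp(-sessions / 2.0) + 5.0     # synthetic, noiseless curve
y0, tau, c = fit_exponential(sessions, errors)

a, b = np.polyfit(sessions, errors, 1)            # linear fallback form
```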

As mentioned, these techniques are generally applied to compensate for unbalanced initial performance. Although they are perfectly valid to assess the impact of HRTF quality or training efficiency on *relative* improvement, the scope of any conclusion made using them is greatly limited as the potential improvement margin naturally depends on initial performance.


#### **4.3 Discussion**

As illustrated throughout Section 4.2, drawing clear-cut conclusions from the comparison of results from several studies is difficult at best. Most of the time, it is simply impossible, generally because of uncontrolled variations across test conditions. These variations, limiting both intra- and inter-study analysis, are discussed in this section.

#### *4.3.1 Evaluation task*

Variations in the evaluation protocols and procedures between studies in the literature present a challenge when comparing the multiple experiments. Different experimental design choices, such as the reporting method, the spectral content and duration of the stimulus, and the evaluation grid, have a direct impact on the baseline performance of participants [32]. For example, given the choice by **exp-steadman** to use a random-match HRTF, the notable results of **grp-steadman-random** compared to those of the other groups could be attributed to the training program. However, the 1.6 s stimulus (which may have enabled the use of head movements during the evaluation) may also have contributed to the improved performance of **grp-steadman-random** compared to the other studies, which used 180 or 500 ms bursts [46].

The use of a unique grid for localisation tasks across studies would assuredly simplify results comparisons. Said grid could, for example, be designed to be homogeneously distributed on the sphere [35]. For more flexible test conditions, a series of test grids of increasing point densities could be defined, where test positions of any given grid would be present on its higher density neighbours, easing down-sampling for comparison. Regarding the stimulus used or the reporting method, a simple solution would be to settle on those that respectively optimise localisation accuracy [47] and minimise reporting bias [32]. Pending the adoption of common practices, the bias induced by those design choices could technically be assessed from the results of a control group using individual HRTFs.

Another issue when comparing performance evolution across studies is the alignment of the evaluation sessions for fair comparison. As proposed in Section 4.2.1, a simple solution is to align them based on training duration. Time alignment would seem a better option than its alternative, based on the number of positions presented during the training. Time is of direct interest for end-users, and an alignment based on presented positions would bias the analysis in favour of slower exploratory training paradigms.

Finally, the merging of evaluation and training sessions, as used in **exp-majdak**, is not ideal in the context of inter-study comparison. Although this practice allows for a more granular analysis of performance evolution, it systematically complicates the analysis compared to studies alternating between training and evaluation sessions. Additionally, it would seem that the alternating design imposes a lesser constraint on the training paradigm itself, allowing for implicit learning strategies not focused on target localisation [48].

#### *4.3.2 Intra- and inter-participant variations*

Variation between participants' performances is an issue common to most psychophysical studies. Two aspects of these variations can become critical in the context of HRTF learning studies.

The first aspect concerns imbalances in initial participant performance across tested conditions. As discussed in Section 4.2.8, such imbalance is likely to weaken or void conclusions resulting from the analysis. For within-experiment comparisons, a simple solution is to run a pre-training evaluation session, and then create groups of equivalent performance based on the metrics used in the analysis. The problem naturally worsens when dealing with inter-study analysis. The use of a control group using individual HRTFs is again advised, to serve as a baseline reference for the comparative analysis.
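One simple way to realise this balancing, sketched under assumed inputs: sort participants by their pre-training metric and assign them to groups in a snake draft, so that group baselines end up approximately equal:

```python
def balanced_groups(baselines, n_groups=2):
    """Snake-draft assignment of participant indices into n_groups
    with approximately equal baseline performance."""
    order = sorted(range(len(baselines)), key=lambda i: baselines[i])
    groups = [[] for _ in range(n_groups)]
    for rank, idx in enumerate(order):
        lap, pos = divmod(rank, n_groups)
        g = pos if lap % 2 == 0 else n_groups - 1 - pos
        groups[g].append(idx)
    return groups

# four participants with pre-training great-circle errors (deg)
groups = balanced_groups([30.0, 10.0, 20.0, 40.0])
```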

The second aspect concerns the difference in participants' immediate sensitivity to HRTF quality, and their ability to adapt to a non-individual HRTF. Both have been discussed in previous studies, where some participants were more prone to instantly benefit from a best-match HRTF [49] or to adapt to a poorly matched HRTF [10]. To avoid missing out on interesting behaviours due to the variance introduced by some participants, it is recommended to conduct a second pass of the analysis on sub-groups, for example aggregated based on their improvement rate [10]. Although the conclusions from the sub-group analysis may be weaker compared to an overall analysis, the technique provides readers with a more thorough understanding of the training as well as the potential advantages and limitations of the tested conditions.

#### *4.3.3 Procedural versus perceptual learning*

In the present context, procedural learning refers to participants becoming familiar with the various aspects of the localisation task, resulting in a performance improvement that is not due to an accommodation to HRTF-specific cues (perceptual learning). As of yet, there exists no model for *a posteriori* dissociation of the contribution of both types of learning to performance evolution. Intra-study comparisons would most likely not be affected, since one can generally assume that procedural learning has a similar impact on all tested conditions. However, by not allowing procedural learning to plateau before the first evaluation, the generalisation of a study's conclusions becomes problematic when one needs to compare the results of various studies based on different protocols.

Results of control groups generally prove extremely valuable during inter-study comparison. Participants only taking part in the evaluation and not the training, as in **exp-steadman**, can provide good insight on the impact of the evaluation task implementation on performance across experiments. Even better, the inclusion of a control group using their own HRTF, as in **exp-stitt** and **exp-parseihian**, provides a solid baseline to dissociate procedural from perceptual learning during both intra- and inter-study analysis.

Additionally, simple experimental design choices can be applied to avoid having to deal with certain forms of procedural training. The proprioceptive adjustment required for accurately reporting perceived positions [14] can for example be greatly accelerated by using a natural 3D reporting method coupled to a visual pointer [9], as well as providing a reference grid to help orientation in the sphere [31]. Thorough beta testing can further eliminate design flaws that participants can exploit to improve their performance, such as the use of too small a set of test positions, or unconstrained tracking allowing for small head movements during the stimulus presentation phase of the localisation task.

Other aspects of procedural training, such as having participants focus on the listening task, can only be removed by introducing a pre-experimental training session. Such a session was applied in **exp-majdak**, where participants trained for approximately 30 min on a localisation task coupling visual feedback and stereo panning. This pre-experimental training likely contributed to the smooth improvement in great-circle error by **grp-majdak-indiv** from session 2 onward, compared to the disjointed improvement observed for **grp-stitt-indiv** between sessions 2 and 3 in **Figure 6**. Paradoxically, the only limitation of the pre-training proposed in **exp-majdak**, which did not use actual binaural signals, is that it does not familiarise participants with binaural rendering. Pending formal evidence, one may assume that there exists an adaptation process during which participants grow consistent in their localisation estimation, even in the absence of feedback, much like the effect observed on HRTF quality ratings reported by Andreopoulou and Katz [50]. Regardless of whether this adaptation should be labelled as perceptual or procedural training, it will still interfere with the evaluation of training efficiency itself.

Overall, it is reasonable to assume that one could design a pre-training session that accommodates procedural learning in roughly 15 min, even taking into account this last point, and relaxing the time constraint imposed in **exp-majdak**. This session however still takes a non-negligible amount of time, which will contribute to participant fatigue and loss of focus. Because of this, it is likely that most experimental designs will continue to include aspects of procedural learning as a shared effect, equally impacting all tested conditions. An alternative solution would be to conduct a set of studies to measure and model the various aspects of procedural learning in the present context, so that its contribution to performance evolution could be dissociated from that of perceptual improvement even in the absence of a pre-training session.

### **5. Conclusion**

This chapter presented a methodology for the assessment of auditory localisation accuracy in the context of HRTF selection and learning tasks. Based on existing metrics and decomposition schemes, the methodology consists of a series of steps guiding analysis towards the creation of comprehensive and repeatable performance assessments. A case-study was then proposed, comparing the results of five contemporary experiments on HRTF learning and illustrating how the methodology can be applied to better understand participant performances and their evolution.

The initial intent of this chapter was to propose a set of metrics and an analysis workflow that could be adopted and adapted by the community to standardise the evaluation of localisation performance. In time, this standardisation would help simplify the comparison of results from different studies, allowing researchers to assess hypotheses and draw conclusions beyond the scope of the constituting studies. While the proposed case-study provides a glimpse of the benefits of such standardisation, it is limited by one of the major issues of inter-study comparison, if not the major issue: the lack of a reference between tested conditions. Without this reference, conclusions drawn from the analysis can hardly be generalised, much like those that would result from a comparison between language learning techniques without *a priori* knowledge of the participants' learning abilities, or of how different the learnt language is from their mother tongue.

As of now, the only applicable solution to provide such reference across studies is to systematically add a control group composed of participants using their own HRTF to the experiment. A large enough group composed of experts and novices alike would indeed provide a stable reference that can be used to assert a certain equivalence in *e.g.* the evaluation task before proceeding to inter-study performance comparison. However, this solution is rarely practical due to the complexity of the HRTF measurement process, which is the main incentive for HRTF learning in the first place. A somewhat less constraining, yet highly unlikely, scenario would be the creation and adoption of a unique evaluation platform, shared across all studies to formalise future HRTF selection methods and training program comparisons.

With luck, the issue will solve itself as the next generation of HRTF individualisation techniques render selection and training obsolete. In the meantime, methodologies such as the one proposed here should help improve the rigour of studies and consequently the understanding of the fundamental issues regarding auditory localisation and spatial hearing accommodation to non-individual HRTFs and their applications.
