**5. Robust estimator and outlier detection in high-dimensional medical imaging**

The statistical analysis of medical images is challenging, not only because of the high dimensionality and low signal-to-noise ratio of the data, but also because of the variety of errors in the image acquisition process, such as scanner instabilities, acquisition artifacts, and issues associated with the experimental protocol [44]. Furthermore, the populations under study typically present high variability [45, 46], so the corresponding imaging data may contain uncommon though technically correct observations. Such outliers deviating from normality can be numerous. With the emergence of large medical imaging databases, developing automated outlier detection methods has become a critical preprocessing step for any subsequent statistical analysis or group study. In addition, medical imaging data are usually strongly correlated [47]; outlier detection approaches based on multivariate models are therefore crucial and desirable. However, procedures using the classical MCD (Minimum Covariance Determinant) estimator are not well suited to such high-dimensional data.
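To make the multivariate setting concrete, the following is a minimal sketch of MCD-based outlier detection on low-dimensional toy data, using scikit-learn's `MinCovDet` and a chi-squared cutoff on the robust Mahalanobis distances (the data, the 97.5% cutoff, and the outlier locations are illustrative choices, not taken from the works cited here):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# 200 inliers from a strongly correlated 2-D Gaussian, plus 10 gross outliers
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
X_in = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X_out = rng.multivariate_normal([5.0, -5.0], np.eye(2), size=10)
X = np.vstack([X_in, X_out])

# Robust location/scatter via the Minimum Covariance Determinant
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)  # squared robust Mahalanobis distances

# Flag points beyond the 97.5% quantile of a chi2 with p degrees of freedom
threshold = chi2.ppf(0.975, df=X.shape[1])
outliers = d2 > threshold
print(outliers.sum())
```

Because the MCD fits the scatter on the "cleanest" half of the data, the outliers do not inflate the estimated covariance, so their Mahalanobis distances remain large. This breaks down when the feature count approaches or exceeds the sample size, which is exactly the regime the regularized extensions below address.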

In [48], several extensions to the classical outlier detection framework are proposed to handle high-dimensional imaging data. Specifically, the MCD robust estimator was modified so that it can detect outliers when the number of observations is small compared to the number of available features. This is achieved by introducing regularization into the definition and estimation of the MCD. Three regularization procedures were presented and compared: *l*<sub>1</sub> regularization (*RMCD*-*l*<sub>1</sub>); *l*<sub>2</sub> or ridge regularization (*RMCD*-*l*<sub>2</sub>); and random projections (*RMCD*-*RP*). The idea of *RMCD*-*RP* is to run the MCD estimator on datasets of reduced dimensionality, obtained by projecting the data onto randomly selected subspaces. In addition, the parametric approach of the regularized MCD estimators is compared to a non-parametric procedure, the One-Class SVM algorithm (see Section 4). Experimental results on both simulated and real data show that *l*<sub>2</sub> regularization generally performs well in simulations, but random projections outperform it on non-Gaussian data and, more importantly, on real neuroimaging data. One-Class SVM works well on unimodal datasets and has strong potential if its parameters can be set correctly.
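The random-projection idea can be sketched as follows. This is not the exact procedure of [48]: the function `rmcd_rp_scores`, the subspace dimension, the number of projections, and the aggregation of per-projection p-values are all illustrative assumptions, using scikit-learn's `GaussianRandomProjection` and `MinCovDet`:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.random_projection import GaussianRandomProjection


def rmcd_rp_scores(X, n_projections=10, n_components=5, seed=0):
    """Hypothetical RMCD-RP sketch: run the MCD in several random
    low-dimensional subspaces and average the resulting outlyingness
    evidence.  Low scores indicate likely outliers."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[0])
    for _ in range(n_projections):
        proj = GaussianRandomProjection(
            n_components=n_components,
            random_state=int(rng.integers(1 << 30)),
        )
        Xp = proj.fit_transform(X)           # project to a random subspace
        mcd = MinCovDet(random_state=0).fit(Xp)
        # Convert robust distances to chi2 tail probabilities so that
        # scores are comparable across projections
        scores += chi2.sf(mcd.mahalanobis(Xp), df=n_components)
    return scores / n_projections


# High-dimensional toy data: n = 50 observations, p = 100 features,
# so the plain MCD is not applicable in the original space
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 100))
X[:3] += 6.0  # three gross outliers
s = rmcd_rp_scores(X)
print(np.argsort(s)[:3])  # indices with the smallest (most outlying) scores
```

The key point is that each MCD fit happens in a space where observations outnumber features, while averaging over several random subspaces reduces the chance that an outlier direction is lost by any single projection.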

The outlier detection methods described above can serve as a statistical control on subject inclusion in neuroimaging. However, whether outliers should be discarded at all, and if so with what tolerance, is sometimes controversial. An alternative strategy is to use outlier-resistant techniques for statistical inference, which also compensate for inexact assumptions such as data normality and dataset homogeneity. Robust techniques are especially useful when a large number of regressions are tested and the assumptions cannot be checked for each individual regression, as is the case with neuroimaging data.

Both individual-subject and group analyses are required in neuroimaging. At the single-subject level, a multiple regression model is typically fit to the time-series data at each voxel [49, 50], and outliers (or other assumption violations) in the time series degrade the model fit. Robust regression can minimize the influence of these outliers. At the group level, after spatial normalization, a common strategy is to first save the regression parameters for each subject at each voxel and then perform a test on the parameter values. Robust regression at this level can minimize the influence of outlying subjects. Wager et al. [51] used simulations to evaluate several robust techniques against ordinary least squares regression, and applied robust regression to second-level group analyses in three real fMRI datasets. Their results demonstrate that robust Iteratively Reweighted Least Squares (IRLS) at the second level is computationally efficient; it increases statistical power and decreases false-positive rates when outliers are present, and when outliers are absent it still controls false-positive rates at an appropriate level. In summary, IRLS shows significant advantages both in group data analysis and in estimating the hemodynamic response shape from fMRI time-series data.
