**1. Introduction**

High-dimensional remote sensing imagery, such as hyperspectral (HS) imagery, consists of hundreds of contiguous spectral bands and thus generates huge data volumes. This high dimensionality causes several difficulties. First, the Hughes phenomenon [1] can occur when classifying such data, even though modern classifiers such as support vector machines (SVM) and random forests (RF) are less sensitive to it [2, 3], except when very few training data are available [4]. Second, substantial computing time is required to process such high-dimensional data. Third, storing such data requires huge volumes. Last, displaying high-dimensional imagery can be necessary, while human vision is limited to three colour channels [5, 6].


Hyperspectral data consist of hundreds of contiguous spectral bands, but most of these adjacent bands are highly correlated with each other. Thus, a subset of well-chosen bands is generally sufficient for a specific problem. This makes it possible to design adapted superspectral sensors dedicated to specific land cover classification tasks. **Spectral optimization** (SO) or optimal band extraction (BE) consists in identifying the most relevant spectral band subsets for such specific applications. Spectral optimization is a specific case of dimensionality reduction (DR). DR aims at reducing data volume while minimizing the loss of useful information, especially of class separability. Dimensionality reduction techniques can be separated into **feature extraction** (FE) and **feature selection** (FS) categories.

FE consists in reformulating and summing up the original information. Principal component analysis (PCA), minimum noise fraction (MNF), independent component analysis (ICA) and linear discriminant analysis (LDA) are examples of state-of-the-art feature extraction techniques. By contrast, FS selects the most relevant original features for a problem. When applied to HS data, it is named band selection (BS); compared to FE, it preserves the physical meaning of the selected bands. For instance, in spectroscopy, FS has sometimes been performed by specialists identifying specific absorption bands or spectrum behaviours corresponding to a material, and this knowledge has then been used in expert systems (e.g. [7] for specific minerals, [8] for asbestos, [9] for asphalt or [10] for urban materials). In the end, SO lies at the interface between FS and FE, as it aims at optimizing both band positions along the spectrum (FS) and band widths (FE).
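To make the distinction concrete, here is a minimal sketch contrasting the two families on a placeholder data matrix (the array `X` and the band indices are hypothetical): FE mixes all bands into new axes, while FS merely indexes a subset of the original bands.

```python
# A minimal sketch contrasting feature extraction (FE) and feature selection (FS)
# on a hypothetical hyperspectral data matrix X of shape (n_pixels, n_bands).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))  # placeholder for real hyperspectral pixels

# FE: reformulates the original bands; the new axes mix all bands and
# lose their physical (wavelength) meaning.
X_fe = PCA(n_components=10).fit_transform(X)

# FS: keeps a subset of the original bands; each selected index still
# corresponds to an actual wavelength of the sensor.
selected_bands = [12, 47, 83, 120, 161]  # indices chosen by some FS criterion
X_fs = X[:, selected_bands]

print(X_fe.shape, X_fs.shape)  # (1000, 10) (1000, 5)
```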

This study aims at defining a SO strategy to design superspectral sensors dedicated to specific land cover classification problems. SO and FS are optimization problems involving both a **metric** (that is to say, a score measuring the relevance of band subsets) to optimize and an **optimization** strategy. This study first focuses on the choice of a FS relevance score suitable for generic optimization heuristics. Both classification performance and selection stability will be considered. As an intermediate result, band importance profiles are derived, providing hints about the relevance of the different parts of the spectrum. Once the FS criterion is chosen, this chapter copes with the optimization of bandwidth, applying FS within a hierarchy of groups of adjacent bands.

### **2. FS: requirements and state of the art**

In the state of the art, FS is often a first step in a specific classification workflow, while the context of this work is the design of superspectral sensors dedicated to specific land cover classification problems. Thus, the selected band subset must be as efficient as possible **for most classifiers and not only for the classifier associated with the FS criterion used**. The ability of FS criteria to discriminate between classes using the selected feature subsets (that is to say, their classification performance), independently from any classifier, therefore has to be considered to assess their quality. Furthermore, the **stability** of the proposed solutions also has to be considered. Last but not least, in this sensor design context, constraints about the **maximum** number of bands to select exist. To sum up, a good FS criterion for sensor design has to **be parsimonious, making it possible to select stable band subsets that are discriminant for most classifiers**. Thus, for a fair analysis, FS criteria must be compared for the same selected band subset size, and results must be evaluated according to different classifiers. Besides, computing time was not considered an important factor in this specific context of sensor design, where FS is not a preprocessing step in a classification workflow.
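Selection stability can be quantified in several ways. The sketch below illustrates one common option, not necessarily the exact score used in this chapter: the mean pairwise Jaccard similarity between the band subsets selected over repeated FS runs (the example subsets are hypothetical).

```python
# A sketch of one common stability measure: the mean pairwise Jaccard
# similarity between the band subsets selected over repeated runs
# (e.g. bootstrap resamples of the training set).
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(subsets):
    """subsets: list of band-index lists, one per FS run."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical runs selecting 5 bands each:
runs = [[10, 42, 80, 120, 160], [10, 43, 80, 121, 160], [10, 42, 81, 120, 159]]
print(round(selection_stability(runs), 3))  # 1.0 would mean perfect stability
```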


Thus, this study focuses on the comparison of several FS **criteria** (presented in Section 2.1) for **supervised** classification problems (that is to say, when classes and their ground truth are taken into account). For a fair comparison, all these criteria will be optimized using the **same generic optimization** algorithms. Generic optimization heuristics were chosen here, in the context of sensor design, because they make it easy to control the number of bands to select and to add further constraints to the band extraction process, as in the second part of this study. The use of generic optimization methods necessarily excludes from the comparison feature ranking criteria (such as ReliefF [11, 12]) and FS methods where the score and the optimization method are strongly tied together, such as SVM-RFE [13]. All criteria will be tested on several classic hyperspectral data sets.

#### **2.1 FS: state of the art**

FS methods and criteria are often differentiated between 'filter' (independent from any classifier), 'wrapper' (related to the classification performance of a classifier) and 'embedded' (related to the quality of classification models estimated by a classifier, but not directly to classification accuracy). It is also possible to distinguish supervised and unsupervised criteria, especially for filters, that is to say, according to whether a notion of classes is taken into account or not. All approaches mentioned below are summed up in **Tables 1** and **2**. Nevertheless, it must be kept in mind that hybrid approaches involving criteria from several of these categories often exist, as, for instance, in [14] or [15], where features are selected by a wrapper method, respectively guided by or associated with filter criteria (mutual information between the selected bands and with the ground truth).

#### *2.1.1 Filter*

Filter methods compute relevance scores independently from any classifier. Some filter methods are ranking approaches: features are ranked according to an individual score of importance. Such individual feature scores can be supervised or unsupervised. For instance, the well-known ReliefF score [11, 12] or scores measuring the correlation between features and the ground truth [29] are supervised ones. However, such individual feature importance measures do not take the correlations between the selected features into account. Thus, a feature subset composed of the *n* best features according to such measures is not necessarily an optimal solution, in the sense that it is not parsimonious.
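As an illustration of such ranking approaches, the sketch below ranks bands by a univariate supervised score (mutual information with the labels, as a stand-in for ReliefF or correlation scores) and keeps the *n* best; on real hyperspectral data, the top-ranked bands are often adjacent, redundant bands, which is precisely the non-parsimony pitfall discussed above.

```python
# A sketch of individual (univariate) feature ranking, using mutual
# information as a stand-in for ReliefF or correlation-based scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
scores = mutual_info_classif(X, y, random_state=0)
top_n = np.argsort(scores)[::-1][:5]  # 5 best bands, redundancy is ignored
print(top_n)
```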


#### **Table 1.**

*State of the art of feature selection criteria: the criteria that work with the FS criteria evaluation framework used in this study are underlined [16–47].*



#### **Table 2.**

*Pros and cons for the different families of FS criteria.*


Other ranking methods are unsupervised: they use importance measures calculated from a feature extraction technique. For instance, [48] ranks bands according to a score of importance calculated from a PCA decomposition; correlated bands are then removed according to a divergence measure. Du et al. [49] and Hasanlou et al. [50] follow a similar approach using ICA instead of PCA. Other unsupervised approaches also use the results of a PCA, selecting the features that are the most similar to the first principal component [46, 51].
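A sketch in the spirit of such PCA-based ranking is given below; the exact importance score used in [48] may differ in its details. Bands are scored by the magnitude of their loadings on the leading components, weighted by explained variance.

```python
# A sketch of PCA-based band ranking: score each band by the magnitude of its
# loadings on the leading components, weighted by explained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))  # placeholder hyperspectral pixels

pca = PCA(n_components=5).fit(X)
# pca.components_ has shape (n_components, n_bands)
band_scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = np.argsort(band_scores)[::-1]
print(ranking[:10])  # ten highest-scoring bands
```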

Other filter approaches associate a score with feature subsets. In the unsupervised case, [25] performs a constrained energy minimization to select a set of bands with minimum correlation between one another. In supervised cases, separability measures such as the Bhattacharyya or Jeffries-Matusita (JM) distances can be used to identify the best feature subsets for separating classes [30, 35, 45, 52]. Other separability measures, based on the minimum estimated abundance covariance (related to the ability of the band subset to correctly unmix several sources), have also been used, as in [53].
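For Gaussian class models, the Bhattacharyya distance and the Jeffries-Matusita distance derived from it have closed forms. A minimal sketch, evaluated on a hypothetical band subset with class statistics assumed known:

```python
# A sketch of the Bhattacharyya distance between two Gaussian classes and the
# derived Jeffries-Matusita (JM) distance, evaluated on a candidate band subset.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = diff @ np.linalg.solve(cov, diff) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def jeffries_matusita(mu1, cov1, mu2, cov2):
    return 2.0 * (1.0 - np.exp(-bhattacharyya(mu1, cov1, mu2, cov2)))

# Class statistics estimated on a hypothetical 3-band subset:
mu1, mu2 = np.array([0.2, 0.5, 0.3]), np.array([0.25, 0.45, 0.5])
cov1 = cov2 = 0.01 * np.eye(3)
print(jeffries_matusita(mu1, cov1, mu2, cov2))  # in [0, 2]; 2 = fully separable
```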

Higher-order statistics from information theory, such as divergence, entropy and mutual information, can also be used to select the feature sets achieving minimum redundancy and maximum relevance, either in unsupervised situations as in [6, 22] or in supervised ones as in [14, 17, 54–56]. Martínez-Usó et al. [22] first clusters 'correlated' features and then selects the most representative feature of each group. Le Moan et al. [6] selects three bands belonging to the red, green and blue spectral domains so that their correlation is minimized. In supervised cases, [14, 17, 54, 55, 57] select the set of bands that are the most correlated with the ground truth and the least correlated with each other. The most difficult part is then to balance both criteria.
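The sketch below illustrates this max-relevance/min-redundancy idea with a simple greedy loop, using mutual information with the labels as relevance and the absolute correlation between bands as a redundancy proxy; the trade-off weight `alpha` is an illustrative choice, not a value from the cited works.

```python
# A greedy sketch of the max-relevance/min-redundancy idea: pick bands that are
# informative about the ground truth while weakly correlated with the bands
# already selected.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
relevance = mutual_info_classif(X, y, random_state=0)
corr = np.abs(np.corrcoef(X, rowvar=False))  # redundancy proxy between bands

alpha, n_select = 1.0, 5
selected = [int(np.argmax(relevance))]
while len(selected) < n_select:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    score = [relevance[j] - alpha * corr[j, selected].mean() for j in candidates]
    selected.append(candidates[int(np.argmax(score))])
print(selected)
```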

The orthogonal projection divergence [16] is another way to measure the correlation between bands, through the extent to which one band can be expressed as a linear combination of the already selected bands. Last, [20] uses support vector clustering applied to features in order to identify the most relevant ones.
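A minimal sketch of this projection idea: a candidate band that is well reconstructed as a linear combination of the already selected bands brings little new information, so a large residual suggests a complementary band (the data and indices are placeholders).

```python
# A sketch of the orthogonal-projection idea behind [16]: measure how much of a
# candidate band is NOT explained by the already selected bands.
import numpy as np

def projection_residual(X, selected, candidate):
    """Norm of the part of band `candidate` orthogonal to the selected bands."""
    A = X[:, selected]                      # (n_pixels, n_selected)
    b = X[:, candidate]                     # (n_pixels,)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.linalg.norm(b - A @ coef)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))
print(projection_residual(X, selected=[10, 40, 70], candidate=55))
```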

To sum up, there is a wide variety of filter criteria corresponding to different approaches. Ranking methods based on an individual feature importance score remain limited, especially those only based on a supervised score, since they are not aware of the dependencies between the selected features. Filter approaches associating a score with feature subsets are more interesting. Supervised and unsupervised approaches can be distinguished. Unsupervised approaches are interesting, but in a classification context, there is still a risk of selecting features that will not all be useful for the classification problem.

#### *2.1.2 Wrapper*

The wrapper relevance score associated with a feature set is simply its classification performance (measured by an accuracy score). Examples of such scores can be found in [14, 15, 58, 59] using an SVM classifier, in [60, 61] a maximum likelihood classifier, in [21] random forests, in [46] a spectral angle mapper or in [26] a target detection algorithm.
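A minimal sketch of such a wrapper score, assuming a synthetic data set and an SVM as in [14, 15, 58, 59]; the relevance of a band subset is its cross-validated accuracy.

```python
# A sketch of a wrapper score: the relevance of a band subset is simply the
# cross-validated accuracy of a classifier trained on those bands only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

def wrapper_score(band_subset):
    return cross_val_score(SVC(), X[:, band_subset], y, cv=5).mean()

print(wrapper_score([3, 11, 27]))  # hypothetical candidate subset
```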

#### *2.1.3 Embedded*

Embedded FS methods are also related to a classifier, but feature selection is performed using a feature relevance score different from classification accuracy. Most of the time, embedded approaches directly select features during the classifier training step. Several types of embedded FS approaches can be distinguished [62].

Some embedded approaches are regularization-based models. A classifier is trained according to an objective function where a fit-to-data term minimizing the classification error is associated with a regularization function, penalizing models when the number of features increases or forcing the model coefficients associated with some features to be small. Features with coefficients close to 0 are eliminated. Examples of such approaches can be found in [23, 31, 63]; they include the L1-SVM [64] and the least absolute shrinkage and selection operator (LASSO) [18, 63] approaches. Such approaches are fast and efficient. However, they can be more difficult to adapt, for instance, to take additional constraints into account, since the FS criterion and the optimization method are linked.
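A minimal sketch of this regularization-based selection, here with an L1-penalized linear SVM (in the spirit of the L1-SVM [64]; a LASSO regressor could be substituted): bands whose coefficients are driven to zero are discarded.

```python
# A sketch of embedded FS via L1 regularization: the penalty drives many
# coefficients to zero, and the surviving bands are the selected ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
model = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=10000).fit(X, y)
selected = np.flatnonzero(np.any(np.abs(model.coef_) > 1e-6, axis=0))
print(selected)  # bands with non-zero coefficients
```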



Other embedded approaches use the built-in feature selection mechanism of the training algorithm of some classifiers. For instance, random forests (RF) [41] and decision trees can be considered as performing an embedded feature selection since, when splitting a tree node, only the most discriminative feature according to the Gini impurity criterion is used among a randomly selected feature subset [41]. This FS eliminates the less useful features, but there is no guarantee of selecting a parsimonious feature subset: redundant features can be selected.

Some embedded approaches also provide feature importance measures, such as the random forest classifier [41]. Its importance measure is computed on the samples left out of the bootstrap samples (out-of-bag samples) and is based on the accuracy decrease under permutation: the importance of a feature is estimated, for each tree, by randomly permuting its values in these samples, and is obtained as the difference between the prediction accuracy before and after permuting this feature, averaged over all trees. Other embedded approaches providing feature importance measures use them in a pruning process that first uses all features to train a model, before progressively eliminating some of them while maintaining model performance. SVM-RFE [13] is a well-known embedded approach where the importance of the different features in a SVM model is considered. This approach has been extended to multiple kernel SVM in [32], associating a different kernel with each feature, estimating the model and then using the weights associated with these kernels as feature importance measures.
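The sketch below illustrates permutation importance with a random forest. Note that the original RF measure described above is computed on the out-of-bag samples of each tree, whereas scikit-learn's variant used here works on a held-out set; the principle (accuracy drop after permuting one feature) is the same.

```python
# A sketch of permutation-based feature importance with a random forest,
# computed on a held-out set rather than on out-of-bag samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.argsort()[::-1][:5])  # five most important bands
```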

Other approaches do not calculate a score of importance for each feature individually but evaluate the relevance of sets of features. Such scores often measure the generalization performance of the obtained model. Thus, the FS is not directly performed during the training step but uses an intermediate result of the training step. For instance, [37, 59] use a generalization performance measure, namely the margin of an SVM classifier, as a separability measure to rank sets of features. The out-of-bag error rate of a random forest [41] can also be considered as such a score. These scores are calculated for feature subsets and measure the generalization performance of the model provided by the classifier. Thus, they can be considered as a middle ground between filter separability measures and wrapper scores.
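A minimal sketch of the out-of-bag error used as a subset score: since the OOB estimate is computed on samples left out of each bootstrap, no separate validation set is needed.

```python
# A sketch of the out-of-bag (OOB) accuracy of a random forest [41] used as a
# relevance score for a band subset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

def oob_subset_score(band_subset):
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X[:, band_subset], y)
    return rf.oob_score_  # OOB accuracy; 1 - oob_score_ is the OOB error

print(oob_subset_score([3, 11, 27]))  # hypothetical candidate subset
```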

Embedded approaches can also be extended to unmixing methods, as, for instance, in [43] where band selection is integrated into an endmember and abundance determination algorithm by incorporating band weights and a band sparsity term into an objective function.

#### *2.1.4 Optimization methods*

Another issue for band selection is to determine the best set of features according to a given criterion. An exhaustive search is often impossible, especially for wrapper techniques. Therefore, heuristics have been proposed to find a near-optimal solution without visiting the entire solution space. Optimization methods can be either specific to a FS method (as for most embedded ones) or generic. Generic optimization methods can be divided into two groups: sequential and stochastic.

Several incremental search strategies are detailed in [44], including the sequential forward search (SFS), which starts from one feature and incrementally adds the feature yielding the best score, and, conversely, the sequential backward search (SBS), which starts from all possible features and incrementally removes the worst feature. Variants such as the sequential forward floating search (SFFS) or the sequential backward floating search (SBFS) are also proposed in [44]. Serpico and Bruzzone [24] also propose a variant of these methods called steepest ascent (SA) algorithms.
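A minimal sketch of SFS with a wrapper-style subset score (any of the filter subset scores above could be plugged in instead); the data are synthetic placeholders.

```python
# A sketch of the sequential forward search (SFS): start from the empty set
# and greedily add the band that most improves the subset criterion.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
score = lambda subset: cross_val_score(SVC(), X[:, subset], y, cv=3).mean()

selected = []
while len(selected) < 5:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    best = max(candidates, key=lambda j: score(selected + [j]))
    selected.append(best)
print(selected)
```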

Several stochastic optimization strategies have also been used for feature selection, including genetic algorithms [14, 15, 26, 37, 59], particle swarm optimization (PSO) [53, 58], clonal selection [61], ant colony optimization [65] or even simulated annealing [30, 40].
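As an illustration of such stochastic strategies, the sketch below runs simulated annealing over fixed-size band subsets: a random one-band swap is accepted if it improves the criterion, or with a temperature-dependent probability otherwise. The toy criterion and the cooling schedule are illustrative choices, not those of the cited works.

```python
# A sketch of simulated annealing over fixed-size band subsets.
import math
import random

def simulated_annealing(score, n_bands, k, n_iter=2000, t0=1.0, cooling=0.995):
    random.seed(0)
    current = random.sample(range(n_bands), k)
    best, t = list(current), t0
    for _ in range(n_iter):
        # Neighbour: swap one selected band for an unselected one.
        neighbour = list(current)
        neighbour[random.randrange(k)] = random.choice(
            [b for b in range(n_bands) if b not in current])
        delta = score(neighbour) - score(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = neighbour
            if score(current) > score(best):
                best = list(current)
        t *= cooling  # geometric cooling schedule
    return best

# Toy criterion rewarding bands spread along the spectrum:
spread = lambda s: min(abs(a - b) for a in s for b in s if a != b)
print(simulated_annealing(spread, n_bands=100, k=5))
```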

In the specific case of hyperspectral data, adjacent bands are often highly correlated with each other. Thus, hyperspectral band selection faces the problem of the clustering of the spectral bands. Band clustering/grouping has sometimes been performed in association with individual band selection. For instance, [15] first groups adjacent bands according to conditional mutual information and then performs band selection with the constraint that only one band can be selected per cluster. Su et al. [66] performs band clustering by applying k-means to the band correlation matrix and then iteratively removes the clusters that are too inhomogeneous and the bands that are too different from the representative of their cluster. Martínez-Usó et al. [22] first clusters 'correlated' features and then selects the most representative feature of each group, according to the mutual information. Chang et al. [40] performs band clustering using a more global criterion taking specifically into account the existence of several classes. Simulated annealing is used to maximise a cost function defined as the sum, over all clusters and over all classes, of the correlation coefficients between bands belonging to the same cluster.
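A sketch in the spirit of [66] is given below: k-means is applied to the rows of the band correlation matrix, and one representative band is kept per cluster (the band closest to its cluster centre); the details differ from the original method.

```python
# A sketch of band clustering: k-means on the rows of the band correlation
# matrix, keeping one representative band per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))          # placeholder hyperspectral pixels
corr = np.corrcoef(X, rowvar=False)       # (n_bands, n_bands) correlation

k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(corr)
representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(corr[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[dists.argmin()]))
print(sorted(representatives))            # one band per cluster
```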
### **3. Which band selection criterion?**

This study is a comparison of **FS criteria that can be optimized using generic optimization heuristics**, thus excluding several specific embedded or ranking approaches. The following FS criteria (listed in **Table 3**) were evaluated.

#### **3.1 Compared FS criteria**

#### *3.1.1 Filter FS criteria*

Filter criteria are independent from any classifier. Only scores assessing the relevance of feature subsets were considered, excluding filter FS methods ranking features independently according to an individual feature score (e.g. ReliefF).

*3.1.1.1 Separability*

Separability measures are used to identify the feature subsets achieving the best class distinction. The Fisher, Bhattacharyya and Jeffries-Matusita measures [30, 35, 45, 52] are such scores. They were used assuming Gaussian class models.

Let $\vec{\mu}_i$ and $\Sigma_i$ be the mean vector and covariance matrix of the spectral distribution of class $i$. The Fisher separability between classes $i$ and $j$ is defined in equation (1):

$$F_{ij} = \frac{\left(\vec{w}^{\,t}\,(\vec{\mu}_i - \vec{\mu}_j)\right)^2}{\vec{w}^{\,t}\,(\Sigma_i + \Sigma_j)\,\vec{w}} \qquad (1)$$

where $\vec{w} = (\Sigma_i + \Sigma_j)^{-1}\,(\vec{\mu}_i - \vec{\mu}_j)$.

The Bhattacharyya separability between classes $i$ and $j$ is defined by equation (2).
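As an illustration, the sketch below computes the Fisher separability of equation (1) for two hypothetical Gaussian class models estimated on a candidate band subset.

```python
# A direct sketch of the Fisher separability of equation (1), with w taken as
# the optimal projection direction (Sigma_i + Sigma_j)^(-1) (mu_i - mu_j).
import numpy as np

def fisher_separability(mu_i, cov_i, mu_j, cov_j):
    diff = mu_i - mu_j
    w = np.linalg.solve(cov_i + cov_j, diff)
    return (w @ diff) ** 2 / (w @ (cov_i + cov_j) @ w)

# Hypothetical Gaussian class models on a 3-band subset:
mu_i, mu_j = np.array([0.2, 0.5, 0.3]), np.array([0.25, 0.45, 0.5])
cov_i = cov_j = 0.01 * np.eye(3)
print(fisher_separability(mu_i, cov_i, mu_j, cov_j))
```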