**Variable Selection and Feature Extraction Through Artificial Intelligence Techniques**

Silvia Cateni, Marco Vannucci, Marco Vannocci and Valentina Colla

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/53862

#### **1. Introduction**


The issue of variable selection has been widely investigated for different purposes, such as clustering, classification or function approximation, and it has become the focus of many research works dealing with datasets that can contain hundreds or thousands of variables. The subset of the potential input variables can be defined through two different approaches: feature selection and feature extraction. Feature selection reduces dimensionality by selecting a subset of the original input variables, while feature extraction performs a transformation of the original variables in order to generate other, more significant features. When the considered data have a large number of features, it is useful to reduce them in order to improve the data analysis. In extreme situations the number of variables can even exceed the number of available samples, causing the so-called *curse of dimensionality* [1], which leads to a decrease in the accuracy of the considered learning algorithm as the number of features increases. The main reasons for seeking data reduction include the need to reduce the calculation time of a given learning algorithm and to improve its accuracy [2], but also to deepen the knowledge of the considered problem by discovering which factors actually affect it. A large number of contributions based on artificial intelligence, genetic algorithms and statistical approaches have been proposed in order to develop novel, efficient variable selection methods suitable for many application areas. Sections 2 and 3 provide a preliminary review of traditional and Artificial Intelligence-based feature extraction and variable selection techniques, in order to demonstrate that Artificial Intelligence techniques are often capable of outperforming the widely adopted traditional methods, thanks to their flexibility and their ability to self-adapt to the characteristics of the available dataset. Finally, in Section 4 some concluding remarks are provided.

© 2012 Cateni et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **2. Feature extraction**

Feature extraction is a process that transforms high-dimensional data into a lower-dimensional feature space through the application of some mapping. Brian Ripley [3] gives the following definition of the feature extraction problem:

"*Feature extraction is generally used to mean the construction of linear combinations αTx of continuous features which have good discriminatory power between classes*".


In Neural Network research, as well as in other disciplines belonging to the Artificial Intelligence area, an important problem is finding a suitable representation of multivariate data. Feature extraction is used in this context to reduce the complexity of the data and to give a simpler representation, expressing each component in the feature space as a linear combination of the original input variables. If the extracted features are suitably selected, then it is possible to work with the relevant information of the input data using a reduced dataset. The most popular feature extraction technique is the Principal Component Analysis (PCA), but many alternatives have been proposed in recent years. In the following sub-sections several feature extraction approaches are presented.

#### **2.1. Principal Component Analysis**

The Principal Component Analysis (PCA) was introduced by Karl Pearson in 1901 [4]. PCA consists of an orthogonal transformation that converts samples of correlated variables into samples of linearly uncorrelated features. The new features are called *principal components* and their number is less than or equal to the number of initial variables. If data are normally distributed, then the principal components are independent. PCA mathematically transforms data by referring them to a different coordinate system, such that the greatest variance lies on the first coordinate, the second greatest variance on the second coordinate, and so on [5]. Figure 1 shows an example of PCA in 2D. The original coordinate system (x, y) is transformed into the feature space (x', y') in order to have the maximum variance in the x' direction.

**Figure 1.** Example of PCA in 2D.

The main reason for the use of PCA is the fact that PCA is a simple non-parametric method for extracting the most relevant information from a set of redundant or noisy data. The method reduces the number of available variables by eliminating the last principal components, which do not significantly contribute to the observed variability. Moreover, PCA is a linear transformation of the data that minimizes the redundancy (measured through the covariance) and maximizes the information (measured through the variance). The principal components are new variables with the following properties:

1. each principal component is a linear combination of the original variables;
2. the principal components are uncorrelated to each other, and the redundant information is removed.
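As an illustration, the following minimal sketch (in Python with NumPy; the function name `pca` and the toy data are only illustrative) computes the principal components by eigen-decomposition of the covariance matrix and projects the centred data onto the leading directions:

```python
import numpy as np

def pca(X, n_components):
    # Centre the data: PCA is defined on zero-mean variables.
    X_centred = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features).
    cov = np.cov(X_centred, rowvar=False)
    # eigh suits symmetric matrices; eigenvalues come in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reverse so that the directions of greatest variance come first.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # Project the centred samples onto the retained principal axes.
    return X_centred @ components

# Toy example: 100 samples of 5 variables, one of which is redundant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)  # nearly a copy of X[:, 0]
Z = pca(X, 2)
print(Z.shape)  # (100, 2): the dataset reduced to two uncorrelated features
```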


#### **2.2. Linear Discriminant Analysis**


While PCA is unsupervised (i.e. it does not take class labels into account), Linear Discriminant Analysis (LDA) is a popular supervised technique which is widely used in computer vision, pattern recognition, machine learning and other related fields [6]. LDA computes an optimal projection by maximizing the distance between classes and, at the same time, minimizing the distance between samples within each class [7]. This approach reduces the dimensionality while preserving as much of the class discriminatory information as possible. The main limitation of the approach lies in the fact that it can produce only a limited number of feature projections, equal to the number of classes minus one. If more features are needed, some other method should be employed. Moreover, LDA is a parametric method and it fails if the discriminatory information lies not in the mean values but in the variance of the data. When the dimensionality of the data exceeds the number of samples, a situation known as the *singularity problem*, Linear Discriminant Analysis is not an appropriate method. In these cases the data dimensionality can be reduced by applying the PCA technique before LDA; this approach is called PCA+LDA [8, 9]. Other solutions dealing with the singularity problem include regularized LDA (RLDA) [10], null space LDA (NLDA) [11], the orthogonal centroid method (OCM) [12] and uncorrelated LDA (ULDA) [13].
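To make the projection concrete, here is a small sketch (again Python/NumPy; the function name `lda_projection` is illustrative, and the pseudo-inverse is used only as a simple numerical guard, not as a remedy for the singularity problem discussed above). It builds the within-class and between-class scatter matrices and keeps at most C − 1 directions:

```python
import numpy as np

def lda_projection(X, y, n_components):
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_total = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_total).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)
    # Directions maximizing between-class vs. within-class scatter:
    # eigenvectors of S_w^{-1} S_b with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    k = min(n_components, len(classes) - 1)  # at most C - 1 projections
    return eigvecs[:, order[:k]].real        # projection matrix W

# Usage: Z = X @ lda_projection(X, y, 2) projects the samples onto the
# discriminant directions.
```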

#### **2.3. Latent Semantic Analysis**

Latent Semantic Analysis (LSA) was introduced by Deerwester et al. in 1990 [14] as a variant of the PCA concept. LSA was first presented as a text analysis method in which the features are represented by the terms occurring in the considered text [2]. Subsequently LSA has been employed in image analysis [15], video data analysis [16] and music or audio analysis [17]. The main objective of the LSA process is to produce a mapping into a "latent semantic space", also called *Latent Topic Space*. LSA exploits co-occurrences of terms in documents to provide a mapping into the latent topic space, where documents can be connected even if they have few terms in common in the original space. Recently Chen et al. [18] proposed a new method called Sparse Latent Semantic Analysis, which selects only a few relevant words for each topic, giving a compact representation of topic-word relationships. The main advantage of this approach lies in its computational efficiency and in the low memory required for storing the projection matrix. In [18] the authors compare Sparse Latent Semantic Analysis with LSA and LDA through experiments on different real-world datasets. The obtained results demonstrate that Sparse LSA achieves performance similar to LSA, but it is more efficient in the projection computation and storage, and it better explains the topic-word relationships.
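In practice LSA is usually computed through a truncated singular value decomposition of the term-document matrix. The following toy sketch (Python/NumPy, with a made-up 4-term, 4-document count matrix used purely for illustration) shows the idea:

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: A is approximated by U_k * S_k * V_k^T with k latent topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # number of latent topics kept
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the latent topic space
term_coords = U[:, :k] * s[:k]             # terms mapped into the same space

# Two documents sharing few terms in the original space can still be close
# in the latent space if their terms co-occur across the collection.
print(doc_coords)
```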


#### **3. Variable selection**

Unlike feature extraction methods, variable selection methods do not transform data into a new set. Variable selection points out all the inputs affecting the phenomenon under consideration and is an important data pre-processing step in different fields such as machine learning [25, 26], pattern recognition [27, 28], data mining [29], medical data analysis [30] and many others. Variable selection has been widely performed in applications such as function approximation [31], classification [32-34] and clustering [35]. The difficulty of extracting the most relevant variables is mainly due to the large dimension of the original variable set, to the correlations between inputs, which cause redundancy, and finally to the presence of variables which do not affect the considered phenomenon and thus, for instance in the case of the development of a model predicting the output of a given system, do not have any predictive power [36]. In order to select the optimal subset of input variables, the following key considerations should be taken into account:

- **Relevance.** The number of selected variables must be checked in order to avoid selecting too few variables, which would not convey all the relevant information.
- **Computational efficiency.** If the number of selected input variables is too high, then the computational burden increases. This is evident when an artificial neural network is employed. Moreover, including redundant and irrelevant variables makes the training of an artificial neural network more difficult, because irrelevant variables add noise and slow down the training of the network.
- **Knowledge improvement.** The optimal selection of input variables contributes to a deeper understanding of the process behaviour.

To sum up, the optimal set of input variables will contain the fewest number of variables needed to describe the behaviour of the considered system or phenomenon, with minimum redundancy and with informative variables. If the optimal set of input variables is identified, then a more accurate, efficient, inexpensive and more easily interpretable model can be built.

In the literature, variable selection methods are classified into three categories: filter, wrapper and embedded methods.

#### **3.1. Filter approach**

The filter approach is a pre-processing phase which is independent of the learning algorithm that is adopted to tune and/or build the system (e.g. a predictive model) that exploits the selected variables as inputs. Filters are computationally convenient, but they can be affected by overfitting problems. Figure 2 shows a generic scheme of the approach.

**Figure 2.** Generic scheme of filter methods.
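As a concrete example of a filter, the sketch below (Python/NumPy; the correlation-based criterion is just one simple choice among many, and the function name and toy data are illustrative) ranks the candidate variables by the absolute value of their Pearson correlation with the target, independently of any downstream learning algorithm:

```python
import numpy as np

def correlation_filter(X, y, k):
    # Score each variable by |Pearson correlation| with the target y.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    # Return the indices of the k highest-scoring variables.
    return np.argsort(scores)[::-1][:k]

# Toy example: the target depends only on variables 0 and 4.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - X[:, 4] + 0.1 * rng.normal(size=200)
print(correlation_filter(X, y, 2))  # expected: variables 0 and 4
```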
