**2.** Construct the *t* = 1…*T* trees.

**a.** Create a training set *T*<sup>t</sup> with *N* examples by sampling with replacement the original data set. The out‐of‐bag data set *O*<sup>t</sup> is formed with the remaining examples of *T* not belonging to *T*<sup>t</sup>.

**b.** Perform the split of node *i* by testing one of the *L* = √*F* randomly selected features.

**c.** Repeat step 2b until the tree *t* is complete. All nodes are terminal nodes (leaves) if their number *n*<sub>s</sub> of examples is *n*<sub>s</sub> ≤ 0.1*N*.

**3.** Repeat step 2 to grow the next tree if *t* ≠ *T*. In this work, *T* = 500 decision trees were employed.

After training, the importance *r*<sub>m</sub> of each feature *f*<sub>m</sub> in the ensemble of trees can be computed by adding the values of *ΔG*(*i*) of all nodes *i* where the feature *f*<sub>m</sub> is used to perform a split. Sorting the values *r* by decreasing order, it is possible to identify the relative importance of the features. The *F* = 20 least relevant features are eliminated from the feature vector *f*. The whole procedure is sketched below.
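Read as pseudocode, steps 2–3 and the importance ranking map directly onto an off‐the‐shelf random forest. The following is a minimal sketch assuming scikit-learn's `RandomForestClassifier` and synthetic placeholder data: `X`, `y` and all dimensions are illustrative assumptions, not the chapter's actual EEG features, and `feature_importances_` stands in for the values *r*<sub>m</sub>.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-ins: X would hold the feature vectors f extracted from
# the EEG recordings, y the binary condition labels. The data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # N = 200 examples, F = 64 features (assumed)
y = rng.integers(0, 2, size=200)

N = len(X)
forest = RandomForestClassifier(
    n_estimators=500,                     # T = 500 trees, as in this work
    max_features="sqrt",                  # test L = sqrt(F) random features per split
    min_samples_split=int(0.1 * N) + 1,   # a node with n_s <= 0.1 N becomes a leaf
    bootstrap=True,                       # each tree sees a resample T^t; rest is out-of-bag
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

# r_m: the accumulated Gini decrease Delta-G(i) over all nodes i that split on f_m.
r = forest.feature_importances_
ranking = np.argsort(r)[::-1]             # decreasing relative importance

# Eliminate the 20 least relevant features from the feature vector f.
keep = np.sort(ranking[:-20])
X_reduced = X[:, keep]
print(X_reduced.shape)                    # (200, 44)
```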

*3.3.2. Linear SVM*

Linear SVM parameters define decision hyperplanes or hypersurfaces in the multidimensional feature space [41, 42], that is:

*g*(*x*) = *w*<sup>T</sup>*x* + *b* = 0 (4)

where *x* ≡ *f* denotes the vector of features, *w* is known as the weight vector and *b* is the threshold. The optimization task consists of finding the unknown parameters *w*<sub>m</sub>, *m* = 1,…,*F*, and *b* [43]. The position of the decision hyperplane is determined by *w* and *b*: the vector is orthogonal to the decision plane and *b* determines its distance to the origin. For the linear SVM the vector *w* can be explicitly computed, and this constitutes an advantage as it decreases the complexity during the test phase. With the optimization algorithm, the Lagrangian values 0 ≤ λ<sub>i</sub> ≤ *C* are estimated [43]. The training examples related with the nonzero Lagrangian coefficients are known as support vectors. The weight vector can then be computed as

*w* = ∑<sub>i=1</sub><sup>N<sub>s</sub></sup> λ<sub>i</sub> *y*<sub>i</sub> *x*<sub>i</sub> (5)

where *N*<sub>s</sub> is the number of support vectors and (*x*<sub>i</sub>, *y*<sub>i</sub>) is a support vector and its corresponding label {−1, 1}. The threshold *b* is estimated as an average over the projected support vectors *w*<sup>T</sup>*x*<sub>i</sub> corresponding to λ<sub>i</sub> ≠ 0.
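Equation (5) can be checked numerically. The sketch below assumes scikit-learn's `SVC` with a linear kernel and synthetic toy data; `dual_coef_` stores the products λ<sub>i</sub>*y*<sub>i</sub> for the support vectors, so the sum in Eq. (5) reduces to a single matrix product, and *b* is recovered by averaging over the projected support vectors as described above.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic classes with labels in {-1, 1} (assumed toy data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 5)),
               rng.normal(+2.0, 1.0, size=(50, 5))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Eq. (5): w = sum_i lambda_i * y_i * x_i over the N_s support vectors.
# SVC stores lambda_i * y_i in dual_coef_ and the x_i in support_vectors_.
w = clf.dual_coef_ @ clf.support_vectors_     # shape (1, F)
assert np.allclose(w, clf.coef_)              # matches the explicit weight vector

# Threshold b: average over projected support vectors w^T x_i. Support vectors
# with lambda_i strictly below C sit exactly on the margin, where y_i - w^T x_i = b.
free = np.abs(clf.dual_coef_).ravel() < clf.C - 1e-8
sv_y = y[clf.support_]
b = np.mean(sv_y[free] - clf.support_vectors_[free] @ w.ravel())
print(b, clf.intercept_[0])                   # the two estimates agree
```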

The value of *C* needs to be assigned to run the training optimization algorithm and controls the number of errors allowed versus the margin width. During the optimization process, *C* represents the weight of the penalty term of the optimization function that is related to the misclassification error in the training set. There is no optimal procedure to assign this parameter, but it has to be expected that:

– If *C* is large, the misclassification errors are relevant during optimization. A narrow margin has to be expected.

– If *C* is small, the misclassification errors are not relevant during optimization. A large margin has to be expected (this trade‐off is illustrated in the sketch below).
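Since the margin width equals 2/‖*w*‖, this expected behaviour of *C* can be verified directly. A minimal sketch, again assuming scikit-learn and synthetic overlapping classes (so that misclassification errors actually occur):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Two overlapping classes: some training errors are unavoidable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)        # margin width = 2 / ||w||
    n_errors = np.sum(clf.predict(X) != y)
    print(f"C={C:>6}: margin width = {margin:.3f}, training errors = {n_errors}")
# Expected trend: small C -> wide margin, errors tolerated;
# large C -> narrow margin, errors penalised heavily.
```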

**4. Results and discussion**

To ease interpreting the following results, it is possible to link wrapper methods to some statistical contrasts (e.g., *t*‐test) used by psychologists to test which EEG features change depending on the experimental condition. Note that in both cases the goal is to perform a dimension reduction before the classification step. Yet another alternative methodology would consist of transforming the initial vector of features to a low‐dimensional space by performing a singular value decomposition [45]. All these approaches aim to reduce the dimension of an initial feature vector. In the statistical analysis, each feature is taken individually, and the significant features are identified by comparing two sets of values belonging to two different conditions and checking a statistical value. In the machine learning approach, the dimension reduction is achieved according to a parameter taken from a classifier, which reflects how each feature influences the classification outcome. Classification techniques have the advantage of dealing with the set of features as a whole, without needing a complementary observation (belonging to another condition). Therefore, the results obtained by wrapper methods can complement the conclusions drawn by applying other statistical tests, indicating the most relevant features related to specific processes, e.g., affective valence processing.
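To make the parallel concrete, the following sketch contrasts the two selection criteria on assumed synthetic data in which only the first five features differ between conditions: a per‐feature *t*‐test (the statistical contrast) versus the magnitude of the linear‐SVM weights (a classifier‐derived criterion in the spirit of the wrapper method). Both the data and the |*w*<sub>m</sub>| criterion are illustrative assumptions, not the chapter's exact procedure.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Toy data: 40 features, only the first 5 differ between the two conditions.
X_a = rng.normal(0.0, 1.0, size=(60, 40))
X_b = rng.normal(0.0, 1.0, size=(60, 40))
X_b[:, :5] += 1.0
X = np.vstack([X_a, X_b])
y = np.array([-1] * 60 + [1] * 60)

# Statistical contrast: each feature tested individually across conditions.
t_vals, _ = ttest_ind(X_a, X_b, axis=0)
stat_rank = np.argsort(-np.abs(t_vals))

# Classifier-derived criterion: features ranked by |w_m| from a linear SVM
# trained on the whole feature set at once.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
svm_rank = np.argsort(-np.abs(clf.coef_.ravel()))

print("top 5 by t-test:", stat_rank[:5])
print("top 5 by |w_m| :", svm_rank[:5])   # both should recover features 0-4
```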

Considering feature elimination and the concomitant number of relevant features, as can be seen from **Figures 3**–**6**, the accuracy of the wrapper classifiers improves with a decreasing number of relevant features in both inter‐ and intrasubject classification strategies. In all cases, the system achieves an 80% accuracy rate using random forest, whereas it reaches values close to 100% by means of SVM when the classifier has fewer than 100 relevant features as input.
