meant to automatically recognize emotions. Interestingly, a recent review about affective computing systems [7] emphasizes the advantages of using frequency-based features instead of the ERP components.

In our valence detection system, we have addressed the problem of selecting the most relevant features to define the scalp region of interest by including a wrapper-based classification block. Feature extraction is based on ERD/ERS measures computed in short intervals and is performed either on signals averaged over an ensemble of trials or on single-trial response signals, in order to carry out intersubject and intrasubject analyses, respectively. The subsequent wrapper classification stage is implemented using two different classifiers: an ensemble classifier, i.e., a random forest, and an SVM. The feature selection algorithm is wrapped around the classification algorithm, recursively identifying the features that do not contribute to the decision. These features are eliminated from the feature vector. This goal is achieved by applying an importance measure, which depends on the parameters of the classifier. The two variants of the system were implemented in MATLAB, also using some facilities of open source software tools like EEGLAB [32], as well as random forest and SVM packages [33].

**3. Materials and methods**

**3.1. Data set**

A total of 26 female volunteers participated in the study (age 18-62 years; mean = 24.19; SD = 10.46). Only adult women were chosen for this experiment in order to avoid gender differences [21, 34, 35]. All participants had normal or corrected-to-normal vision, and none of them had a history of severe medical treatment or of psychological or neurological disorders. This study was carried out in compliance with the Helsinki Declaration and its protocol was approved by the Department of Education of the University of Aveiro. All participants signed informed consents before their inclusion.

Each of the selected participants was comfortably seated at 70 cm from a computer screen (43.2 cm), alone in an enclosed room. The volunteer was verbally instructed to watch the pictures, which appeared at the center of the screen, and to stay quiet. No responses were required. The pictures were chosen from the IAPS repository. A total of 24 images with high arousal ratings (>6) were selected, 12 of them with positive affective valence (7.29 ± 0.65) and the other 12 with negative affective valence (1.47 ± 0.24). In order to match as closely as possible the levels of arousal between positive and negative valence stimuli, only high arousal pictures were shown, avoiding neutral pictures. **Figure 1** shows the representation of the stimuli in the arousal/valence space.

Three blocks with the same 24 images were presented consecutively, and the pictures belonging to each block were presented in a pseudorandom order. In each trial, a single fixation cross was presented at the center of the screen for 750 ms, after which an image was presented for 500 ms and, finally, a black screen for 2250 ms (total duration = 3500 ms). **Figure 2** shows a scheme of the experimental protocol.

**3.2. Feature extraction**

The signals (either single trials or averaged segments) are filtered by four 4th-order bandpass Butterworth filters. The *K* = 4 filters are applied following a zero-phase forward and reverse digital filtering methodology, so that no transient is included (see the *filtfilt* MATLAB function [36]). The four frequency bands are defined as *δ* = [0.5, 4] Hz, *θ* = [4, 7] Hz, *α* = [8, 12] Hz and *β* = [13, 30] Hz. From a technical point of view, the ERD/ERS computation significantly reduces the initial sample size per trial (800 features corresponding to the time instants) to a much smaller number, optimizing the design of the classifier. For each filtered signal, the ERD/ERS is estimated in *I* = 9 intervals following the stimulus onset, each with a duration of 150 ms and with 50% overlap between consecutive intervals. The reference interval corresponds to the 150 ms pre-stimulus period. For each interval, the ERD/ERS is defined as

$$f_{ik} = \frac{E_{rk} - E_{ik}}{E_{rk}} = 1 - \frac{E_{ik}}{E_{rk}} \tag{1}$$

where $E_{rk}$ represents the energy within the reference interval and $E_{ik}$ is the energy in the $i$th interval after the stimulus in the $k$th band, for $i = 1, 2, \ldots, 9$ and $k = 1, \ldots, 4$. Note that when $E_{rk} > E_{ik}$, $f_{ik}$ is positive; otherwise it is negative. Furthermore, the measure has an upper bound $f_{ik} \le 1$ because the energies are always positive. The energies $E_{ik}$ are computed by adding up the instantaneous energies within each of the $I = 9$ intervals of 150 ms duration, and $E_{rk}$ is estimated in a 150 ms interval defined in the pre-stimulus period. Generally, early post-stimulus components are associated with an increase of power in all bands due to the evoked potential contribution; this increase is followed by a general decrease (ERD), especially in the alpha band, which can be modulated by a perceptual enhancement in reaction to the relevant content of high arousal images [31].
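As an illustration of Eq. (1), the following MATLAB sketch computes the nine ERD/ERS values for a single channel and a single frequency band. It is a minimal sketch under stated assumptions: the sampling frequency `fs`, the segment layout (150 ms of pre-stimulus samples followed by the post-stimulus samples) and the function name `erd_ers_band` are illustrative choices, not the authors' code.

```matlab
function f = erd_ers_band(x, fs, band)
% ERD/ERS values of Eq. (1) for one channel and one frequency band.
% x    : column vector holding 150 ms of pre-stimulus samples followed by
%        the post-stimulus samples (single trial or ensemble average)
% fs   : sampling frequency in Hz (assumed to be known)
% band : [f_low f_high] band edges in Hz, e.g. [8 12] for the alpha band

% 4th-order zero-phase bandpass Butterworth filter (butter with order 2
% yields a bandpass of order 4); filtfilt avoids phase distortion.
[b, a] = butter(2, band / (fs/2), 'bandpass');
xf = filtfilt(b, a, x(:));

win = round(0.150 * fs);             % 150 ms analysis interval
hop = round(win / 2);                % 50% overlap between intervals
Er  = sum(xf(1:win).^2);             % reference energy E_rk (pre-stimulus)

f = zeros(1, 9);
for i = 1:9                          % I = 9 post-stimulus intervals
    s    = win + (i - 1) * hop + 1;  % first sample of the i-th interval
    Ei   = sum(xf(s:s + win - 1).^2);% interval energy E_ik
    f(i) = 1 - Ei / Er;              % Eq. (1): f_ik = 1 - E_ik / E_rk
end
end
```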

In summary, each valence condition can be characterized by $f_{ikc}$, where $i$ is the time interval, $k$ is the characteristic frequency band and $c$ refers to the channel. A total of $M = I \times K \times C = 9 \times 4 \times 21 = 756$ features is computed for the multichannel segments related to one condition. In the following, the features $f_{ikc}$ are concatenated into a feature vector with components $f_m$, $m = 1, \ldots, M$, with $M = 756$.
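Assembling the complete feature vector then amounts to looping over the *C* = 21 channels and *K* = 4 bands. A minimal sketch, reusing the hypothetical `erd_ers_band` helper above (the variable names are assumptions):

```matlab
% X  : samples x 21 matrix with the multichannel segment of one condition
% fs : sampling frequency in Hz (assumed)
bands = [0.5 4; 4 7; 8 12; 13 30];            % delta, theta, alpha, beta (K = 4)

Fikc = zeros(9, size(bands, 1), size(X, 2));  % I x K x C array of f_ikc values
for c = 1:size(X, 2)                          % channels
    for k = 1:size(bands, 1)                  % frequency bands
        Fikc(:, k, c) = erd_ers_band(X(:, c), fs, bands(k, :));
    end
end
f = Fikc(:)';                                 % concatenated feature vector f_m, M = 756
```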

#### **3.3. Classification using wrapper approaches**

The target of any feature selection method is the selection of the most pertinent feature subset, the one that provides the most discriminant information from the complete feature set. In the wrapper approach, the feature selection algorithm acts as a wrapper around the classification algorithm. In this case, the feature selection consists of searching for a relevant subset of features from a high-dimensional data set using the induction algorithm itself as part of the function that evaluates the feature subsets [37]. Hence, the parameters of the classifier serve as scores to select (or to eliminate) features, and the corresponding classification performance guides an iterative procedure. The recursive feature elimination strategy using a linear SVM-based classifier is a wrapper method usually called support vector machine recursive feature elimination (SVM-RFE) [38]. This strategy was introduced for data sets with a large number of features compared to the number of training examples [38], but it was recently applied to class-imbalanced data sets [39]. A similar strategy can be applied with other learning algorithms, for instance the random forest, which has an embedded feature selection mechanism. The random forest is an ensemble of binary decision trees whose training is achieved by randomly selecting subsets of features. Therefore, by computing from the parameters of the classifier a variable that reflects the importance of each input (feature), an iterative procedure can be developed. Assuming that this variable importance is $r_m$, the steps of the wrapper method are:

**1.** Initialize: create a set of indices *M* = {1, 2, …, *M*} relative to the available features and set *F* = *M*.

**2.** Organize the data set *X* by forming the feature vectors with the feature values whose index is in set *M*, labeling each feature vector according to the class it belongs to (negative or positive valence).

**3.** Compute the accuracy of the classifier using the leave-one-out (LOO) cross-validation strategy.

**4.** Compute the global model of the classifier using the complete data set *X*.

**5.** Estimate the importance $r_m$ of each feature from the parameters of the global model.
Accuracy is the proportion of true results (either positive or negative valence) in the test set. The leave-one-out strategy uses a single example of the data set as the test set while all the remaining examples form the training set. Training and testing are repeated so that every element of the data set is used once as the test set (step 3 of the wrapper method). Then, after computing the model of the classifier with the complete data, the importance of each feature is estimated (steps 4 and 5).
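A minimal MATLAB sketch of this leave-one-out estimate, with the classifier abstracted as a pair of function handles; the helper name and interface are assumptions made for illustration:

```matlab
function acc = loo_accuracy(X, y, trainFcn, predictFcn)
% Leave-one-out cross-validation: every example is used once as the test set.
% X : N x F feature matrix, y : N x 1 numeric labels (e.g. -1/+1 for valence)
N = size(X, 1);
correct = 0;
for n = 1:N
    idx = true(N, 1);  idx(n) = false;        % leave example n out
    mdl = trainFcn(X(idx, :), y(idx));        % train on the remaining N-1 examples
    correct = correct + (predictFcn(mdl, X(n, :)) == y(n));
end
acc = correct / N;                            % proportion of true results
end
```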

As mentioned before, the random forest and the linear SVM are classifiers that can be applied in a wrapper approach and used to estimate $r_m$. For convenience, the next two subsections review the relevant parameters of both classifiers and their relation to the variable importance mechanism.
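Putting the five steps together, the wrapper loop can be sketched as below. The classifier and its importance measure are passed in as function handles so that either the random forest (Section 3.3.1) or the linear SVM (Section 3.3.2) can be plugged in; the elimination of 20 features per iteration follows Section 3.3.1, while the stopping rule and the function name are illustrative assumptions rather than the authors' exact choices.

```matlab
function selected = wrapper_rfe(X, y, trainFcn, predictFcn, importanceFcn)
% Recursive feature elimination wrapped around a classifier.
% importanceFcn(mdl) must return one importance value r_m per current feature.
selected = 1:size(X, 2);                         % step 1: all feature indices
while numel(selected) > 20                       % illustrative stopping rule
    Xs  = X(:, selected);                        % step 2: current feature set
    acc = loo_accuracy(Xs, y, trainFcn, predictFcn);      % step 3: LOO accuracy
    fprintf('%4d features -> LOO accuracy %.3f\n', numel(selected), acc);

    mdl = trainFcn(Xs, y);                       % step 4: model on the complete set
    r   = importanceFcn(mdl);                    % step 5: importance r_m

    [~, order] = sort(r, 'ascend');              % drop the 20 least relevant features
    selected(order(1:20)) = [];
end
end
```

For instance, with the linear SVM of Section 3.3.2 the handles could be `@(Xtr,ytr) fitcsvm(Xtr,ytr,'KernelFunction','linear','BoxConstraint',1)`, `@predict` and `@(mdl) abs(mdl.Beta)`.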

#### *3.3.1. Random forest*

The random forest algorithm, developed by Breiman [40], is a set of binary decision trees, each performing a classification, with the final decision taken by majority voting. Each tree is grown using a bootstrap sample of the original data set, and each node of the tree randomly selects a small subset of features for a split. An optimal split separates the set of samples of the node into two more homogeneous (pure) subgroups with respect to the class of their elements. A measure of the impurity level is the Gini index. Considering that $\omega_c$, $c = 1, \ldots, C$ are the labels given to the classes, the Gini index of node $i$ is defined as

$$G(i) = 1 - \sum_{c=1}^{C} \left( P(\omega_c) \right)^2 \tag{2}$$

where $P(\omega_c)$ is the probability of class $\omega_c$ in the set of examples that belong to node $i$. Note that $G(i) = 0$ when node $i$ is pure, i.e., when its data set contains only examples of one class. To perform a split, one feature $f_m$ is tested ($f_m > f_0$) on the set of samples of the node, which has $n$ elements and is then divided into two groups (left and right) with $n_l$ and $n_r$ elements. The change in impurity is computed as

$$\Delta G(i) = G(i) - \left( \frac{n_l}{n} G(i_l) + \frac{n_r}{n} G(i_r) \right) \tag{3}$$

The feature and threshold value that result in the largest decrease of the Gini index are chosen to perform the split at node $i$. Each tree is grown independently, using random feature selection to decide the splitting test of each node, and no pruning is done on the grown trees. The main steps of this algorithm are

**1.** Given a data set with *N* examples, each described by *F* features, select the number *T* of trees, the dimension *L* < *F* of the feature subsets and the parameter that controls the size of the trees (this can be the maximum depth of the tree or the minimum number of examples a node must contain to perform a split).

**2.** For each tree *t* = 1, …, *T*:

	- **a.** Create a training set $T_t$ with *N* examples by sampling with replacement from the original data set. The out-of-bag data set $O_t$ is formed with the remaining examples that do not belong to $T_t$.
	- **b.** Perform the split of node *i* by testing the $L = \sqrt{F}$ randomly selected features.
	- **c.** Repeat step 2b until the tree *t* is complete. A node becomes a terminal node (leaf) when its number of examples $n_s$ satisfies $n_s \le 0.1N$.
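For reference, a forest with these ingredients can be grown with the `TreeBagger` class of MATLAB's Statistics and Machine Learning Toolbox. This is only a sketch of how the steps above map onto the toolbox options; the number of trees and the leaf-size rule are assumed values, not the authors' settings.

```matlab
% X : N x F feature matrix, y : N x 1 class labels
T       = 500;                           % number of trees (assumed value)
L       = round(sqrt(size(X, 2)));       % features tested at each split (step 2b)
minLeaf = ceil(0.1 * size(X, 1));        % rough mapping of the n_s <= 0.1N rule (step 2c)

rf = TreeBagger(T, X, y, 'Method', 'classification', ...
                'NumPredictorsToSample', L, 'MinLeafSize', minLeaf);
yhat = predict(rf, X);                   % final decision by majority voting
```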

After training, the importance $r_m$ of each feature $f_m$ in the ensemble of trees can be computed by adding up the values of $\Delta G(i)$ over all nodes $i$ where the feature $f_m$ is used to perform a split. Sorting the values $r_m$ in decreasing order makes it possible to identify the relative importance of the features. The 20 least relevant features are then eliminated from the feature vector $f$.
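The split criterion of Eqs. (2) and (3) is straightforward to reproduce. The sketch below evaluates the impurity decrease for one candidate feature and threshold, which is the quantity each node maximizes over its randomly selected features and whose sum over all splits on a feature gives $r_m$; the function names and the assumption of numeric labels are illustrative.

```matlab
function dG = gini_gain(fm, y, f0)
% Impurity decrease of Eq. (3) for splitting on feature values fm at threshold f0.
% fm : values of one feature for the n examples reaching the node
% y  : numeric class labels of those examples
left = fm <= f0;                                 % left / right child partitions
n = numel(y);  nl = nnz(left);  nr = n - nl;
dG = gini(y) - (nl/n) * gini(y(left)) - (nr/n) * gini(y(~left));
end

function g = gini(y)
% Gini index of Eq. (2): 1 - sum_c P(w_c)^2 over the classes present in y.
classes = unique(y);
p = arrayfun(@(c) mean(y == c), classes);
g = 1 - sum(p.^2);
end
```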

#### *3.3.2. Linear SVM*

Linear SVM parameters define decision hyperplanes or hypersurfaces in the multidimensional feature space [41, 42], that is:

$$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b = 0 \tag{4}$$

where *x* ≡ *f* denotes the vector of features, *w* is known as the weight vector and *b* is the threshold.

The optimization task consists of finding the unknown parameters $w_m$, $m = 1, \ldots, F$, and *b* [43]. The position of the decision hyperplane is determined by the vector *w* and by *b*: the vector is orthogonal to the decision plane and *b* determines its distance to the origin. For the linear SVM the vector *w* can be computed explicitly, which is an advantage as it decreases the complexity of the test phase. The optimization algorithm estimates the Lagrange multipliers $0 \le \lambda_i \le C$ [43]; the training examples associated with nonzero Lagrange multipliers are known as the support vectors. The weight vector can then be computed as

$$\mathbf{w} = \sum_{i=1}^{N_s} y_i \lambda_i \mathbf{x}_i \tag{5}$$

where $N_s$ is the number of support vectors and $(\mathbf{x}_i, y_i)$ is a support vector and its corresponding label in {-1, 1}. The threshold *b* is estimated as an average over the projected support vectors $\mathbf{w}^T\mathbf{x}_i$ corresponding to nonzero Lagrange multipliers. The value of *C* must be assigned before running the training optimization algorithm and controls the trade-off between the number of errors allowed and the margin width. During the optimization process, *C* is the weight of the penalty term of the objective function that accounts for misclassification errors in the training set. There is no optimal procedure to assign this parameter, but it is to be expected that:


- a large value of *C* strongly penalizes training errors, which tends to reduce the number of misclassified training examples at the cost of a narrower margin;
- a small value of *C* tolerates more misclassifications in exchange for a wider margin.

Note that this is important because, in a real application, linearly separable problems are not to be expected and it is more realistic to perform an optimization in which misclassifications are allowed. In the following simulations, the parameter is set to *C* = 1 and the MATLAB software is used [44].

The relevance of the $m$th entry of the feature vector is then determined by the corresponding value $w_m$ in the weight vector. In particular, if $|w_m| \approx 0$, the corresponding feature contributes little to the value of $g(\mathbf{x})$ [38]. Then, setting $r_m \equiv w_m$ for the SVM classifier and sorting the absolute values, the importance of the features is obtained.
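A minimal sketch of this ranking with `fitcsvm` (Statistics and Machine Learning Toolbox), which exposes the weight vector of Eq. (5) as the `Beta` property for a linear kernel; the variable names are assumptions:

```matlab
% X : N x F feature matrix, y : N x 1 labels in {-1, +1}
mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 1);   % C = 1
w   = mdl.Beta;                          % weight vector w of Eq. (5)
b   = mdl.Bias;                          % threshold b of Eq. (4)

r = abs(w);                              % importance r_m = |w_m|
[~, order] = sort(r, 'ascend');
leastRelevant = order(1:20);             % candidates for elimination (steps 4-5)
```

This importance measure is exactly what can be passed, as a function handle, to the wrapper loop sketched in Section 3.3.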
